By Alex Olteanu, Student Success Specialist at Dataquest.io
To follow along, you'll need at least some basic knowledge of Python. If you know the difference between methods and attributes, then you're good to go.
Introducing the dataset
import pandas as pd

direct_link = 'http://www.randalolson.com/wp-content/uploads/percent-bachelors-degrees-women-usa.csv'
women_majors = pd.read_csv(direct_link)

print(women_majors.info())
women_majors.head()
Besides the Year column, every other column name indicates the subject of a Bachelor's degree. Every data point in the Bachelor columns represents the percentage of Bachelor's degrees conferred to women. Thus, every row describes the percentage of various Bachelor's degrees conferred to women in a given year.
As mentioned before, we have data from 1970 to 2011. To confirm the latter limit, let's print the last five rows of the dataset by using the tail() method.
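As a minimal sketch of that check, the snippet below calls tail() on a small stand-in DataFrame rather than the downloaded file. The column "Major A" and its values are invented purely for illustration; with the real data you would call tail() on the women_majors object loaded earlier.

```python
import pandas as pd

# Stand-in for the real dataset: one row per year from 1970 to 2011.
# "Major A" and its values are made up for illustration only.
years = list(range(1970, 2012))
women_majors = pd.DataFrame({
    "Year": years,
    "Major A": [10.0 + 0.5 * i for i in range(len(years))],
})

# tail() returns the last five rows by default
last_rows = women_majors.tail()
print(last_rows)
```

If the last value in the Year column is 2011, the upper limit of the time span is confirmed.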
The context of our FiveThirtyEight graph
Almost every FTE graph is part of an article. The graphs complement the text by illustrating a little story, or an interesting idea. We'll need to be mindful of this while replicating our FTE graph.
To avoid digressing from our main task in this tutorial, let's just pretend we've already written most of an article about the evolution of gender disparity in US education. We now need to create a graph to help readers visualize the evolution of gender disparity for Bachelors where the situation was really bad for women in 1970. We've already set a threshold of 20%, and now we want to graph the evolution for every Bachelor where the percentage of women graduates was less than 20% in 1970.
Let's first identify those specific Bachelors. In the following code cell, we will:
- Use .loc, a label-based indexer, to:
  - select the first row (the one that corresponds to 1970);
  - select the items in the first row only where the values are less than 20; the Year field will be checked as well, but will not be included because 1970 is much greater than 20.
- Assign the resulting content to under_20.
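The selection described above could look roughly like the sketch below. It runs on a tiny stand-in DataFrame whose column names ("Major A", "Major B", "Major C") and values are hypothetical; on the real data the same .loc expression would be applied to women_majors as loaded from the CSV file.

```python
import pandas as pd

# Hypothetical stand-in data: the column names and values are invented.
women_majors = pd.DataFrame({
    "Year": [1970, 1971],
    "Major A": [4.2, 5.5],
    "Major B": [74.5, 74.1],
    "Major C": [0.8, 1.0],
})

# The first row (label 0) corresponds to 1970
first_row = women_majors.loc[0]

# Keep only the items of the first row whose value is below the 20% threshold.
# Year is checked too, but 1970 is not less than 20, so it drops out.
under_20 = women_majors.loc[0, first_row < 20]
print(under_20)
```

The result is a Series whose index labels are the names of the columns that passed the threshold, which is what we need later when choosing the columns to plot.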
Using matplotlib's default style
Let's begin working on our graph. We'll first take a peek at what we can build by default. In the following code block, we will:
- Run the Jupyter magic %matplotlib to enable Jupyter and matplotlib to work together effectively, and add inline to have our graphs displayed inside the notebook.
- Plot the graph by using the plot() method on women_majors. We pass the following parameters to plot():
  - x - specifies the column from women_majors to use for the x-axis;
  - y - specifies the columns from women_majors to use for the y-axis; we'll use the index labels of under_20, which are stored in the .index attribute of this object;
  - figsize - sets the size of the figure as a tuple with the format (width, height) in inches.
- Assign the plot object to a variable named under_20_graph, and print its type to show that pandas uses matplotlib objects under the hood.
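Here is a sketch of these steps that also runs as a plain script. It makes a few assumptions: the data is a hypothetical stand-in (invented column names and values), the Agg backend replaces the %matplotlib inline magic so the code works outside Jupyter, the index labels are converted to a plain list before being passed as y, and the figsize of (12, 8) is just one reasonable choice.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend for running outside Jupyter
import pandas as pd

# In a Jupyter notebook you would run this magic instead of matplotlib.use():
# %matplotlib inline

# Hypothetical stand-in data: the column names and values are invented.
women_majors = pd.DataFrame({
    "Year": [1970, 1971, 1972],
    "Major A": [4.2, 5.5, 7.4],
    "Major B": [74.5, 74.1, 73.6],
    "Major C": [0.8, 1.0, 1.2],
})
under_20 = women_majors.loc[0, women_majors.loc[0] < 20]

# x: the column for the x-axis; y: the columns whose 1970 value was below 20;
# figsize: (width, height) in inches
under_20_graph = women_majors.plot(
    x="Year",
    y=under_20.index.tolist(),
    figsize=(12, 8),
)

# The returned object is a matplotlib Axes
print(type(under_20_graph))
```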
Using matplotlib's fivethirtyeight style
The graph above has certain characteristics, like the width and color of the spines, the font size of the y-axis label, the absence of a grid, etc. All of these characteristics make up matplotlib's default style.
As a brief aside, it's worth mentioning that we'll use a few technical terms about the parts of a graph throughout this post. If you feel lost at any point, you can refer to the legend below.
Besides the default style, matplotlib comes with several built-in styles that we can use readily. To see a list of the available styles, we will:
- Import the matplotlib.style module under the name style.
- Explore the content of matplotlib.style.available (a predefined variable of this module), which contains a list of all the available built-in styles.
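A short sketch of these two steps follows. The exact contents of the list depend on the installed matplotlib version, so no output is shown here.

```python
import matplotlib.style as style

# style.available is a list of the names of the built-in styles
available_styles = style.available
print(available_styles)
```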
You might have already observed that there's a built-in style called fivethirtyeight. Let's use this style, and see where that leads. For that, we'll use the aptly named use() function from the same matplotlib.style module (which we imported under the name style). Then we'll generate our graph using the same code as earlier.
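Put together, this could look like the sketch below. As before, the DataFrame is a hypothetical stand-in with invented column names and values, and the Agg backend is an assumption that lets the code run outside Jupyter. Note that style.use() changes matplotlib's global settings, so every graph created afterwards picks up the new style.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend for running outside Jupyter
import matplotlib.style as style
import pandas as pd

# Switch to the built-in fivethirtyeight style for all subsequent graphs
style.use("fivethirtyeight")

# Hypothetical stand-in data: the column names and values are invented.
women_majors = pd.DataFrame({
    "Year": [1970, 1971, 1972],
    "Major A": [4.2, 5.5, 7.4],
    "Major B": [74.5, 74.1, 73.6],
    "Major C": [0.8, 1.0, 1.2],
})
under_20 = women_majors.loc[0, women_majors.loc[0] < 20]

# Same plotting call as with the default style
fte_graph = women_majors.plot(
    x="Year",
    y=under_20.index.tolist(),
    figsize=(12, 8),
)
```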
Wow, that's a major change! Compared with our first graph, this one has a different background color, it has grid lines, there are no spines whatsoever, the weight and the font size of the major tick labels are different, etc.
You can read a technical description of the fivethirtyeight style here - it should also give you a good idea about what code runs under the hood when we use this style. The author of the style sheet, Cameron Davidson-Pilon, discusses some of the characteristics here.
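If you'd rather inspect the style from Python, one option is the library dictionary of the matplotlib.style module, which maps each style name to the settings it defines. The sketch below simply prints whatever the installed matplotlib version ships for fivethirtyeight; the variable name fte_settings is our own.

```python
import matplotlib.style as style

# style.library maps each style name to the settings that style overrides
fte_settings = style.library["fivethirtyeight"]

# Print every setting defined by the style sheet, one per line
for name in sorted(fte_settings):
    print(f"{name}: {fte_settings[name]}")
```

Entries such as axes.grid and axes.facecolor account for the grid lines and the background color noted above.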
For more on generating FiveThirtyEight graphs in Python, see the rest of the original article here.
Bio: Alex Olteanu is a Student Success Specialist at Dataquest.io. He enjoys learning and sharing knowledge, and is getting ready for the new AI revolution.
Original. Reposted with permission.