Lesson 21: Introduction to Matplotlib: plotting a histogram

(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import numpy as np

# This is how we import the module of Matplotlib we'll be using
import matplotlib.pyplot as plt

# The following is specific Jupyter notebooks
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}

# In our IPython terminal do:
# %matplotlib

We had a nice data set in the last tutorial, the cross-sectional areas of eggs from different mothers with different feeding conditions. While it is instructive to look at values, such as medians, etc., computed from NumPy arrays with the data, we would of course like to plot the results. The Matplotlib package is the central plotting software in the SciPy stack. In this lesson, we will explore its capabilities and API. Seaborn is another great package that allows for nice formatting of Matplotlib plots, in addition to other capabilies, that you should have installed in Lesson 0, and we will cover in more depth in a later lesson.

Importantly, there are many other great Python-based plotting tools. In particular, Bokeh makes beautiful, interactive, browser-based plots. It is my plotting software of choice.

Ok, let's start using Matplotlib!

Importing Matplotlib

Most of the plotting you will do, and all of the plotting we do in bootcamp, will use Matplotlib's pyplot module. Like NumPy, this module is pervasive, and the custom is to import it like this:

import matplotlib.pyplot as plt

This is what we did at the beginning of this lesson. For some special types of plotting, you will need to import other Matplotlib modules, but we will not do that in the bootcamp.

We will also be making heavy use of Seaborn, always for styling. Seaborn is customarily imported as sns, or

import seaborn as sns

We will import it later in this tutorial to compare plot styling with and without Seaborn's defaults.

First example: making a histogram

Let's load in our egg cross-sectional area data again.

In [2]:
# Load in data
xa_high = np.loadtxt('data/xa_high_food.csv', comments='#')
xa_low = np.loadtxt('data/xa_low_food.csv', comments='#')

For our first plot, we'll make a histogram using Matplotlib.

In [3]:
# Set up a figure with set of axes
fig, ax = plt.subplots(1, 1)

# Add axis labels
ax.set_xlabel('Cross-sectional area (µm$^2$)')
ax.set_ylabel('count');

# Generate the histogram for the low-density fed mother
_ = ax.hist(xa_low)

# You need this if you didn't use %matplotlib in IPython shell
# plt.show()

Before we go through the plotting commands, I want to focus on how we get the plot to show. If you are working interactively in an IPython shell, which you often are for plotting, since you are exploring data, you definitely want to make the plotting windows interactive. In your IPython shell, do this:

%matplotlib

This magic functions lets you have multiple plots open at once and also allows you to continue using IPython while a plot is being displayed. If you do not do this option, you need the last line, plt.show(). This tells Matplotlib to render the plot you made in an interactive window. The window will stay open, and your IPython session frozen.

Now, let's go through the plotting line-by-line. Our strategy is to set up a set of axes in a figure, and then populate those axes with graphical representations of our data. The first line sets up the figure you want to create.

fig, ax = plt.subplots(1, 1)

The plt.subplots() function returns a Figure object which contains the graphics you want to create as well as an AxesSubplot object, which contains the axes you want to plot on. By choosing arguments 1 and 1 in our call to plt.subplots(), we are specifying that we want a 1$\times$1 grid of subplots, i.e., a single plot. Importantly, the plt.subplots() function also has a figsize kwarg, which allows you to set the size of the figure you are generating. We will see how to use this in the next tutorial.

Next, we label the axes using the set_xlabel() and set_ylabel() methods of the AxesSubplot object.

ax.set_xlabel('Cross-sectional area (µm$^2$)')
ax.set_ylabel('count');

The dollar signs around the ^2 tell Matplotlib that we invoke $\LaTeX$ to render the string. $\LaTeX$ is a type-setting program that is very useful for displaying mathematical equations.

Now that we have the "canvas" on which we will "paint" our data set, up, we can choose the representation of our data. For this example, we will make a histogram.

_ = plt.hist(xa_low)

We put the "underscore equals" in front of the function call because plt.hist() returns a tuple of NumPy arrays containing the bins and counts for the histogram. Because it is of no use to us, we just assign it the dummy variable _. The argument is the data set we want to compute and plot a histogram for.

Tweaking the defaults

I would argue that the bars should lie at tick marks to make things more clear. I.e., we should bin the data to be between 1700 and 1750, 1750 and 1800, and so on. We can specify the bins we want in our call to plt.hist(). The np.arange() function helps with this.

In [4]:
# Make bin boundaries
bins = np.arange(1700, 2501, 50)

Remember, np.arange(start, stop, stride) generates evenly spaced points going from start to stop (exclusively, just like indexing) with a stride of stride. We can then use these bin boundaries with the bins kwarg. If stride is not given, the default is stride = 1. If only one argument is given, start is assumed to be zero and stride is assumed to be 1.

In [5]:
# Set up a figure with set of axes
fig, ax = plt.subplots(1, 1)

# Add axis labels
ax.set_xlabel('Cross-sectional area (µm$^2$)')
ax.set_ylabel('count');

# Generate the histogram for the low-density fed mother
_ = ax.hist(xa_low, bins=bins)

Note that we re-instantiated the Figure and AxesSubplot instances. We could have cleared the data off of the plot and replotted the histogram using the ax.cla() method, which clears the axes. However, this clears everything off the axes, including the axis labels, so we might as well just re-instantiate.

Using Seaborn to make it look pretty

Seaborn is a useful package for making plots look pretty (and also for doing much more, like making statistical plots). To invoke Seaborn, we need to import it. We can then set the properties of the plots using sns.set(). Henceforth, we will do these things at the beginning of our lessons.

In [6]:
import seaborn as sns
sns.set()

# Set up a figure with set of axes
fig, ax = plt.subplots(1, 1)

# Add axis labels
ax.set_xlabel('Cross-sectional area (µm$^2$)')
ax.set_ylabel('count');

# Generate the histogram for the low-density fed mother
_ = ax.hist(xa_low, bins=bins)

Much nicer! I actually think the axis labels are a big small, especially if they are going to be used in a talk, so I like to set them larger. Furthermore, while I do like the dark background, I have actually come to prefer a white background. I also really like the default color cycle of Matplotlib 2.0, which is overridden by Seaborn. So, I like to set things up to make plots to my liking using sns.set()'s highly configurable options.

In [7]:
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728',
          '#9467bd', '#8c564b', '#e377c2', '#7f7f7f',
          '#bcbd22', '#17becf']
sns.set(style='whitegrid', palette=colors, rc={'axes.labelsize': 16})

# Set up a figure with set of axes
fig, ax = plt.subplots(1, 1)

# Add axis labels
ax.set_xlabel('Cross-sectional area (µm$^2$)')
ax.set_ylabel('count');

# Generate the histogram for the low-density fed mother
_ = ax.hist(xa_low, bins=bins)

That's better! Going forward, we will always have the Seaborn settings above set at the very beginning of the lessons. Note that you can configure the aesthetics of your plots to your own liking. These are just my preferences.

Plotting the two histograms together

It would be nice to compare the histograms of the to data sets, cross-sectional areas for high food and for low food. This is done quite intuitively; we simply pass a tuple containing the two Numpy arrays we want made into a paired histogram.

In [8]:
# Reset bins, since xa_low has smaller values
bins = np.arange(1600, 2501, 50)

# Set up a figure with set of axes
fig, ax = plt.subplots(1, 1)

# Add axis labels
ax.set_xlabel('Cross-sectional area (µm$^2$)')
ax.set_ylabel('count');

# Generate the histogram for the low-density fed mother
_ = ax.hist((xa_low, xa_high), bins=bins)

# Add a legend
ax.legend(('low', 'high'), loc='upper right');

We passed two arrays into plt.hist() as a tuple, and it automatically made the two histograms. Notice that we also added a legend to the upper corner using plt.legend().

I actually think this style of displaying a histogram is hideous. The bins have now become ambiguous. Here's how I would do it (not really, though, as we'll see in later exercises).

In [10]:
# Set up a figure with set of axes
fig, ax = plt.subplots(1, 1)

# Add axis labels
ax.set_xlabel('Cross-sectional area (µm$^2$)')
ax.set_ylabel('count');

# Generate the histogram for the low-density fed mother
_ = ax.hist(xa_low, normed=True, bins=bins, alpha=0.5)
_ = ax.hist(xa_high, normed=True, bins=bins,alpha=0.5)

# Add a legend
plt.legend(('low', 'high'), loc='upper right');

The bin boundaries are again clear. I set the kwarg alpha to 0.5, which says I want the fill of the histogram to only 50% opaque to allow visualization of the overlap of the histogams. I also set normed=True to normalize the histogram, since there were an unequal number of eggs for the low and high-fed mothers.

There are many plotting options!

One of the main aims I had in this example is that there are many many options available for making plots and stylizing them through various functions and kwargs. The Matplotlib wesbite is chuck full of examples and good documentation. You should refer to it extensively as you prepare your plots!

Saving figures

Of course, just displaying your figures is not enough. You will want to put them in documents! So, you need to save your figure. In general, you should save your figures as vector graphics and not raster graphics. (There are specific instances where raster graphics are appropriate, but for most applications for static presentation of scientific data, vector graphics are better.)

Two common vector graphics formats that Matplotlib can write out are SVG (scalable vector graphics) and PDF (portable document format). To save a figure in a file named fig.pdf, the syntax is as simple as

plt.savefig('fig.pdf')

Similarly to save an SVG, it's simply

plt.savefig('fig.svg')

JB's personal view

Actually, my view is that vector graphics are also not ideal, at least not for plotting any substantial amount of data. Interactive graphics are much preferred, and we now have the technology to do that. You have interacted with your Matplotlib plots via plotting windows, and for many data sets and/or functions, this is very useful. At the least, you can zoom in on data.

There are packages that allow you to do this, and more, in a web browser. You can also do things like have hover-over information. You may have seen this kind of interactive graphic in places like the New York Times; data journalists have been doing this for years. I think this is the natural extension for scientific plotting. In the future as well, I think the PDF format for scientific papers will die and things will become interactive.

My favorite package for generating this sort of thing is Bokeh. If this really excites you, you can play with Bokeh in the exercises tonight.