Lesson 34: Seaborn and data display

(c) 2016 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
rc={'lines.linewidth': 2, 'axes.labelsize': 18, 'axes.titlesize': 18}
sns.set(rc=rc)

# The following is specific to Jupyter notebooks
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}

Now that we have some experience with Pandas, we can use Pandas DataFrames along with Seaborn for more plotting applications. We'll start by looking at the striking forces of the frogs we studied last time.

We'll load the data and rename the 'impact force (mN)' column to be impf for convenience.

In [2]:
# Load the data
df = pd.read_csv('data/frog_tongue_adhesion.csv', comment='#')

# Rename impact force column
df = df.rename(columns={'impact force (mN)': 'impf'})

Making a bar graph with Matplotlib

Let's say we want to make a bar graph of the impact forces of the four frogs we tested. First we'll do it by selecting items from the DataFrame and hand-building a bar graph using Matplotlib.

If we want the mean impact force for each frog, we can select the frog and compute the mean.

In [3]:
# Mean impact force of frog I
np.mean(df.loc[df['ID']=='I', 'impf'])
Out[3]:
1530.2

So, let's write a for loop to compute all of these means and the standard errors of the mean (SEMs).

In [4]:
mean_impf = np.empty(4)
sem_impf = np.empty(4)
for i, frog in enumerate(['I', 'II', 'III', 'IV']):
    mean_impf[i] = np.mean(df.loc[df['ID']==frog, 'impf'])
    n = np.sum(df['ID']==frog)
    sem_impf[i] = np.std(df.loc[df['ID']==frog, 'impf']) / np.sqrt(n)
    
print(mean_impf)
print(sem_impf)
[ 1530.2    707.35   550.1    419.1 ]
[ 137.35063888   92.53359593   27.08485739   51.18749359]

We can compute the means and SEMs of all four frogs at once using the groupby() method of a DataFrame, and then calling the mean() and sem() methods on the 'impf' column of the resulting DataFrameGroupBy object. I show it here, but we will not go into these more advanced Pandas features in the bootcamp.

In [5]:
gb_frog = df.groupby('ID')
mean_impf = gb_frog['impf'].mean()
sem_impf = gb_frog['impf'].sem()

print(mean_impf)
print(sem_impf)
ID
I      1530.20
II      707.35
III     550.10
IV      419.10
Name: impf, dtype: float64
ID
I      140.918782
II      94.937466
III     27.788477
IV      52.517260
Name: impf, dtype: float64

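The SEMs differ slightly from those computed in the for loop because the standard deviations are calculated differently. By default, np.std() computes the population standard deviation (dividing by n), whereas the sem() method uses the sample standard deviation (dividing by n - 1).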
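To make the difference concrete, here is a minimal sketch using made-up numbers (not the frog data). It assumes only the default behaviors of np.std() and the Pandas sem() method, and uses the ddof kwarg of np.std() to switch between the two conventions.

```python
import numpy as np
import pandas as pd

# Made-up measurements for illustration only (not the frog data)
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

# NumPy's default is the population standard deviation (ddof=0)
sem_population = np.std(x.values) / np.sqrt(n)

# Pandas' sem() uses the sample standard deviation (ddof=1) by default
sem_sample = x.sem()

# Passing ddof=1 to np.std() reproduces the Pandas result
sem_numpy_ddof1 = np.std(x.values, ddof=1) / np.sqrt(n)

# The two conventions differ by a factor of sqrt(n / (n - 1))
ratio = sem_sample / sem_population

print(sem_population, sem_sample, sem_numpy_ddof1, ratio)
```

For frog I above, the ratio of the two printed SEMs is 140.92 / 137.35 ≈ 1.026, which is what we would expect from sqrt(20 / 19), consistent with 20 measurements per frog.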

Now that we have our means and SEMs, we can plot a bar graph using Matplotlib. We need to specify the x-positions of the bars and their heights. We have to pass lots of kwargs to make the plot look reasonable, like yerr=sem_impf to get the error bars, ecolor='black' to make the error bars black, tick_label=['I', 'II', 'III', 'IV'] to specify the names of the frogs, and align='center' to center the bars (and their tick labels) on the x-positions.

In [6]:
plt.bar(np.arange(4), mean_impf, yerr=sem_impf, ecolor='black',
        tick_label=['I', 'II', 'III', 'IV'], align='center')
plt.ylabel('impact force (mN)')
Out[6]:
<matplotlib.text.Text at 0x11678a6d8>

Bar graphs with Seaborn

If you have your data in a tidy DataFrame, Seaborn makes a lot of this plotting much easier for you. Watch this one-liner (with a few extra lines to get the axis labels right).

In [7]:
sns.barplot(data=df, x='ID', y='impf')
plt.xlabel('')
plt.ylabel('impact force (mN)')
Out[7]:
<matplotlib.text.Text at 0x11916bb38>

Yep, those for loops (or groupbys) were not necessary. We just specify the x and y values and the DataFrame that is the source of our data, and Seaborn does the rest. It is a powerful package for visualization of data using Matplotlib.
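This works because df is tidy: each row is a single measurement, with one column saying which frog it came from. If your data instead come in a wide table with one column per frog, you can reshape them first. Here is a minimal sketch using pd.melt(); the DataFrame df_wide and its numbers are hypothetical and only for illustration.

```python
import pandas as pd

# Hypothetical wide-format table: one column per frog (made-up numbers)
df_wide = pd.DataFrame({
    'I': [1200.0, 1500.0, 1800.0],
    'II': [600.0, 700.0, 800.0],
})

# Reshape to tidy format: one row per measurement,
# with the old column names stored in a new 'ID' column
df_tidy = pd.melt(df_wide, var_name='ID', value_name='impf')

print(df_tidy)
```

The resulting df_tidy has the same 'ID' and 'impf' layout we used above, so it could be passed to Seaborn in the same way, e.g., sns.barplot(data=df_tidy, x='ID', y='impf').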

Don't make bar graphs

Bar graphs, especially with error bars, are typically awful. They are pervasive in biology papers. I have yet to find a single example where a bar graph is the best choice. Bee swarm plots or box plots, shown below, are more informative and almost always preferred. Here is a simple message:


Don't make bar graphs.

What should I do instead, you ask? Well, read on, my loyal bootcamper!

Bee swarm plots

When you can, plot all of your data. Bee swarm plots are a great way to do this. They are best seen by example. The syntax is exactly the same as for the bar graphs.

In [8]:
sns.swarmplot(data=df, x='ID', y='impf')
plt.margins(0.02)
plt.xlabel('')
plt.ylabel('impact force (mN)')
Out[8]:
<matplotlib.text.Text at 0x1191f2208>

Ah, now we see all of the data. Very useful! We can see the spread in Frog I, and we can see that Frog II is capable of hitting hard, but generally is lazy.

Now, we might be concerned that the frogs are hitting with different impact forces on different days. We can check this by coloring each point in the bee swarm plot according to the date of the measurement, which we specify with the hue kwarg.

In [9]:
ax = sns.swarmplot(data=df, x='ID', y='impf', hue='date')
ax.legend_.remove()
plt.margins(0.02)
plt.xlabel('')
plt.ylabel('impact force (mN)')
Out[9]:
<matplotlib.text.Text at 0x119257eb8>

There does not seem to be any particular trend in the data depending on which day the measurements were taken. The frogs do not seem to care.

Box plots

Sometimes we have too many points to plot a bee swarm plot. When this is the case, a box plot is a good alternative. That is not the case here, but we'll make a box plot anyhow to demonstrate how it works. Actually, the syntax is exactly the same as for bee swarm plots.

In [10]:
sns.boxplot(data=df, x='ID', y='impf')
plt.margins(0.02)
plt.xlabel('')
plt.ylabel('impact force (mN)')
Out[10]:
<matplotlib.text.Text at 0x11939ce10>

Box plots provide a nice summary of the data, giving substantially more information than a bar graph. The line in the middle of the box is the median, and the top and bottom of the box are at the 75th and 25th percentiles, respectively. The distance between the 25th and 75th percentiles is called the interquartile range, or IQR. The whiskers of the box plot extend a distance equal to 1.5 times the interquartile range, or to the extent of the data, whichever is less extreme. If data points are more extreme than the whiskers, they are shown individually, and are often referred to as outliers.
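To see where these numbers come from, here is a minimal sketch that computes the box plot summary by hand with NumPy for a made-up data set (not the frog data). It assumes the common convention, which to my understanding is also Matplotlib's default, that each whisker ends at the most extreme data point lying within 1.5 times the IQR of the box.

```python
import numpy as np

# Made-up data for illustration only
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

# Bottom of box, median line, and top of box
q25, median, q75 = np.percentile(data, [25, 50, 75])
iqr = q75 - q25

# Farthest a whisker is allowed to reach from the box
low_limit = q25 - 1.5 * iqr
high_limit = q75 + 1.5 * iqr

# Whiskers end at the most extreme data points inside those limits
whisker_low = data[data >= low_limit].min()
whisker_high = data[data <= high_limit].max()

# Anything beyond the limits is plotted individually as an outlier
outliers = data[(data < low_limit) | (data > high_limit)]

print(q25, median, q75, iqr)
print(whisker_low, whisker_high, outliers)
```

Here the box spans 3.25 to 7.75 with the median at 5.5, the whiskers end at 1 and 9, and the point at 30 is flagged as an outlier.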

I think you now see that Seaborn enables you to make beautiful graphics without too much pain. Enjoy using it in your research!