(c) 2016 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This tutorial was generated from a Jupyter notebook. You can download the notebook here.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
rc={'lines.linewidth': 2, 'axes.labelsize': 18, 'axes.titlesize': 18}
sns.set(rc=rc)
# The following is specific to Jupyter notebooks
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}
Now that we have some experience with Pandas, we can use Pandas DataFrames along with Seaborn for more plotting applications. We'll start by looking at the striking forces of the frogs we studied last time.
We'll load the data and rename the 'impact force (mN)' column to impf for convenience.
# Load the data
df = pd.read_csv('data/frog_tongue_adhesion.csv', comment='#')
# Rename impact force column
df = df.rename(columns={'impact force (mN)': 'impf'})
Let's say we want to make a bar graph of the impact forces of the four frogs we tested. First we'll do it by selecting items from the DataFrame and hand-building a bar graph using Matplotlib.
If we want the mean impact force for each frog, we can select the frog and compute the mean.
# Mean impact force of frog I
np.mean(df.loc[df['ID']=='I', 'impf'])
So, let's write a for loop to compute all of these means and the standard errors of the mean (SEMs).
mean_impf = np.empty(4)
sem_impf = np.empty(4)
for i, frog in enumerate(['I', 'II', 'III', 'IV']):
    mean_impf[i] = np.mean(df.loc[df['ID']==frog, 'impf'])
    # Number of measurements for this frog
    n = np.sum(df['ID']==frog)
    sem_impf[i] = np.std(df.loc[df['ID']==frog, 'impf']) / np.sqrt(n)
print(mean_impf)
print(sem_impf)
We can compute the means and SEMs of all four frogs at once using the groupby() method of a DataFrame, and then using the mean() and sem() methods of the DataFrameGroupBy object. I show it here, but we will not go into these more advanced Pandas features in the bootcamp.
gb_frog = df.groupby('ID')
mean_impf = gb_frog['impf'].mean()
sem_impf = gb_frog['impf'].sem()
print(mean_impf)
print(sem_impf)
The differences in the SEMs arise because the standard deviations are calculated differently: np.std() computes the population standard deviation by default (ddof=0), while the sem() method uses the sample standard deviation (ddof=1).
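To see where the difference comes from, here is a small sketch on made-up numbers (not the frog data; the variable names are only for illustration). It computes the SEM with the population standard deviation, with the sample standard deviation, and with the Pandas sem() method. Passing ddof=1 to np.std() should bring the NumPy result into agreement with Pandas.

```python
import numpy as np
import pandas as pd

# Made-up measurements for illustration only (not the frog data)
x = pd.Series([1.0, 2.0, 4.0, 7.0])
n = len(x)

# np.std() uses ddof=0 by default (population standard deviation)
sem_population = np.std(x.values) / np.sqrt(n)

# With ddof=1 we get the sample standard deviation
sem_sample = np.std(x.values, ddof=1) / np.sqrt(n)

# The Pandas sem() method uses ddof=1 by default
sem_pandas = x.sem()

print(sem_population, sem_sample, sem_pandas)
```

The two conventions differ by a factor of sqrt(n / (n - 1)), so the gap shrinks as the number of measurements grows.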
Now that we have our means and SEMs, we can plot a bar graph using Matplotlib. We need to specify the x positions of the bars and their heights. We have to pass lots of kwargs to make the plot look reasonable, like yerr=sem_impf to get the error bars, ecolor='black' to make the error bars black, tick_label=['I', 'II', 'III', 'IV'] to specify the names of the frogs, and align='center' to center each bar (and its tick label) on its x position.
plt.bar(np.arange(4), mean_impf, yerr=sem_impf, ecolor='black',
        tick_label=['I', 'II', 'III', 'IV'], align='center')
plt.ylabel('impact force (mN)')
If you have your data in a tidy DataFrame, Seaborn makes a lot of this plotting much easier for you. Watch this one-liner (with a few extra lines to get the axis labels right).
sns.barplot(data=df, x='ID', y='impf')
plt.xlabel('')
plt.ylabel('impact force (mN)')
Yep, those for loops (or groupbys) were not necessary. We just specify the x and y values and the DataFrame that is the source of our data, and Seaborn does the rest. It is a powerful package for visualization of data using Matplotlib.
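One thing worth knowing about "the rest": by default, sns.barplot() draws the mean of each group, and its error bars come from a bootstrap confidence interval (95% by default) rather than the SEM, so they will not match the hand-built plot exactly. The following is only a rough sketch of that idea on made-up numbers, not Seaborn's actual implementation; the seed, the data, and the variable names are all assumptions for illustration.

```python
import numpy as np

# Seeded random number generator so the sketch is reproducible
rng = np.random.default_rng(3252)

# Made-up measurements for illustration only (not the frog data)
x = np.array([1200.0, 1500.0, 900.0, 1800.0, 1300.0, 1100.0, 1600.0, 1400.0])

# Resample the data with replacement many times, recording the mean each time
n_boot = 2000
boot_means = np.array(
    [np.mean(rng.choice(x, size=len(x))) for _ in range(n_boot)]
)

# The middle 95% of the bootstrap means gives an approximate confidence interval
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(np.mean(x), ci_low, ci_high)
```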
Bar graphs, especially with error bars, are typically awful. They are pervasive in biology papers. I have yet to find a single example where a bar graph is the best choice. Bee swarm plots or box plots, shown below, are more informative and almost always preferred. Here is a simple message: don't make bar graphs.
What should I do instead, you ask? Well, read on, my loyal bootcamper!
When you can, plot all of your data. Bee swarm plots are a great way to do this. They are best seen by example. The syntax is exactly the same as for the bar graphs.
sns.swarmplot(data=df, x='ID', y='impf')
plt.margins(0.02)
plt.xlabel('')
plt.ylabel('impact force (mN)')
Ah, now we see all of the data. Very useful! We can see the spread in Frog I, and we can see that Frog II is capable of hitting hard, but generally is lazy.
Now, we might be concerned that the frogs are hitting with different impact forces on different days. We can specify that with the hue kwarg. We color each point in the bee swarm plot with the date of the measurement.
ax = sns.swarmplot(data=df, x='ID', y='impf', hue='date')
ax.legend_.remove()
plt.margins(0.02)
plt.xlabel('')
plt.ylabel('impact force (mN)')
There does not seem to be any particular trend in the data depending on which day the measurements were taken. The frogs do not seem to care.
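If you want a numerical complement to that visual check, one option is to group by both frog and date and compare the means. The sketch below uses a tiny made-up DataFrame (the dates and forces are invented for illustration; only the column names mirror the ones used above).

```python
import pandas as pd

# Made-up data for illustration only; the column names mirror the tutorial's
df_toy = pd.DataFrame({
    'ID': ['I', 'I', 'I', 'I', 'II', 'II', 'II', 'II'],
    'date': ['2013_02_26', '2013_02_26', '2013_03_01', '2013_03_01',
             '2013_03_05', '2013_03_05', '2013_03_12', '2013_03_12'],
    'impf': [1200.0, 1400.0, 1300.0, 1500.0, 600.0, 800.0, 500.0, 700.0],
})

# Mean impact force for each frog on each day
mean_by_day = df_toy.groupby(['ID', 'date'])['impf'].mean()

print(mean_by_day)
```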
Sometimes we have too many points to plot a bee swarm plot. When this is the case, a box plot is a good alternative. That is not the case here, but we'll make a box plot anyhow to demonstrate how it works. Actually, the syntax is exactly the same as for bee swarm plots.
sns.boxplot(data=df, x='ID', y='impf')
plt.margins(0.02)
plt.xlabel('')
plt.ylabel('impact force (mN)')
Box plots provide a nice summary of the data, giving substantially more information than a bar graph. The line in the middle of the box is the median, and the top and bottom of the box are at the 75th and 25th percentiles, respectively. The distance between the 25th and 75th percentiles is called the interquartile range, or IQR. Each whisker extends from the box a distance equal to 1.5 times the IQR, or to the extent of the data, whichever is less extreme. If data points are more extreme, they are shown individually, and are often referred to as outliers.
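The quantities in a box plot are easy to compute by hand. Here is a minimal sketch on made-up numbers (not the frog data), assuming the convention that each whisker ends at the most extreme data point lying within 1.5 times the IQR of the box. The variable names are just for illustration.

```python
import numpy as np

# Made-up data for illustration only (not the frog data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

# Median and quartiles (np.percentile interpolates linearly by default)
q25, median, q75 = np.percentile(x, [25, 50, 75])
iqr = q75 - q25

# Whiskers end at the most extreme data points within 1.5 IQR of the box
whisker_low = x[x >= q25 - 1.5 * iqr].min()
whisker_high = x[x <= q75 + 1.5 * iqr].max()

# Points beyond the whiskers are the ones drawn individually
outliers = x[(x < whisker_low) | (x > whisker_high)]

print(q25, median, q75, iqr)
print(whisker_low, whisker_high, outliers)
```

In this toy data set, the single large value sits beyond the upper whisker and would be drawn as an individual point.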
I think you now see that Seaborn enables you to make beautiful graphics without too much pain. Enjoy using it in your research!