(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This tutorial was generated from a Jupyter notebook.
import numpy as np
import pandas as pd
import bootcamp_utils
# Plotting modules and settings.
import matplotlib.pyplot as plt
import seaborn as sns
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728',
'#9467bd', '#8c564b', '#e377c2', '#7f7f7f',
'#bcbd22', '#17becf']
sns.set(style='whitegrid', palette=colors, rc={'axes.labelsize': 16})
# The following is specific to Jupyter notebooks
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}
Now that we have some experience with Pandas, we can use Pandas DataFrames along with Seaborn for more plotting applications. We'll start by looking at the striking forces of the frogs we studied last time.
# Load the data
df = pd.read_csv('data/frog_tongue_adhesion.csv', comment='#')
Let's say we want to make a bar graph of the impact forces of the four frogs we tested. First we'll do it by selecting items from the DataFrame and hand-building a bar graph using Matplotlib.
If we want the mean impact force for each frog, we can select the frog and compute the mean.
# Mean impact force of frog I
np.mean(df.loc[df['ID']=='I', 'impact force (mN)'])
So, let's write a for loop to compute all of these means and the standard errors of the mean (SEMs).
mean_impf = np.empty(4)
sem_impf = np.empty(4)
for i, frog in enumerate(df['ID'].unique()):
    mean_impf[i] = np.mean(df.loc[df['ID']==frog, 'impact force (mN)'])
    n = np.sum(df['ID']==frog)
    sem_impf[i] = np.std(df.loc[df['ID']==frog, 'impact force (mN)']) / np.sqrt(n)
print(mean_impf)
print(sem_impf)
We can compute the means and SEMs of all four frogs at once using the groupby() method of a DataFrame, and then using the mean() and sem() methods of the DataFrameGroupBy object.
gb_frog = df.groupby('ID')
mean_impf = gb_frog['impact force (mN)'].mean()
sem_impf = gb_frog['impact force (mN)'].sem()
print(mean_impf)
print(sem_impf)
The differences in the SEMs calculated by the different methods are due to slight differences in how the standard deviations are calculated (sample vs population standard deviations).
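To make that difference concrete, here is a minimal sketch using made-up numbers (not the frog data). It relies on the defaults of the two libraries: np.std() uses ddof=0 (population standard deviation), while the Pandas sem() method uses ddof=1 (sample standard deviation).

```python
import numpy as np
import pandas as pd

# Made-up measurements for illustration only (not from the frog data set)
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

# NumPy's default is the population standard deviation (ddof=0)
sem_population = np.std(x.values) / np.sqrt(n)

# The Pandas sem() method uses the sample standard deviation (ddof=1)
sem_sample = x.sem()

# Passing ddof=1 to np.std() reproduces the Pandas result
sem_numpy_sample = np.std(x.values, ddof=1) / np.sqrt(n)

print(sem_population, sem_sample, sem_numpy_sample)
```

The sample-based SEM is always slightly larger, and the gap shrinks as the number of measurements grows.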
Now that we have our means and SEMs, we can plot a bar graph using Matplotlib. We need to specify the x positions of the bars (in Matplotlib 2.0 and later, these are the bar centers by default) and their heights. We also have to pass two kwargs: yerr=sem_impf to get the error bars, and tick_label=['I', 'II', 'III', 'IV'] to specify the names of the frogs.
fig, ax = plt.subplots(1, 1)
ax.set_ylabel('impact force (mN)')
# Turn off grid lines for x-axis
ax.grid(False, axis='x')
_ = ax.bar(np.arange(4), mean_impf, yerr=sem_impf, tick_label=['I', 'II', 'III', 'IV'])
If you have your data in a tidy DataFrame, Seaborn makes a lot of this plotting much easier for you. Watch this one-liner (with a few extra lines to get the axis labels right).
ax = sns.barplot(data=df, x='ID', y='impact force (mN)')
ax.set_xlabel('')
ax.set_ylabel('impact force (mN)');
Yep, those for loops (or groupbys) were not necessary. We just specify the x and y values and the DataFrame that is the source of our data, and Seaborn does the rest, even getting the grid lines right, provided, of course, that your DataFrame is tidy. It is a powerful package for visualization of data using Matplotlib.
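Tidy here means that each row is a single measurement and each column is a variable. As a minimal sketch with made-up numbers (not the frog data), a wide table with one column per frog can be converted to tidy format with the melt() method of a DataFrame:

```python
import pandas as pd

# Made-up wide-format table: one column per frog (illustration only)
df_wide = pd.DataFrame({'I': [1.0, 2.0], 'II': [3.0, 4.0]})

# melt() stacks the columns so that each row holds one measurement,
# with the former column name stored in the 'ID' column
df_tidy = df_wide.melt(var_name='ID', value_name='impact force (mN)')

print(df_tidy)
```

The resulting DataFrame has one 'ID' column and one 'impact force (mN)' column, which is the layout the x and y kwargs of the Seaborn plotting functions expect.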
Bar graphs, especially with error bars, are typically awful, yet they are pervasive in biology papers. I have yet to find a single example where a bar graph is the best choice. Bee swarm plots or even box plots, shown below, are more informative and almost always preferred. In fact, ECDFs are often better even than these. Whether you use bee swarm plots or ECDFs, the message is simple: don't make bar graphs.
What should I do instead, you ask? Well, read on, my loyal bootcamper!
When you can, plot all of your data. Bee swarm plots are a great way to do this. They are best seen by example. The syntax is exactly the same as for the bar graphs.
ax = sns.swarmplot(data=df, x='ID', y='impact force (mN)')
ax.set_xlabel('')
ax.set_ylabel('impact force (mN)');
Ah, now we see all of the data. Very useful! We can see the spread in Frog I, and we can see that Frog II is capable of hitting hard, but generally is lazy.
Now, we might be concerned that the frogs are hitting with different impact forces on different days. We can distinguish the days with the hue kwarg, which colors each point in the bee swarm plot according to the date of the measurement.
ax = sns.swarmplot(data=df, x='ID', y='impact force (mN)', hue='date')
ax.set_xlabel('')
ax.set_ylabel('impact force (mN)')
# Remove the legend, it is too large and cumbersome
ax.legend_.remove()
There does not seem to be any particular trend in the data depending on which day the measurements were taken. The frogs do not seem to care.
Sometimes we have too many points to plot a bee swarm plot. When this is the case, a box plot is a good alternative (though I generally prefer just plotting all of the ECDFs). That is not the case here, but we'll make a box plot anyhow to demonstrate how it works. Actually, the syntax is exactly the same as for bee swarm plots.
ax = sns.boxplot(data=df, x='ID', y='impact force (mN)')
ax.set_xlabel('')
ax.set_ylabel('impact force (mN)');
Box plots provide a summary of the data, giving substantially more information than a bar graph. The line in the middle of the box is the median, and the top and bottom of the box are at the 75th and 25th percentiles, respectively. The distance between the 25th and 75th percentiles is called the interquartile range, or IQR. The whiskers of the box plot extend a distance equal to 1.5 times the interquartile range beyond the box, or to the extent of the data, whichever is less extreme. If data points are more extreme than the whiskers, they are shown individually, and are often referred to as outliers.
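The quantities in a box plot can be computed by hand. The following is a minimal sketch with made-up data (not the frog measurements), assuming the default linear interpolation of np.percentile() and the common convention that each whisker ends at the most extreme data point lying within 1.5 IQR of the box.

```python
import numpy as np

# Made-up data with one extreme value (illustration only)
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

# Bottom of box, middle line, and top of box
q25, median, q75 = np.percentile(data, [25, 50, 75])
iqr = q75 - q25

# Farthest a whisker is allowed to reach on either side of the box
lower_limit = q25 - 1.5 * iqr
upper_limit = q75 + 1.5 * iqr

# Whiskers end at the most extreme data points inside those limits
whisker_low = data[data >= lower_limit].min()
whisker_high = data[data <= upper_limit].max()

# Anything beyond the limits is drawn as an individual point
outliers = data[(data < lower_limit) | (data > upper_limit)]

print(q25, median, q75, iqr, whisker_low, whisker_high, outliers)
```

Here the value 30.0 lies beyond the upper limit, so it would be plotted as an individual point while the upper whisker stops at 9.0.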
I think you now see that Seaborn enables you to make beautiful and informative graphics without too much pain. Enjoy using it in your research!
I actually prefer just plotting the ECDFs for each frog to make comparisons. Seaborn does not have a built-in function to do this, but we can write our own. We will use the ECDF plotting function we wrote in Exercise 3.3. I'll assume you placed it in your bootcamp_utils module and have it available.
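If you do not have that function at hand, the core idea can be sketched in a few lines. The helper below, ecdf_dots(), is a hypothetical, simplified stand-in and not the Exercise 3.3 solution: it only produces the "dot" style values and does not accept the formal, buff, min_x, or max_x arguments used later in this lesson.

```python
import numpy as np

def ecdf_dots(data):
    """Hypothetical helper: x and y values for a "dot" style ECDF."""
    # Sorted data go on the x-axis
    x = np.sort(np.asarray(data, dtype=float))

    # y is the fraction of data points less than or equal to each x
    y = np.arange(1, len(x) + 1) / len(x)

    return x, y

# Made-up values for illustration only
x, y = ecdf_dots([3.0, 1.0, 2.0, 4.0])
print(x, y)
```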
# Import ecdf from bootcamp_utils so you can copy and paste
# function below into your bootcamp_utils module and also use it here
from bootcamp_utils import ecdf
def ecdf_plot(data, value, hue=None, formal=False, buff=0.1, min_x=None, max_x=None,
              ax=None):
    """
    Plot ECDFs of a column of a tidy DataFrame.

    Parameters
    ----------
    data : Pandas DataFrame
        Tidy DataFrame with data sets to be plotted.
    value : column name of DataFrame
        Name of column that contains data to make ECDF with.
    hue : column name of DataFrame
        Name of column that identifies labels of data. A separate
        ECDF is plotted for each unique entry.
    formal : bool, default False
        If True, generate `x` and `y` values for formal ECDF.
        Otherwise, generate `x` and `y` values for "dot" style ECDF.
    buff : float, default 0.1
        How long the tails at y = 0 and y = 1 should extend as a
        fraction of the total range of the data. Ignored if
        `formal` is False.
    min_x : float, default None
        Minimum value of `x` to include on plot. Overrides `buff`.
        Ignored if `formal` is False.
    max_x : float, default None
        Maximum value of `x` to include on plot. Overrides `buff`.
        Ignored if `formal` is False.
    ax : matplotlib Axes
        Axes object to draw the plot onto, otherwise makes a new
        figure/axes.

    Returns
    -------
    output : matplotlib Axes
        Axes object containing ECDFs.
    """
    # Set up axes
    if ax is None:
        fig, ax = plt.subplots(1, 1)
        ax.set_xlabel(str(value))
        ax.set_ylabel('ECDF')

    if hue is None:
        x, y = ecdf(data[value], formal=formal, buff=buff, min_x=min_x, max_x=max_x)

        # Make plots
        if formal:
            _ = ax.plot(x, y)
        else:
            _ = ax.plot(x, y, marker='.', linestyle='none')
    else:
        gb = data.groupby(hue)
        ecdfs = gb[value].apply(ecdf, formal=formal, buff=buff, min_x=min_x, max_x=max_x)

        # Make plots; items() replaces iteritems(), which was removed in Pandas 2.0
        if formal:
            for i, xy in ecdfs.items():
                _ = ax.plot(*xy)
        else:
            for i, xy in ecdfs.items():
                _ = ax.plot(*xy, marker='.', linestyle='none')

        # Add legend
        ax.legend(ecdfs.index, loc=0)

    return ax
ax = ecdf_plot(df, 'impact force (mN)', hue='ID')
Or, we could do it with formal ECDFs, which sometimes are easier to read when you have many data sets on the same plot.
ax = ecdf_plot(df, 'impact force (mN)', hue='ID', formal=True)
I suggest adding ecdf_plot() to your bootcamp_utils module so you have it. Because the function refers to Matplotlib as plt, be sure to include
import matplotlib.pyplot as plt
at the top of your bootcamp_utils.py file when you put ecdf_plot() in there.