(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This tutorial was generated from a Jupyter notebook. You can download the notebook here.
import numpy as np
import pandas as pd
# Plotting modules and settings.
import matplotlib.pyplot as plt
import seaborn as sns
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728',
          '#9467bd', '#8c564b', '#e377c2', '#7f7f7f',
          '#bcbd22', '#17becf']
sns.set(style='whitegrid', palette=colors, rc={'axes.labelsize': 16})
# The following is specific to Jupyter notebooks
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}
In the last lesson, we learned about Pandas and dipped our toe in to see its power. In this lesson, we will work with a more complicated data set and use Pandas to handle it and pull out what we need.
The data set comes from Kleinteich and Gorb, Sci. Rep., 4, 5225, 2014, and was featured in the New York Times. They measured several properties of the tongue strikes of horned frogs. Let's take a look at the data set, which is in the file ~git/data/frog_tongue_adhesion.csv.
The output of !head -n 20 data/frog_tongue_adhesion.csv:
# These data are from the paper,
# Kleinteich and Gorb, Sci. Rep., 4, 5225, 2014.
# It was featured in the New York Times.
# http://www.nytimes.com/2014/08/25/science/a-frog-thats-a-living-breathing-pac-man.html
#
# The authors included the data in their supplemental information.
#
# Importantly, the ID refers to the identifites of the frogs they tested.
# I: adult, 63 mm snout-vent-length (SVL) and 63.1 g body weight,
# Ceratophrys cranwelli crossed with Ceratophrys cornuta
# II: adult, 70 mm SVL and 72.7 g body weight,
# Ceratophrys cranwelli crossed with Ceratophrys cornuta
# III: juvenile, 28 mm SVL and 12.7 g body weight, Ceratophrys cranwelli
# IV: juvenile, 31 mm SVL and 12.7 g body weight, Ceratophrys cranwelli
date,ID,trial number,impact force (mN),impact time (ms),impact force / body weight,adhesive force (mN),time frog pulls on target (ms),adhesive force / body weight,adhesive impulse (N-s),total contact area (mm2),contact area without mucus (mm2),contact area with mucus / contact area without mucus,contact pressure (Pa),adhesive strength (Pa)
2013_02_26,I,3,1205,46,1.95,-785,884,1.27,-0.290,387,70,0.82,3117,-2030
2013_02_26,I,4,2527,44,4.08,-983,248,1.59,-0.181,101,94,0.07,24923,-9695
2013_03_01,I,1,1745,34,2.82,-850,211,1.37,-0.157,83,79,0.05,21020,-10239
2013_03_01,I,2,1556,41,2.51,-455,1025,0.74,-0.170,330,158,0.52,4718,-1381
2013_03_01,I,3,493,36,0.80,-974,499,1.57,-0.423,245,216,0.12,2012,-3975
The first lines all begin with # signs, signifying that they are comments and not data. They do give important information, though, such as the meaning of the ID data. The ID refers to which specific frog was tested.
Immediately after the comments, we have a row of comma-separated headers. This row sets the number of columns in this data set and labels the meaning of the columns. So, we see that the first column is the date of the experiment, the second column is the ID of the frog, the third is the trial number, and so on.
After this row, each row represents a single experiment in which the frog struck the target. So, these data are already in tidy format. Let's go ahead and load the data into a DataFrame.
# Load the data
df = pd.read_csv('data/frog_tongue_adhesion.csv', comment='#')
# Take a look
df
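To see what comment='#' does without needing the file on disk, here is a minimal sketch that parses the first lines of the file from an in-memory string. The comment block is shortened to two lines, and the name df_head is just for illustration.

```python
import io

import pandas as pd

# First lines of the file as shown above, with the comment block shortened.
csv_text = """# These data are from the paper,
# Kleinteich and Gorb, Sci. Rep., 4, 5225, 2014.
date,ID,trial number,impact force (mN),impact time (ms),impact force / body weight,adhesive force (mN),time frog pulls on target (ms),adhesive force / body weight,adhesive impulse (N-s),total contact area (mm2),contact area without mucus (mm2),contact area with mucus / contact area without mucus,contact pressure (Pa),adhesive strength (Pa)
2013_02_26,I,3,1205,46,1.95,-785,884,1.27,-0.290,387,70,0.82,3117,-2030
2013_02_26,I,4,2527,44,4.08,-983,248,1.59,-0.181,101,94,0.07,24923,-9695
2013_03_01,I,1,1745,34,2.82,-850,211,1.37,-0.157,83,79,0.05,21020,-10239
2013_03_01,I,2,1556,41,2.51,-455,1025,0.74,-0.170,330,158,0.52,4718,-1381
2013_03_01,I,3,493,36,0.80,-974,499,1.57,-0.423,245,216,0.12,2012,-3975
"""

# Lines starting with '#' are skipped, so the first remaining line
# is used as the header row.
df_head = pd.read_csv(io.StringIO(csv_text), comment='#')

print(df_head.shape)
print(df_head.columns[:4].tolist())
```

The commented lines never reach the parser, so the column names come from the comma-separated header row, and each remaining line becomes one row of the DataFrame.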
We can now access various subsets of the data. For example, let's say we are only interested in strong strikes, i.e., those with an impact force above one newton (1000 mN). We just use Boolean indexing with .loc to pull those rows out.
# Slice out big forces
df_big_force = df.loc[df['impact force (mN)'] > 1000, :]
# Look at it
df_big_force
Notice that the indices of the individual measurements did not change! This newly formed DataFrame does not have an index 4, for example. You can think of the indices as labels on experiments/observations, not as the ordering in an array. In fact, ordering is not really relevant at all in a DataFrame (except if the indices are somehow ordered, which is often the case when using DataFrames for time series data; the indices are time points).
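The difference between index labels and positions can be sketched with the five rows shown in the file preview. The small DataFrame below (df_small, a name used only here) keeps just two of the columns; .iloc and reset_index() are standard Pandas tools that this lesson has not otherwise introduced.

```python
import pandas as pd

# The five strikes from the file preview, restricted to two columns.
df_small = pd.DataFrame({
    'impact force (mN)': [1205, 2527, 1745, 1556, 493],
    'adhesive force (mN)': [-785, -983, -850, -455, -974],
})

# Strong strikes keep their original labels 0 through 3; label 4 is absent.
strong = df_small.loc[df_small['impact force (mN)'] > 1000, :]
print(strong.index.tolist())

# The one weak strike keeps its label, 4, even though it is the
# first (and only) row of the new DataFrame.
weak = df_small.loc[df_small['impact force (mN)'] <= 1000, :]
print(weak.index.tolist())

# .loc looks up by label, .iloc by position.
by_label = weak.loc[4, 'impact force (mN)']
by_position = weak.iloc[0]['impact force (mN)']

# If fresh labels 0, 1, 2, ... are wanted, reset the index.
weak_renumbered = weak.reset_index(drop=True)
```

Keeping the original labels is usually what we want, since each row can still be traced back to its experiment in the full data set.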
We can also select a single experiment (very convenient that the DataFrame is tidy!). As we learned last time, we use .loc.
df.loc[42, :]
Conveniently, we get all the data and labels we need.
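As a small sketch of what comes back from selecting a single row, the example below uses a few columns from the five rows in the file preview (the name df_small is just for illustration). The result is a Pandas Series whose index holds the column headers.

```python
import pandas as pd

# A few columns from the five strikes shown in the file preview.
df_small = pd.DataFrame({
    'date': ['2013_02_26', '2013_02_26', '2013_03_01',
             '2013_03_01', '2013_03_01'],
    'ID': ['I', 'I', 'I', 'I', 'I'],
    'impact force (mN)': [1205, 2527, 1745, 1556, 493],
    'adhesive force (mN)': [-785, -983, -850, -455, -974],
})

# Selecting one row label with all columns gives a Series.
strike = df_small.loc[2, :]

print(type(strike))
print(strike.index.tolist())        # the column headers
print(strike['impact force (mN)'])  # a single measurement
```

Because the Series carries the column headers as its own index, each value stays attached to its description and units.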
We can select multiple columns by giving a list of column headers.
df.loc[:, ['impact force (mN)', 'adhesive force (mN)']]
We can also do Boolean indexing using .loc. Say we want Frog I's impact force and adhesive force.
df.loc[df['ID']=='I', ['impact force (mN)', 'adhesive force (mN)']]
This usage, combining Boolean indexing with column selection, is probably the most often used technique with Pandas. At least it is in my workflow.
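Several conditions can be combined in the same way. The sketch below uses the five rows from the file preview (all from Frog I; df_small is an illustrative name) and selects the strong strikes recorded on 2013_03_01. Each condition needs its own parentheses, and the conditions are joined with & (and) or | (or) rather than the Python keywords and/or.

```python
import pandas as pd

# A few columns from the five strikes shown in the file preview.
df_small = pd.DataFrame({
    'date': ['2013_02_26', '2013_02_26', '2013_03_01',
             '2013_03_01', '2013_03_01'],
    'ID': ['I', 'I', 'I', 'I', 'I'],
    'impact force (mN)': [1205, 2527, 1745, 1556, 493],
    'adhesive force (mN)': [-785, -983, -850, -455, -974],
})

# Build the Boolean mask from two element-wise comparisons.
mask = (df_small['date'] == '2013_03_01') & (df_small['impact force (mN)'] > 1000)

# Rows where the mask is True, and only the two force columns.
selected = df_small.loc[mask, ['impact force (mN)', 'adhesive force (mN)']]
print(selected)
```

Storing the mask in its own variable keeps the .loc call readable when the conditions get long.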
Now that we know how to slice out columns, we can start to look for correlations. For example, we might think that the impact force and the adhesive force are correlated. Let's make a plot to check.
fig, ax = plt.subplots(1, 1)
ax.set_xlabel('impact force (mN)')
ax.set_ylabel('adhesive force (mN)')
_ = ax.plot(df['impact force (mN)'], df['adhesive force (mN)'], marker='.',
            linestyle='none')
There does, in fact, seem to be some correlation. We could try any pair. As a trick to allow fast plotting, we will use the DataFrame's built-in plot() method, which allows us to quickly make a plot with the axes already labeled.
df.plot(x='total contact area (mm2)', y='adhesive force (mN)', kind='scatter');
There are other slick methods built in to DataFrames. For example, we can quickly compute the Pearson correlation between all pairs of data with the corr() method of the DataFrame.
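# Use only the numeric columns; recent versions of Pandas need this stated
# explicitly because the date and ID columns hold strings
df.corr(numeric_only=True)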
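If only one pair of columns is of interest, the corr() method of a Series gives a single number. The sketch below uses the five rows from the file preview (df_small is an illustrative name), so the values are not representative of the full data set. It also shows select_dtypes(), one way to drop string columns such as ID before computing the full correlation matrix.

```python
import numpy as np
import pandas as pd

# A few columns from the five strikes shown in the file preview.
df_small = pd.DataFrame({
    'ID': ['I', 'I', 'I', 'I', 'I'],
    'impact force (mN)': [1205, 2527, 1745, 1556, 493],
    'adhesive force (mN)': [-785, -983, -850, -455, -974],
    'total contact area (mm2)': [387, 101, 83, 330, 245],
})

# Pearson correlation between two specific columns.
r = df_small['impact force (mN)'].corr(df_small['adhesive force (mN)'])

# The same quantity computed with NumPy, as a cross-check.
r_np = np.corrcoef(df_small['impact force (mN)'],
                   df_small['adhesive force (mN)'])[0, 1]

# Correlation matrix over the numeric columns only.
corr_matrix = df_small.select_dtypes(include='number').corr()
print(corr_matrix)
```

The matrix is symmetric with ones on the diagonal, since every column is perfectly correlated with itself.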
It is getting kind of cumbersome indexing with the long column headings. The headings are nonetheless useful, since they are descriptive and we never have problems losing track of units. But let's say we wanted to change impact force (mN) to impf for easier indexing. We can use the rename() method of DataFrames. We just pass in a dictionary mapping the old names of the columns we want to rename to their new names.
# Rename the impact force column
df = df.rename(columns={'impact force (mN)': 'impf'})
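The dictionary can hold more than one entry, and rename() returns a new DataFrame rather than modifying the original, which is why the result is assigned back to a variable. A minimal sketch, using two columns from the file preview; the short name adhf is made up here for illustration and does not appear in the lesson.

```python
import pandas as pd

# Two columns from the five strikes shown in the file preview.
df_small = pd.DataFrame({
    'impact force (mN)': [1205, 2527, 1745, 1556, 493],
    'adhesive force (mN)': [-785, -983, -850, -455, -974],
})

# Keys are the existing headers, values are the new ones.
# 'adhf' is a hypothetical short name chosen for this example.
renamed = df_small.rename(columns={'impact force (mN)': 'impf',
                                   'adhesive force (mN)': 'adhf'})

print(renamed.columns.tolist())   # the new, shorter headers
print(df_small.columns.tolist())  # the original is unchanged
```

The trade-off is that the short names no longer carry the units, so it is worth keeping the original headers somewhere for reference.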
We will explore the power of Pandas more in the next lesson, when we practice with DataFrames.