This tutorial was generated from a Jupyter notebook. You can download the notebook here.
# NumPy of course
import numpy as np
# Pandas, conventionally imported as pd
import pandas as pd
# This is how we import the module of Matplotlib we'll be using
import matplotlib.pyplot as plt
# Seaborn makes plots pretty!
import seaborn as sns
# PyBeeswarm for some better data display
import beeswarm
# Magic function to make matplotlib inline; other style specs must come AFTER
%matplotlib inline
# This enables SVG graphics inline (only use with static plots (non-Bokeh))
%config InlineBackend.figure_format = 'svg'
# Set JB's favorite Seaborn settings
rc={'lines.linewidth': 2, 'axes.labelsize': 18, 'axes.titlesize': 18,
'axes.facecolor': 'DFDFE5'}
sns.set_context('notebook', rc=rc)
In the last lesson, we learned about Pandas and dipped our toe in to see its power. In this lesson, we will work with a more complicated data set and use Pandas to handle it and pull out what we need.
# Use ! to invoke a shell command. Use head to look at top 20 lines of file.
!head -n 20 ../data/frog_tongue_adhesion.csv
The first lines all begin with # signs, signifying that they are comments and not data. They do give important information, though, such as the meaning of the ID data. The ID refers to which specific frog was tested.
Immediately after the comments, we have a row of comma-separated headers. This row sets the number of columns in this data set and labels the meaning of the columns. So, we see that the first column is the date of the experiment, the second column is the ID of the frog, the third is the trial number, and so on.
After this row, each row represents a single experiment in which the frog struck the target. So, these data are already in tidy format. Let's go ahead and load the data into a DataFrame.
# Load the data
df = pd.read_csv('../data/frog_tongue_adhesion.csv', comment='#')
# Take a look
df
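To see what the comment keyword argument does in isolation, here is a minimal sketch that reads a small commented CSV from an in-memory string. The text in csv_text and the name df_demo are made up for illustration and are not the real frog data.

```python
import io

import pandas as pd

# Hypothetical stand-in for a commented CSV file (not the real data set)
csv_text = """# These lines are comments and are skipped
# ID is the identity of the animal tested
date,ID,impact force (mN)
2013_02_26,I,1205
2013_03_01,II,493
"""

# comment='#' tells read_csv to ignore the rest of a line after a '#'
df_demo = pd.read_csv(io.StringIO(csv_text), comment='#')

# The header row is the first line that is not a comment
print(df_demo.columns.tolist())
print(len(df_demo))
```

Because the fully commented lines are skipped, the header is taken from the first non-comment line, just as with the frog data file.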
We can now access various subsets of the data. For example, let's say we are only interested in strong strikes, i.e., those with an impact force above one newton (1000 mN). We just use Boolean slicing to pull those out.
# Slice out big forces
df_big_force = df[df['impact force (mN)'] > 1000]
# Look at it
df_big_force
Notice that the indices of the individual measurements did not change! This newly formed DataFrame does not have an index 4, for example. You can think of the indices as labels on experiments/observations, not as the ordering in an array. In fact, ordering is not really relevant at all in a DataFrame.
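The way Boolean slicing keeps the original index labels can be checked on a toy example. This is only a sketch; the DataFrame df_toy and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
df_toy = pd.DataFrame({'ID': ['I', 'I', 'II', 'II', 'III'],
                       'force': [1205, 2527, 493, 2276, 556]})

# The comparison gives a Boolean Series with one entry per row
mask = df_toy['force'] > 1000

# Indexing with the mask keeps only the True rows, with their original labels
df_toy_big = df_toy[mask]
print(df_toy_big.index.tolist())

# If consecutive labels are wanted, reset_index(drop=True) renumbers the rows
df_toy_renumbered = df_toy_big.reset_index(drop=True)
print(df_toy_renumbered.index.tolist())
```

Keeping the original labels is usually what we want, since each label still identifies the same observation in the full data set.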
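We can also select a single experiment (very convenient that the DataFrame is tidy!). As we learned last time, we can pull out a row by its index label. In current versions of Pandas this is done with the loc attribute of the DataFrame (the older ix attribute was removed in Pandas 1.0).
df.loc[42]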
Conveniently, we get all the data and labels we need.
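In current versions of Pandas, rows are selected by label with loc and by position with iloc. The two differ once rows have been sliced out, as this small sketch with made-up data (df_toy and subset are illustrative names) shows.

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0, 1, 2, 3
df_toy = pd.DataFrame({'force': [10, 20, 30, 40]})

# Boolean slicing drops the first row, leaving labels 1, 2, 3
subset = df_toy[df_toy['force'] > 15]

# loc looks up the row whose index label is 2
by_label = subset.loc[2]

# iloc looks up the row in position 2 (the third remaining row)
by_position = subset.iloc[2]

print(by_label['force'], by_position['force'])
```

Selecting by label matches the view of indices as labels on observations rather than positions in an array.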
We can select multiple columns by giving a list of column headers.
df[['impact force (mN)', 'adhesive force (mN)']]
Now that we know how to slice out columns, we can start to look for correlations. For example, we might expect the impact force and the adhesive force to be correlated. Let's make a plot to check.
# Using Pandas automatically gives axis labels
df.plot(x='impact force (mN)', y='adhesive force (mN)', kind='scatter')
There does, in fact, seem to be some correlation. We could try any pair.
df.plot(x='total contact area (mm2)', y='adhesive force (mN)', kind='scatter')
We can quickly compute the Pearson correlation between all pairs of data with the corr() method of the DataFrame.
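# Use only numeric columns; date and ID hold strings (numeric_only needs Pandas 1.5 or later)
df.corr(numeric_only=True)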
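As a check on what the correlation matrix contains, the sketch below computes the Pearson correlation coefficient by hand for toy data and compares it with the result from corr(). The data and names (df_toy, r_manual) are invented for illustration. Non-numeric columns are dropped first with select_dtypes(), since a correlation is only defined for numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'label' stands in for a non-numeric column like ID
df_toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                       'y': [2.1, 3.9, 6.2, 7.8],
                       'label': ['a', 'b', 'c', 'd']})

# Keep only the numeric columns, then compute all pairwise correlations
corr_matrix = df_toy.select_dtypes(include='number').corr()

# Pearson r by hand: covariance divided by the product of the spreads
dx = df_toy['x'] - df_toy['x'].mean()
dy = df_toy['y'] - df_toy['y'].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

print(corr_matrix.loc['x', 'y'], r_manual)
```

The diagonal of the matrix is always one, since every column is perfectly correlated with itself.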
It is getting kind of cumbersome indexing with the long column headings. The headings are nonetheless useful, since they are descriptive and we never have problems losing track of units. But let's say we wanted to change impact force (mN) to impf for easier indexing. We can use the rename() method of DataFrames. We just pass in a dictionary mapping the columns we want to rename to their new names.
# Rename the impact force column
df = df.rename(columns={'impact force (mN)': 'impf'})
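One detail worth seeing on a small example is that rename() returns a new DataFrame and leaves the original alone, which is why the result is assigned back to df above. In this sketch the data are made up, and the short name adhf is a hypothetical abbreviation chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data with long, descriptive column names
df_toy = pd.DataFrame({'impact force (mN)': [1205, 493],
                       'adhesive force (mN)': [-785, -586]})

# Several columns can be renamed at once; unlisted columns keep their names
df_short = df_toy.rename(columns={'impact force (mN)': 'impf',
                                  'adhesive force (mN)': 'adhf'})

# The original DataFrame is unchanged
print(df_toy.columns.tolist())
print(df_short.columns.tolist())
```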
Instead of pooling all the data together, we might want to look at each frog. We'll create a beeswarm plot of the impact forces generated from each of the four frogs.
# Separate impact forces by frog
list_of_impfs = [df['impf'][df.ID=='I'], df['impf'][df.ID=='II'],
df['impf'][df.ID=='III'], df['impf'][df.ID=='IV']]
# Generate a beeswarm plot
beeswarm.beeswarm(list_of_impfs, labels=['I', 'II', 'III', 'IV'])
plt.ylabel('impact force (mN)')
We see that frog I was a big hitter, but pretty inconsistent. The juvenile frogs, III and IV, strike with much less force. They are much smaller, and are in fact a different species.
This plot is very simple, and using Pandas DataFrames made it very easy to generate.
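Writing out one Boolean slice per frog works for four frogs, but it gets tedious as the number of groups grows. One alternative, sketched here with made-up data rather than the real measurements, is the groupby() method, which splits a column by the values in another column.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
df_toy = pd.DataFrame({'ID': ['I', 'I', 'II', 'II', 'III', 'IV'],
                       'impf': [1205, 2527, 493, 2276, 556, 172]})

# Group the impact forces by frog ID (groups come out sorted by key)
grouped = df_toy.groupby('ID')['impf']

# Build the list of per-frog Series and the matching labels
labels = [frog_id for frog_id, _ in grouped]
list_of_impfs = [values for _, values in grouped]

# The same grouping gives per-frog summary statistics
summary = grouped.agg(['mean', 'std'])
print(labels)
print(summary)
```

The resulting list_of_impfs and labels have the same form as the list and labels passed to the beeswarm plot above.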
We may want to check if frog I is moody from day to day. So, we can plot the impact force as a function of date. We first have to convert the strings like '2013_02_26' to dates that Pandas recognizes.
# Convert to dates
df['date'] = pd.to_datetime(df['date'], format='%Y_%m_%d')
# Make a plot for frog I of impact force vs date
df[df['ID']=='I'].plot(x='date', y='impf', kind='line', marker='o', linestyle='',
legend=False)
plt.ylabel('impact force (mN)')
So, there does not appear to be any correlation between the day on which frog I strikes and how hard he or she strikes. Remember, it was as simple as using the df[df['ID']=='I'] syntax to pull out these data! Pandas is a powerful ally.
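The date conversion used above relies on a format string to tell Pandas how the dates are written. As a minimal sketch with made-up date strings, here is what the conversion buys us: once the strings are real dates, we can do arithmetic with them and pull out their parts.

```python
import pandas as pd

# Hypothetical date strings in the same year_month_day style as the data
dates = pd.Series(['2013_02_26', '2013_03_01', '2013_03_05'])

# %Y is a four-digit year, %m a two-digit month, %d a two-digit day
parsed = pd.to_datetime(dates, format='%Y_%m_%d')

# Differences between dates are time intervals we can compute with
days_between = (parsed.iloc[1] - parsed.iloc[0]).days

# The .dt accessor exposes parts of each date, such as the month
months = parsed.dt.month.tolist()

print(days_between, months)
```

Giving the format explicitly avoids any guessing about how the strings should be interpreted.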