Lesson 31: Practice with Pandas

(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import numpy as np
import pandas as pd

Pandas can be a bit frustrating during your first experiences with it. In this lesson, we will practice using Pandas. The more you use it, the more distant the memory of life without it will become.

Practice 1: Mastering .loc

We will again use the frog tongue adhesion data set. Your goal here is to extract certain entries out of the DataFrame. If it is not in your namespace, load in the DataFrame using pd.read_csv().

In [2]:
df = pd.read_csv('data/frog_tongue_adhesion.csv', comment='#')

a) Extract the impact time of all impacts that had an adhesive strength of magnitude greater than 2000 Pa.

b) Extract the impact force and adhesive force for all of Frog II's strikes.

c) Extract the adhesive force and the time the frog pulls on the target for juvenile frogs (Frogs III and IV).
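As a reminder of the indexing patterns these exercises call for, here is a minimal sketch on a small made-up DataFrame. The column names and values are illustrative assumptions, not taken from the frog data set.

```python
import pandas as pd

# Made-up data; column names are illustrative, not from the frog data set
toy = pd.DataFrame({
    'ID': ['A', 'A', 'B', 'C'],
    'speed (m/s)': [1.5, -3.0, 2.5, -0.5],
    'mass (g)': [10, 12, 20, 31],
})

# Rows where a condition on the magnitude holds, one column returned (a Series)
fast_mass = toy.loc[toy['speed (m/s)'].abs() > 2, 'mass (g)']

# Rows for a single ID, two columns returned (a DataFrame)
a_rows = toy.loc[toy['ID'] == 'A', ['speed (m/s)', 'mass (g)']]

# Rows matching any of several IDs
bc_rows = toy.loc[toy['ID'].isin(['B', 'C']), ['mass (g)']]
```

In each case the first argument to .loc is a Boolean mask selecting rows, and the second is a column label or a list of column labels.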

Practice 2: The power of groupby()

In addition to his tidy data paper, Hadley Wickham also wrote an influential paper on the split-apply-combine idea. To quote him, "You see the split-apply-combine strategy whenever you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together." An example of a split-apply-combine operation on our present data set is this:

  1. Split the data up into four groups, one for each of the four frogs.
  2. Compute the mean impact force for each frog. Here, we are applying the mean function to the split data.
  3. Combine the results into a new set of four data points containing the mean impact force of each frog.

You can imagine that split-apply-combine operations happen very often in analyzing biological data.

The groupby() method of Pandas enables convenient split-apply-combine. It works by splitting up a DataFrame based on some criterion. Once that happens, we can then apply a function to each of the split-up DataFrames. Upon application of the function, we get a recombined DataFrame with the result.

This is best shown by example. We'll compute the mean impact force of each frog. First, though, let's do it the "long way."

  1. Extract all of Frog I's impact forces and compute the mean.
  2. Do the same for the other three frogs.
  3. Write a for loop to do this and return a NumPy array with the four mean impact forces.

Well, we can actually do it with a list comprehension in one line, but you can imagine such operations can get clunky in a hurry.

In [10]:
[(frog_id, np.mean(df.loc[df['ID']==frog_id, 'impact force (mN)'])) 
                                             for frog_id in df['ID'].unique()]
Out[10]:
[('I', 1530.2), ('II', 707.35), ('III', 550.1), ('IV', 419.1)]

Now, unfortunately, you don't get a DataFrame out of this. You only get a list of tuples with the frog IDs and mean impact forces. It would require more code to get these back into a convenient DataFrame.
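To give a sense of that extra code, here is one possible sketch that turns the list of tuples into a DataFrame indexed by frog ID. The variable names means and df_long_way are made up for illustration.

```python
import pandas as pd

# The list of (frog ID, mean impact force) tuples from the output above
means = [('I', 1530.2), ('II', 707.35), ('III', 550.1), ('IV', 419.1)]

# Build a DataFrame from the tuples and label the columns by hand
df_long_way = pd.DataFrame(means, columns=['ID', 'impact force (mN)'])

# Use the frog ID as the index so we can look up values by frog
df_long_way = df_long_way.set_index('ID')
```

This handles only one variable; every additional column or summary statistic would need more hand-written bookkeeping.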

Using groupby() eliminates this hassle. Let's see by example.

In [18]:
# Make a GroupBy object
gb = df.groupby('ID')

# Apply the np.mean function to the grouped object
df_mean = gb.apply(np.mean)

# Look at the new DataFrame
df_mean
Out[18]:
ID | date | trial number | impact force (mN) | impact time (ms) | impact force / body weight | adhesive force (mN) | time frog pulls on target (ms) | adhesive force / body weight | adhesive impulse (N-s) | total contact area (mm2) | contact area without mucus (mm2) | contact area with mucus / contact area without mucus | contact pressure (Pa) | adhesive strength (Pa)
I | 1.006511e+158 | 2.85 | 1530.20 | 39.85 | 2.4715 | -658.40 | 798.55 | 1.0645 | -0.31375 | 264.35 | 135.80 | 0.3950 | 7644.75 | -3253.2
II | 1.006516e+158 | 2.15 | 707.35 | 37.65 | 1.6640 | -462.30 | 1443.70 | 1.0875 | -0.20625 | 195.25 | 31.75 | 0.7825 | 4568.80 | -3339.3
III | 1.006526e+158 | 2.25 | 550.10 | 43.35 | 4.4260 | -206.75 | 1047.70 | 1.6635 | -0.10275 | 91.05 | 34.65 | 0.4705 | 8322.30 | -2917.6
IV | 1.006526e+158 | 2.35 | 419.10 | 35.40 | 3.1200 | -263.60 | 1239.85 | 1.9640 | -0.12710 | 115.25 | 43.40 | 0.6280 | 3756.80 | -2513.4

Sweet! Look at that! We actually calculated the mean of every variable for each frog. We have a DataFrame with the results. We can pull the mean impact force for a frog of interest using loc.

In [19]:
df_mean.loc['III', 'impact force (mN)']
Out[19]:
550.10000000000002
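Depending on your Pandas version, passing NumPy functions to apply() may emit warnings or treat non-numeric columns such as date differently than in the output shown here. A GroupBy object also has a built-in mean() method, and the sketch below uses it on a small made-up DataFrame. The data values are assumptions for illustration only.

```python
import pandas as pd

# Made-up data with one non-numeric column besides the grouping column
toy = pd.DataFrame({
    'ID': ['I', 'I', 'II', 'II'],
    'date': ['2013_02_26', '2013_03_01', '2013_03_05', '2013_03_12'],
    'impact force (mN)': [100.0, 300.0, 50.0, 70.0],
})

# Group by ID and average only the numeric columns;
# the string-valued date column is left out of the result
toy_mean = toy.groupby('ID').mean(numeric_only=True)
```

The result is again a DataFrame indexed by ID, so toy_mean.loc['I', 'impact force (mN)'] works just as in the cell above.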

Now, what if we want more information, like both the mean and the median? We can apply multiple functions to a GroupBy object using the agg() method. The argument of this method is a list of functions you want to apply.

In [20]:
df_mean_median = gb.agg([np.mean, np.median])
df_mean_median
Out[20]:
ID | trial number | impact force (mN) | impact time (ms) | impact force / body weight | adhesive force (mN) | ... | total contact area (mm2) | contact area without mucus (mm2) | contact area with mucus / contact area without mucus | contact pressure (Pa) | adhesive strength (Pa)
   | mean median | mean median | mean median | mean median | mean median | ... | mean median | mean median | mean median | mean median | mean median
I | 2.85 3 | 1530.20 1550.5 | 39.85 40 | 2.4715 2.505 | -658.40 -664.5 | ... | 264.35 262.5 | 135.80 126.5 | 0.3950 0.325 | 7644.75 5697.0 | -3253.2 -2051.5
II | 2.15 2 | 707.35 573.0 | 37.65 33 | 1.6640 1.345 | -462.30 -517.0 | ... | 195.25 198.0 | 31.75 25.0 | 0.7825 0.830 | 4568.80 2632.5 | -3339.3 -2066.5
III | 2.25 2 | 550.10 544.0 | 43.35 34 | 4.4260 4.380 | -206.75 -201.5 | ... | 91.05 102.5 | 34.65 23.0 | 0.4705 0.450 | 8322.30 5558.5 | -2917.6 -2454.5
IV | 2.35 2 | 419.10 460.5 | 35.40 33 | 3.1200 3.425 | -263.60 -233.5 | ... | 115.25 126.0 | 43.40 40.5 | 0.6280 0.770 | 3756.80 3311.5 | -2513.4 -2165.5

4 rows × 26 columns

We now have a MultiIndex for the column headers. We can index with tuples.

In [21]:
df_mean_median.loc[:, ('impact force (mN)', 'median')]
Out[21]:
ID
I      1550.5
II      573.0
III     544.0
IV      460.5
Name: (impact force (mN), median), dtype: float64
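When only a few variables are of interest, one option is to select those columns from the GroupBy object before aggregating. The sketch below does this on made-up data and shows two ways of indexing the resulting MultiIndex columns. The values are assumptions chosen so the results are easy to check by hand.

```python
import pandas as pd

# Made-up data; three measurements for each of two frogs
toy = pd.DataFrame({
    'ID': ['I', 'I', 'I', 'II', 'II', 'II'],
    'impact force (mN)': [1.0, 2.0, 6.0, 10.0, 20.0, 60.0],
    'adhesive force (mN)': [-1.0, -2.0, -3.0, -4.0, -5.0, -6.0],
})

# Select the columns of interest, then apply several functions by name
cols = ['impact force (mN)', 'adhesive force (mN)']
toy_agg = toy.groupby('ID')[cols].agg(['mean', 'median'])

# A (variable, statistic) tuple gives a Series indexed by ID
impact_median = toy_agg.loc[:, ('impact force (mN)', 'median')]

# The top-level label alone gives both statistics for that variable
impact_stats = toy_agg['impact force (mN)']
```

Here the functions are given as strings ('mean', 'median'), which agg() accepts in place of the NumPy functions used above.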

Now, let's practice with groupby().

a) Compute the standard deviation of the impact forces for each frog.

b) Write a function, coeff_of_var(data), which computes the coefficient of variation of a data set. This is the standard deviation divided by the absolute value of the mean.

c) Compute the coefficient of variation of the impact forces and adhesive forces for each frog.

d) And now, finally, compute a DataFrame that has the mean, median, standard deviation, and coefficient of variation of the impact forces and adhesive forces for each frog.
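As a hint for parts (b) through (d), agg() also accepts functions you write yourself, mixed freely with built-in ones. The sketch below uses a hypothetical helper, value_range(), on made-up data; it is not a solution to the exercise, only an illustration of the mechanism.

```python
import numpy as np
import pandas as pd

def value_range(data):
    """Hypothetical helper: largest value minus smallest value."""
    return np.max(data) - np.min(data)

# Made-up data; three measurements for each of two frogs
toy = pd.DataFrame({
    'ID': ['I', 'I', 'I', 'II', 'II', 'II'],
    'impact force (mN)': [1.0, 2.0, 6.0, 10.0, 20.0, 60.0],
})

# Mix a built-in aggregation (by name) with a user-defined function;
# the column for the custom function is named after the function
toy_stats = toy.groupby('ID')['impact force (mN)'].agg(['mean', value_range])
```

When computing standard deviations, keep in mind that np.std() defaults to ddof=0 while the Pandas std() method defaults to ddof=1, so the two can give different numbers for the same data.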