Lesson 31: Extracting data of interest

(c) 2016 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [2]:
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
rc={'lines.linewidth': 2, 'axes.labelsize': 18, 'axes.titlesize': 18}
sns.set(rc=rc)

# The following is specific Jupyter notebooks
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}

In the last lesson, we learned about Pandas and dipped our toe in to see its power. In this lesson, we will work with a more complicated data set and use Pandas to handle it and pull out what we need.

The data set

The data set comes from Kleinteich and Gorb, Sci. Rep., 4, 2014, and was featured in the New York Times. They measured several properties about the tongue strikes of horned frogs. Let's take a look at the data set, which is in the file ~git/data/frog_tongue_adhesion.csv.

In [3]:
!head -n 20 data/frog_tongue_adhesion.csv



















The first lines all begin with # signs, signifying that they are comments and not data. They do give important information, though, such as the meaning of the ID data. The ID refers to which specific frog was tested.

Immediately after the comments, we have a row of comma-separated headers. This row sets the number of columns in this data set and labels the meaning of the columns. So, we see that the first column is the date of the experiment, the second column is the ID of the frog, the third is the trial number, and so on.

After this row, each row represents a single experiment where the frog struck the target. So, these data are already in tidy format. Let's go ahead and load the data into a DataFrame.

In [4]:
# Load the data
df = pd.read_csv('data/frog_tongue_adhesion.csv', comment='#')

# Take a look
df
Out[4]:
date ID trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa)
0 2013_02_26 I 3 1205 46 1.95 -785 884 1.27 -0.290 387 70 0.82 3117 -2030
1 2013_02_26 I 4 2527 44 4.08 -983 248 1.59 -0.181 101 94 0.07 24923 -9695
2 2013_03_01 I 1 1745 34 2.82 -850 211 1.37 -0.157 83 79 0.05 21020 -10239
3 2013_03_01 I 2 1556 41 2.51 -455 1025 0.74 -0.170 330 158 0.52 4718 -1381
4 2013_03_01 I 3 493 36 0.80 -974 499 1.57 -0.423 245 216 0.12 2012 -3975
5 2013_03_01 I 4 2276 31 3.68 -592 969 0.96 -0.176 341 106 0.69 6676 -1737
6 2013_03_05 I 1 556 43 0.90 -512 835 0.83 -0.285 359 110 0.69 1550 -1427
7 2013_03_05 I 2 1928 46 3.11 -804 508 1.30 -0.285 246 178 0.28 7832 -3266
8 2013_03_05 I 3 2641 50 4.27 -690 491 1.12 -0.239 269 224 0.17 9824 -2568
9 2013_03_05 I 4 1897 41 3.06 -462 839 0.75 -0.328 266 176 0.34 7122 -1733
10 2013_03_12 I 1 1891 40 3.06 -766 1069 1.24 -0.380 408 33 0.92 4638 -1879
11 2013_03_12 I 2 1545 48 2.50 -715 649 1.15 -0.298 141 112 0.21 10947 -5064
12 2013_03_12 I 3 1307 29 2.11 -613 1845 0.99 -0.768 455 92 0.80 2874 -1348
13 2013_03_12 I 4 1692 31 2.73 -677 917 1.09 -0.457 186 129 0.31 9089 -3636
14 2013_03_12 I 5 1543 38 2.49 -528 750 0.85 -0.353 153 148 0.03 10095 -3453
15 2013_03_15 I 1 1282 31 2.07 -452 785 0.73 -0.253 290 105 0.64 4419 -1557
16 2013_03_15 I 2 775 34 1.25 -430 837 0.70 -0.276 257 124 0.52 3019 -1677
17 2013_03_15 I 3 2032 60 3.28 -652 486 1.05 -0.257 147 134 0.09 13784 -4425
18 2013_03_15 I 4 1240 34 2.00 -692 906 1.12 -0.317 364 260 0.28 3406 -1901
19 2013_03_15 I 5 473 40 0.76 -536 1218 0.87 -0.382 259 168 0.35 1830 -2073
20 2013_03_19 II 1 1612 18 3.79 -655 3087 1.54 -0.385 348 15 0.96 4633 -1881
21 2013_03_19 II 2 605 55 1.42 -292 1261 0.69 -0.294 248 24 0.90 2441 -1177
22 2013_03_19 II 3 327 51 0.77 -246 1508 0.58 -0.282 130 34 0.74 2517 -1894
23 2013_03_19 II 4 946 59 2.23 -245 1841 0.58 -0.340 106 26 0.76 8893 -2301
24 2013_03_21 II 1 541 33 1.27 -553 3126 1.30 -0.432 276 16 0.94 1959 -2004
25 2013_03_21 II 2 1539 43 3.62 -664 741 1.56 -0.046 85 24 0.72 18073 -7802
26 2013_03_21 II 3 529 28 1.24 -261 2482 0.61 -0.414 325 33 0.90 1627 -803
27 2013_03_21 II 4 628 31 1.48 -691 998 1.63 -0.071 242 67 0.72 2600 -2860
28 2013_03_25 II 1 1453 72 3.42 -92 1652 0.22 -0.008 136 0 1.00 10645 -678
29 2013_03_25 II 2 297 42 0.70 -566 936 1.33 -0.084 126 4 0.97 2367 -4506
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
50 2013_06_18 III 3 458 30 3.69 -259 910 2.08 -0.194 91 88 0.15 5048 -2855
51 2013_06_18 III 4 626 16 5.04 -231 550 1.86 -0.042 82 23 0.01 7633 -2819
52 2013_06_21 III 1 621 27 4.99 -267 2081 2.14 -0.183 120 58 0.90 5152 -2213
53 2013_06_21 III 2 544 30 4.38 -178 376 1.43 -0.034 19 17 0.05 28641 -9364
54 2013_06_21 III 3 535 35 4.30 -123 289 0.99 -0.029 21 29 0.05 25471 -5843
55 2013_06_21 III 4 385 39 3.09 -151 607 1.22 -0.082 31 126 0.03 12409 -4882
56 2013_06_26 III 1 401 36 3.23 -127 2932 1.02 -0.215 142 12 0.86 2835 -896
57 2013_06_26 III 2 614 34 4.94 -372 680 2.99 -0.140 72 1 0.42 8475 -5136
58 2013_06_26 III 3 665 40 5.35 -236 685 1.90 -0.118 129 0 0.16 5171 -1834
59 2013_06_26 III 4 488 34 3.93 -390 1308 3.14 -0.208 112 58 0.39 4376 -3492
60 2013_05_27 IV 2 172 26 1.28 -456 462 3.40 -0.050 133 0 0.88 1297 -3440
61 2013_05_27 IV 3 142 20 1.05 -193 250 1.44 -0.047 57 74 0.83 2498 -3400
62 2013_05_27 IV 4 37 55 0.28 -236 743 1.76 -0.119 51 44 0.54 735 -4647
63 2013_05_30 IV 1 453 38 3.37 -225 844 1.68 -0.110 142 108 0.48 3177 -1581
64 2013_05_30 IV 2 355 31 2.64 -217 728 1.61 -0.023 174 39 0.98 2037 -1245
65 2013_05_30 IV 3 22 33 0.17 -161 472 1.20 -0.052 56 4 0.02 397 -2866
66 2013_06_03 IV 1 502 74 3.74 -139 959 1.04 -0.089 65 77 0.91 7713 -2141
67 2013_06_11 IV 1 273 26 2.03 -264 844 1.97 -0.160 124 81 0.29 2205 -2136
68 2013_06_11 IV 2 720 27 5.36 -342 1515 2.55 -0.226 137 0 0.83 5259 -2497
69 2013_06_11 IV 3 582 33 4.34 -231 279 1.72 -0.033 60 4 0.03 9705 -3847
70 2013_06_11 IV 4 198 23 1.47 -209 1427 1.55 -0.151 110 69 0.84 1793 -1889
71 2013_06_14 IV 1 198 6 1.47 -292 2874 2.17 -0.232 145 50 0.99 1369 -2018
72 2013_06_18 IV 1 597 29 4.44 -339 4251 2.53 -0.281 191 12 1.00 3116 -1772
73 2013_06_18 IV 2 516 31 3.84 -371 626 2.76 -0.094 83 18 0.12 6184 -4447
74 2013_06_18 IV 3 815 34 6.07 -331 1254 2.47 -0.077 151 20 0.71 5386 -2190
75 2013_06_18 IV 4 402 38 3.00 -302 986 2.25 -0.122 117 30 0.07 3446 -2591
76 2013_06_21 IV 1 605 39 4.50 -216 1627 1.61 -0.139 123 20 1.00 4928 -1759
77 2013_06_21 IV 2 711 76 5.30 -163 2021 1.21 -0.217 129 42 0.97 5498 -1257
78 2013_06_21 IV 3 614 33 4.57 -367 1366 2.73 -0.198 128 108 0.46 4776 -2857
79 2013_06_21 IV 4 468 36 3.48 -218 1269 1.63 -0.122 129 68 0.61 3617 -1688

80 rows × 15 columns

Selecting data

We can now access various subsets of the data. For example, let's say we are only interested in strong strikes, i.e., those with an impact force above one Newton. We just use Boolean slicing to get that out.

In [5]:
# Slice out big forces
df_big_force = df[df['impact force (mN)'] > 1000]

# Look at it
df_big_force
Out[5]:
date ID trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa)
0 2013_02_26 I 3 1205 46 1.95 -785 884 1.27 -0.290 387 70 0.82 3117 -2030
1 2013_02_26 I 4 2527 44 4.08 -983 248 1.59 -0.181 101 94 0.07 24923 -9695
2 2013_03_01 I 1 1745 34 2.82 -850 211 1.37 -0.157 83 79 0.05 21020 -10239
3 2013_03_01 I 2 1556 41 2.51 -455 1025 0.74 -0.170 330 158 0.52 4718 -1381
5 2013_03_01 I 4 2276 31 3.68 -592 969 0.96 -0.176 341 106 0.69 6676 -1737
7 2013_03_05 I 2 1928 46 3.11 -804 508 1.30 -0.285 246 178 0.28 7832 -3266
8 2013_03_05 I 3 2641 50 4.27 -690 491 1.12 -0.239 269 224 0.17 9824 -2568
9 2013_03_05 I 4 1897 41 3.06 -462 839 0.75 -0.328 266 176 0.34 7122 -1733
10 2013_03_12 I 1 1891 40 3.06 -766 1069 1.24 -0.380 408 33 0.92 4638 -1879
11 2013_03_12 I 2 1545 48 2.50 -715 649 1.15 -0.298 141 112 0.21 10947 -5064
12 2013_03_12 I 3 1307 29 2.11 -613 1845 0.99 -0.768 455 92 0.80 2874 -1348
13 2013_03_12 I 4 1692 31 2.73 -677 917 1.09 -0.457 186 129 0.31 9089 -3636
14 2013_03_12 I 5 1543 38 2.49 -528 750 0.85 -0.353 153 148 0.03 10095 -3453
15 2013_03_15 I 1 1282 31 2.07 -452 785 0.73 -0.253 290 105 0.64 4419 -1557
17 2013_03_15 I 3 2032 60 3.28 -652 486 1.05 -0.257 147 134 0.09 13784 -4425
18 2013_03_15 I 4 1240 34 2.00 -692 906 1.12 -0.317 364 260 0.28 3406 -1901
20 2013_03_19 II 1 1612 18 3.79 -655 3087 1.54 -0.385 348 15 0.96 4633 -1881
25 2013_03_21 II 2 1539 43 3.62 -664 741 1.56 -0.046 85 24 0.72 18073 -7802
28 2013_03_25 II 1 1453 72 3.42 -92 1652 0.22 -0.008 136 0 1.00 10645 -678
34 2013_04_03 II 1 1182 28 2.78 -522 1197 1.23 -0.118 281 0 1.00 4213 -1860

Notice that the indices of the individual measurements did not change! This newly formed DataFrame does not have an index 4, for example. You can think of the indices as labels on experiments/observations, not as the ordering in an array. In fact, ordering is not really relevant at all in a DataFrame.

We can also select a single experiment (very convenient that the DataFrame is tidy!). As we learned last time, we use .loc.

In [6]:
df.loc[42, :]
Out[6]:
date                                                    2013_05_27
ID                                                             III
trial number                                                     3
impact force (mN)                                              324
impact time (ms)                                               105
impact force / body weight                                    2.61
adhesive force (mN)                                           -172
time frog pulls on target (ms)                                 619
adhesive force / body weight                                  1.38
adhesive impulse (N-s)                                      -0.079
total contact area (mm2)                                        55
contact area without mucus (mm2)                                23
contact area with mucus / contact area without mucus          0.37
contact pressure (Pa)                                         5946
adhesive strength (Pa)                                       -3149
Name: 42, dtype: object

Conveniently, we get all the data and labels we need.

We can select multiple columns by giving a list of column headers.

In [7]:
df.loc[:, ['impact force (mN)', 'adhesive force (mN)']]
Out[7]:
impact force (mN) adhesive force (mN)
0 1205 -785
1 2527 -983
2 1745 -850
3 1556 -455
4 493 -974
5 2276 -592
6 556 -512
7 1928 -804
8 2641 -690
9 1897 -462
10 1891 -766
11 1545 -715
12 1307 -613
13 1692 -677
14 1543 -528
15 1282 -452
16 775 -430
17 2032 -652
18 1240 -692
19 473 -536
20 1612 -655
21 605 -292
22 327 -246
23 946 -245
24 541 -553
25 1539 -664
26 529 -261
27 628 -691
28 1453 -92
29 297 -566
... ... ...
50 458 -259
51 626 -231
52 621 -267
53 544 -178
54 535 -123
55 385 -151
56 401 -127
57 614 -372
58 665 -236
59 488 -390
60 172 -456
61 142 -193
62 37 -236
63 453 -225
64 355 -217
65 22 -161
66 502 -139
67 273 -264
68 720 -342
69 582 -231
70 198 -209
71 198 -292
72 597 -339
73 516 -371
74 815 -331
75 402 -302
76 605 -216
77 711 -163
78 614 -367
79 468 -218

80 rows × 2 columns

We can also do Boolean indexing using loc. Say we want Frog I's impact force and adhesive force.

In [8]:
df.loc[df['ID']=='I', ['impact force (mN)', 'adhesive force (mN)']]
Out[8]:
impact force (mN) adhesive force (mN)
0 1205 -785
1 2527 -983
2 1745 -850
3 1556 -455
4 493 -974
5 2276 -592
6 556 -512
7 1928 -804
8 2641 -690
9 1897 -462
10 1891 -766
11 1545 -715
12 1307 -613
13 1692 -677
14 1543 -528
15 1282 -452
16 775 -430
17 2032 -652
18 1240 -692
19 473 -536

This usage, with Boolean indexing and data selection is probably the most often used technique with Pandas. At least it is in my workflow.

Finding correlations

Now that we know how to slice out columns, we can start to find correlations. For example, we might think that the impact force and the adhesive strength might be correlated. Let's make a plot to check.

In [9]:
plt.plot(df['impact force (mN)'], df['adhesive force (mN)'], marker='.',
         linestyle='none')
plt.xlabel('impact force (mN)')
plt.ylabel('adhesive force (mN)')
Out[9]:
<matplotlib.text.Text at 0x117ac9898>

There does, in fact, seem to be some correlation. We could try any pair. As a trick to allow fast plotting, we will use the DataFrame's built-in plot() method, that allows us to quickly make a plot with the axes already labeled.

In [10]:
df.plot(x='total contact area (mm2)', y='adhesive force (mN)', kind='scatter')
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x119c5c128>

There are other slick methods built in to DataFrames. For example, we can quickly compute the Pearson correlation between all pairs of data with the corr() method of the DataFrame.

In [11]:
df.corr()
Out[11]:
trial number impact force (mN) impact time (ms) impact force / body weight adhesive force (mN) time frog pulls on target (ms) adhesive force / body weight adhesive impulse (N-s) total contact area (mm2) contact area without mucus (mm2) contact area with mucus / contact area without mucus contact pressure (Pa) adhesive strength (Pa)
trial number 1.000000 0.081059 -0.012223 -0.165235 -0.144249 -0.300896 -0.000758 -0.152892 -0.070046 0.338473 -0.487218 0.105460 -0.228193
impact force (mN) 0.081059 1.000000 0.065284 0.233450 -0.581344 -0.121341 -0.282177 -0.328656 0.436910 0.447339 -0.163551 0.436010 -0.087942
impact time (ms) -0.012223 0.065284 1.000000 0.060478 0.166485 -0.224467 -0.276984 0.184928 -0.151079 -0.054654 -0.001410 0.088254 -0.010054
impact force / body weight -0.165235 0.233450 0.060478 1.000000 0.313570 0.055374 0.327074 0.303916 -0.291663 -0.190505 -0.084925 0.397020 0.089783
adhesive force (mN) -0.144249 -0.581344 0.166485 0.313570 1.000000 0.172685 -0.017410 0.518693 -0.517483 -0.488566 0.194437 -0.098643 0.336116
time frog pulls on target (ms) -0.300896 -0.121341 -0.224467 0.055374 0.172685 1.000000 0.006804 -0.369147 0.282509 -0.262914 0.620117 -0.353474 0.434311
adhesive force / body weight -0.000758 -0.282177 -0.276984 0.327074 -0.017410 0.006804 1.000000 0.200359 -0.316363 -0.237184 -0.146312 -0.097282 -0.154410
adhesive impulse (N-s) -0.152892 -0.328656 0.184928 0.303916 0.518693 -0.369147 0.200359 1.000000 -0.684423 -0.434937 -0.058184 0.224038 -0.187740
total contact area (mm2) -0.070046 0.436910 -0.151079 -0.291663 -0.517483 0.282509 -0.316363 -0.684423 1.000000 0.383562 0.310887 -0.390094 0.417100
contact area without mucus (mm2) 0.338473 0.447339 -0.054654 -0.190505 -0.488566 -0.262914 -0.237184 -0.434937 0.383562 1.000000 -0.394482 0.026032 0.007942
contact area with mucus / contact area without mucus -0.487218 -0.163551 -0.001410 -0.084925 0.194437 0.620117 -0.146312 -0.058184 0.310887 -0.394482 1.000000 -0.454396 0.506204
contact pressure (Pa) 0.105460 0.436010 0.088254 0.397020 -0.098643 -0.353474 -0.097282 0.224038 -0.390094 0.026032 -0.454396 1.000000 -0.600649
adhesive strength (Pa) -0.228193 -0.087942 -0.010054 0.089783 0.336116 0.434311 -0.154410 -0.187740 0.417100 0.007942 0.506204 -0.600649 1.000000

Shortcut names

It is getting kind of cumbersome indexing with the long column headings. The headings are nonetheless useful, since they are descriptive and we never have problems losing track of units. But, let's say we wanted change impact force (mN) to impf for easier indexing. We can use the rename() method of DataFrames. We just pass in a dictionary of columns we want to rename.

In [12]:
# Rename the impact force column
df = df.rename(columns={'impact force (mN)': 'impf'})

We will explore the power of Pandas more in the next lesson, when we practice with DataFrames.