Lesson 30: Extracting data of interest¶

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

# NumPy of course
import numpy as np

# Pandas, conventionally imported as pd
import pandas as pd

# This is how we import the module of Matplotlib we'll be using
import matplotlib.pyplot as plt

# Seaborn makes plots pretty!
import seaborn as sns

# PyBeeswarm for some better data display
import beeswarm

# Magic function to make matplotlib inline; other style specs must come AFTER
%matplotlib inline

# This enables SVG graphics inline (only use with static plots (non-Bokeh))
%config InlineBackend.figure_format = 'svg'

# Set JB's favorite Seaborn settings
rc={'lines.linewidth': 2, 'axes.labelsize': 18, 'axes.titlesize': 18, 
    'axes.facecolor': 'DFDFE5'}
sns.set_context('notebook', rc=rc)

In the last lesson, we learned about Pandas and dipped our toe in to see its power. In this lesson, we will work with a more complicated data set and use Pandas to handle it and pull out what we need.

The data set¶

The data set comes from Kleinteich and Gorb, Sci. Rep., 4, 2014. They measured several properties about the tongue strikes of horned frogs. Let's rake a look at the data set, which you can download here.

# Use ! to invoke a shell command. Use head to look at top 20 lines of file.
!head -n 20 ../data/frog_tongue_adhesion.csv

The first lines all begin with # signs, signifying that they are comments and not data. They do give important information, though, such as the meaning of the ID data. The ID refers to which specific frog was tested.

Immediately after the comments, we have a row of comma-separated headers. This row sets the number of columns in this data set and labels the meaning of the columns. So, we see that the first column is the date of the experiment, the second column is the ID of the frog, the third is the trial number, and so on.

After this row, each row repesents a single experiment where the frog struck the target. So, these data are already in tidy format. Let's go ahead and load the data into a DataFrame.

# Load the data
df = pd.read_csv('../data/frog_tongue_adhesion.csv', comment='#')

# Take a look
df

Selecting data¶

We can now access various subsets of the data. For example, let's say we are only interested in strong stikes, i.e., those with an impact force above one Newton. We just use Boolean slicing to get that out.

# Slice out big forces
df_big_force = df[df['impact force (mN)'] > 1000]

# Look at it
df_big_force

Notice that the indices of the individual measurements did not change! This newly formed DataFrame does not have an index 4, for example. You can think of the indices as labels on experiments/observations, not as the ordering in an array. In fact, ordering is not really relevant at all in a DataFrame.

We can also select a single experiment (very convenient that the DataFrame is tidy!). As we learned last time, we use the ix attribute of the DataFrame.

df.ix[42]

date                                                    2013_05_27
ID                                                             III
trial number                                                     3
impact force (mN)                                              324
impact time (ms)                                               105
impact force / body weight                                    2.61
adhesive force (mN)                                           -172
time frog pulls on target (ms)                                 619
adhesive force / body weight                                  1.38
adhesive impulse (N-s)                                      -0.079
total contact area (mm2)                                        55
contact area without mucus (mm2)                                23
contact area with mucus / contact area without mucus          0.37
contact pressure (Pa)                                         5946
adhesive strength (Pa)                                       -3149
Name: 42, dtype: object

Conveniently, we get all the data and labels we need.

We can select multiple columns by giving a list of column headers.

df[['impact force (mN)', 'adhesive force (mN)']]

Finding correlations¶

Now that we know how to slice out columns, we can start to find correlations. For example, we might think that the impact force and the adhesive strength might be correlated. Let's make a plot to check.

# Using Pandas automatically gives axis labels
df.plot(x='impact force (mN)', y='adhesive force (mN)', kind='scatter')

<matplotlib.axes._subplots.AxesSubplot at 0x118c262b0>

There does, in fact, seem to be some correlation. We could try any pair.

df.plot(x='total contact area (mm2)', y='adhesive force (mN)', kind='scatter')

<matplotlib.axes._subplots.AxesSubplot at 0x118f3fb00>

We can quickly compute the Pearson correlation between all pairs of data with the corr() method of the DataFrame.

df.corr()

Shortcut names¶

It is getting kind of cumbersome indexing with the long column headings. The headings are nonetheless useful, since they are descriptive and we never have problems losing track of units. But, let's say we wanted change impact force (mN) to impf for easier indexing. We can use the rename() method of DataFrames. We just pass in a dictionary of columns we want to rename.

# Rename the impact force column
df = df.rename(columns={'impact force (mN)': 'impf'})

Does it matter which frog?¶

Instead of pooling all the data together, we might want to look at each frog. We'll create a beeswarm plot of the impact forces generated from each of the four frogs.

# Separate impact forces by frog
list_of_impfs = [df['impf'][df.ID=='I'], df['impf'][df.ID=='II'],
                 df['impf'][df.ID=='III'], df['impf'][df.ID=='IV']]

# Generate a beeswarm plot
beeswarm.beeswarm(list_of_impfs, labels=['I', 'II', 'III', 'IV'])
plt.ylabel('impact force (mN)')

<matplotlib.text.Text at 0x119414518>

We see that frog I was a big hitter, but is pretty inconsistent. The juvenile frogs, III and IV, strike with much less force. They're much smaller, and are actually even a different species.

This plot is very simple, and using Pandas DataFrames made it very easy to generate.

We may want to check if frog I is moody from day to day. So, we can plot the impact force as a function of date. We first have to convert the strings like '2013_02_26' to dates that Pandas recognized.

# Convert to dates
df['date'] = pd.to_datetime(df['date'], format='%Y_%m_%d')

# Make a plot for frog I of impact force vs date
df[df['ID']=='I'].plot(x='date', y='impf', kind='line', marker='o', linestyle='',
                      legend=False)
plt.ylabel('impact force (mN)')

<matplotlib.text.Text at 0x11a424eb8>

So, there does not appear to be any correlation between what day frog I strikes and how hard he or she strikes. Remember, it was as simple as using the df[df['ID']=='I'] syntax to pull out these data! Pandas is a powerful ally.

	date	ID	trial number	impact force (mN)	impact time (ms)	impact force / body weight	adhesive force (mN)	time frog pulls on target (ms)	adhesive force / body weight	adhesive impulse (N-s)	total contact area (mm2)	contact area without mucus (mm2)	contact area with mucus / contact area without mucus	contact pressure (Pa)	adhesive strength (Pa)
0	2013_02_26	I	3	1205	46	1.95	-785	884	1.27	-0.290	387	70	0.82	3117	-2030
1	2013_02_26	I	4	2527	44	4.08	-983	248	1.59	-0.181	101	94	0.07	24923	-9695
2	2013_03_01	I	1	1745	34	2.82	-850	211	1.37	-0.157	83	79	0.05	21020	-10239
3	2013_03_01	I	2	1556	41	2.51	-455	1025	0.74	-0.170	330	158	0.52	4718	-1381
4	2013_03_01	I	3	493	36	0.80	-974	499	1.57	-0.423	245	216	0.12	2012	-3975
5	2013_03_01	I	4	2276	31	3.68	-592	969	0.96	-0.176	341	106	0.69	6676	-1737
6	2013_03_05	I	1	556	43	0.90	-512	835	0.83	-0.285	359	110	0.69	1550	-1427
7	2013_03_05	I	2	1928	46	3.11	-804	508	1.30	-0.285	246	178	0.28	7832	-3266
8	2013_03_05	I	3	2641	50	4.27	-690	491	1.12	-0.239	269	224	0.17	9824	-2568
9	2013_03_05	I	4	1897	41	3.06	-462	839	0.75	-0.328	266	176	0.34	7122	-1733
10	2013_03_12	I	1	1891	40	3.06	-766	1069	1.24	-0.380	408	33	0.92	4638	-1879
11	2013_03_12	I	2	1545	48	2.50	-715	649	1.15	-0.298	141	112	0.21	10947	-5064
12	2013_03_12	I	3	1307	29	2.11	-613	1845	0.99	-0.768	455	92	0.80	2874	-1348
13	2013_03_12	I	4	1692	31	2.73	-677	917	1.09	-0.457	186	129	0.31	9089	-3636
14	2013_03_12	I	5	1543	38	2.49	-528	750	0.85	-0.353	153	148	0.03	10095	-3453
15	2013_03_15	I	1	1282	31	2.07	-452	785	0.73	-0.253	290	105	0.64	4419	-1557
16	2013_03_15	I	2	775	34	1.25	-430	837	0.70	-0.276	257	124	0.52	3019	-1677
17	2013_03_15	I	3	2032	60	3.28	-652	486	1.05	-0.257	147	134	0.09	13784	-4425
18	2013_03_15	I	4	1240	34	2.00	-692	906	1.12	-0.317	364	260	0.28	3406	-1901
19	2013_03_15	I	5	473	40	0.76	-536	1218	0.87	-0.382	259	168	0.35	1830	-2073
20	2013_03_19	II	1	1612	18	3.79	-655	3087	1.54	-0.385	348	15	0.96	4633	-1881
21	2013_03_19	II	2	605	55	1.42	-292	1261	0.69	-0.294	248	24	0.90	2441	-1177
22	2013_03_19	II	3	327	51	0.77	-246	1508	0.58	-0.282	130	34	0.74	2517	-1894
23	2013_03_19	II	4	946	59	2.23	-245	1841	0.58	-0.340	106	26	0.76	8893	-2301
24	2013_03_21	II	1	541	33	1.27	-553	3126	1.30	-0.432	276	16	0.94	1959	-2004
25	2013_03_21	II	2	1539	43	3.62	-664	741	1.56	-0.046	85	24	0.72	18073	-7802
26	2013_03_21	II	3	529	28	1.24	-261	2482	0.61	-0.414	325	33	0.90	1627	-803
27	2013_03_21	II	4	628	31	1.48	-691	998	1.63	-0.071	242	67	0.72	2600	-2860
28	2013_03_25	II	1	1453	72	3.42	-92	1652	0.22	-0.008	136	0	1.00	10645	-678
29	2013_03_25	II	2	297	42	0.70	-566	936	1.33	-0.084	126	4	0.97	2367	-4506
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
50	2013_06_18	III	3	458	30	3.69	-259	910	2.08	-0.194	91	88	0.15	5048	-2855
51	2013_06_18	III	4	626	16	5.04	-231	550	1.86	-0.042	82	23	0.01	7633	-2819
52	2013_06_21	III	1	621	27	4.99	-267	2081	2.14	-0.183	120	58	0.90	5152	-2213
53	2013_06_21	III	2	544	30	4.38	-178	376	1.43	-0.034	19	17	0.05	28641	-9364
54	2013_06_21	III	3	535	35	4.30	-123	289	0.99	-0.029	21	29	0.05	25471	-5843
55	2013_06_21	III	4	385	39	3.09	-151	607	1.22	-0.082	31	126	0.03	12409	-4882
56	2013_06_26	III	1	401	36	3.23	-127	2932	1.02	-0.215	142	12	0.86	2835	-896
57	2013_06_26	III	2	614	34	4.94	-372	680	2.99	-0.140	72	1	0.42	8475	-5136
58	2013_06_26	III	3	665	40	5.35	-236	685	1.90	-0.118	129	0	0.16	5171	-1834
59	2013_06_26	III	4	488	34	3.93	-390	1308	3.14	-0.208	112	58	0.39	4376	-3492
60	2013_05_27	IV	2	172	26	1.28	-456	462	3.40	-0.050	133	0	0.88	1297	-3440
61	2013_05_27	IV	3	142	20	1.05	-193	250	1.44	-0.047	57	74	0.83	2498	-3400
62	2013_05_27	IV	4	37	55	0.28	-236	743	1.76	-0.119	51	44	0.54	735	-4647
63	2013_05_30	IV	1	453	38	3.37	-225	844	1.68	-0.110	142	108	0.48	3177	-1581
64	2013_05_30	IV	2	355	31	2.64	-217	728	1.61	-0.023	174	39	0.98	2037	-1245
65	2013_05_30	IV	3	22	33	0.17	-161	472	1.20	-0.052	56	4	0.02	397	-2866
66	2013_06_03	IV	1	502	74	3.74	-139	959	1.04	-0.089	65	77	0.91	7713	-2141
67	2013_06_11	IV	1	273	26	2.03	-264	844	1.97	-0.160	124	81	0.29	2205	-2136
68	2013_06_11	IV	2	720	27	5.36	-342	1515	2.55	-0.226	137	0	0.83	5259	-2497
69	2013_06_11	IV	3	582	33	4.34	-231	279	1.72	-0.033	60	4	0.03	9705	-3847
70	2013_06_11	IV	4	198	23	1.47	-209	1427	1.55	-0.151	110	69	0.84	1793	-1889
71	2013_06_14	IV	1	198	6	1.47	-292	2874	2.17	-0.232	145	50	0.99	1369	-2018
72	2013_06_18	IV	1	597	29	4.44	-339	4251	2.53	-0.281	191	12	1.00	3116	-1772
73	2013_06_18	IV	2	516	31	3.84	-371	626	2.76	-0.094	83	18	0.12	6184	-4447
74	2013_06_18	IV	3	815	34	6.07	-331	1254	2.47	-0.077	151	20	0.71	5386	-2190
75	2013_06_18	IV	4	402	38	3.00	-302	986	2.25	-0.122	117	30	0.07	3446	-2591
76	2013_06_21	IV	1	605	39	4.50	-216	1627	1.61	-0.139	123	20	1.00	4928	-1759
77	2013_06_21	IV	2	711	76	5.30	-163	2021	1.21	-0.217	129	42	0.97	5498	-1257
78	2013_06_21	IV	3	614	33	4.57	-367	1366	2.73	-0.198	128	108	0.46	4776	-2857
79	2013_06_21	IV	4	468	36	3.48	-218	1269	1.63	-0.122	129	68	0.61	3617	-1688

	date	ID	trial number	impact force (mN)	impact time (ms)	impact force / body weight	adhesive force (mN)	time frog pulls on target (ms)	adhesive force / body weight	adhesive impulse (N-s)	total contact area (mm2)	contact area without mucus (mm2)	contact area with mucus / contact area without mucus	contact pressure (Pa)	adhesive strength (Pa)
0	2013_02_26	I	3	1205	46	1.95	-785	884	1.27	-0.290	387	70	0.82	3117	-2030
1	2013_02_26	I	4	2527	44	4.08	-983	248	1.59	-0.181	101	94	0.07	24923	-9695
2	2013_03_01	I	1	1745	34	2.82	-850	211	1.37	-0.157	83	79	0.05	21020	-10239
3	2013_03_01	I	2	1556	41	2.51	-455	1025	0.74	-0.170	330	158	0.52	4718	-1381
5	2013_03_01	I	4	2276	31	3.68	-592	969	0.96	-0.176	341	106	0.69	6676	-1737
7	2013_03_05	I	2	1928	46	3.11	-804	508	1.30	-0.285	246	178	0.28	7832	-3266
8	2013_03_05	I	3	2641	50	4.27	-690	491	1.12	-0.239	269	224	0.17	9824	-2568
9	2013_03_05	I	4	1897	41	3.06	-462	839	0.75	-0.328	266	176	0.34	7122	-1733
10	2013_03_12	I	1	1891	40	3.06	-766	1069	1.24	-0.380	408	33	0.92	4638	-1879
11	2013_03_12	I	2	1545	48	2.50	-715	649	1.15	-0.298	141	112	0.21	10947	-5064
12	2013_03_12	I	3	1307	29	2.11	-613	1845	0.99	-0.768	455	92	0.80	2874	-1348
13	2013_03_12	I	4	1692	31	2.73	-677	917	1.09	-0.457	186	129	0.31	9089	-3636
14	2013_03_12	I	5	1543	38	2.49	-528	750	0.85	-0.353	153	148	0.03	10095	-3453
15	2013_03_15	I	1	1282	31	2.07	-452	785	0.73	-0.253	290	105	0.64	4419	-1557
17	2013_03_15	I	3	2032	60	3.28	-652	486	1.05	-0.257	147	134	0.09	13784	-4425
18	2013_03_15	I	4	1240	34	2.00	-692	906	1.12	-0.317	364	260	0.28	3406	-1901
20	2013_03_19	II	1	1612	18	3.79	-655	3087	1.54	-0.385	348	15	0.96	4633	-1881
25	2013_03_21	II	2	1539	43	3.62	-664	741	1.56	-0.046	85	24	0.72	18073	-7802
28	2013_03_25	II	1	1453	72	3.42	-92	1652	0.22	-0.008	136	0	1.00	10645	-678
34	2013_04_03	II	1	1182	28	2.78	-522	1197	1.23	-0.118	281	0	1.00	4213	-1860

	impact force (mN)	adhesive force (mN)
0	1205	-785
1	2527	-983
2	1745	-850
3	1556	-455
4	493	-974
5	2276	-592
6	556	-512
7	1928	-804
8	2641	-690
9	1897	-462
10	1891	-766
11	1545	-715
12	1307	-613
13	1692	-677
14	1543	-528
15	1282	-452
16	775	-430
17	2032	-652
18	1240	-692
19	473	-536
20	1612	-655
21	605	-292
22	327	-246
23	946	-245
24	541	-553
25	1539	-664
26	529	-261
27	628	-691
28	1453	-92
29	297	-566
...	...	...
50	458	-259
51	626	-231
52	621	-267
53	544	-178
54	535	-123
55	385	-151
56	401	-127
57	614	-372
58	665	-236
59	488	-390
60	172	-456
61	142	-193
62	37	-236
63	453	-225
64	355	-217
65	22	-161
66	502	-139
67	273	-264
68	720	-342
69	582	-231
70	198	-209
71	198	-292
72	597	-339
73	516	-371
74	815	-331
75	402	-302
76	605	-216
77	711	-163
78	614	-367
79	468	-218

	trial number	impact force (mN)	impact time (ms)	impact force / body weight	adhesive force (mN)	time frog pulls on target (ms)	adhesive force / body weight	adhesive impulse (N-s)	total contact area (mm2)	contact area without mucus (mm2)	contact area with mucus / contact area without mucus	contact pressure (Pa)	adhesive strength (Pa)
trial number	1.000000	0.081059	-0.012223	-0.165235	-0.144249	-0.300896	-0.000758	-0.152892	-0.070046	0.338473	-0.487218	0.105460	-0.228193
impact force (mN)	0.081059	1.000000	0.065284	0.233450	-0.581344	-0.121341	-0.282177	-0.328656	0.436910	0.447339	-0.163551	0.436010	-0.087942
impact time (ms)	-0.012223	0.065284	1.000000	0.060478	0.166485	-0.224467	-0.276984	0.184928	-0.151079	-0.054654	-0.001410	0.088254	-0.010054
impact force / body weight	-0.165235	0.233450	0.060478	1.000000	0.313570	0.055374	0.327074	0.303916	-0.291663	-0.190505	-0.084925	0.397020	0.089783
adhesive force (mN)	-0.144249	-0.581344	0.166485	0.313570	1.000000	0.172685	-0.017410	0.518693	-0.517483	-0.488566	0.194437	-0.098643	0.336116
time frog pulls on target (ms)	-0.300896	-0.121341	-0.224467	0.055374	0.172685	1.000000	0.006804	-0.369147	0.282509	-0.262914	0.620117	-0.353474	0.434311
adhesive force / body weight	-0.000758	-0.282177	-0.276984	0.327074	-0.017410	0.006804	1.000000	0.200359	-0.316363	-0.237184	-0.146312	-0.097282	-0.154410
adhesive impulse (N-s)	-0.152892	-0.328656	0.184928	0.303916	0.518693	-0.369147	0.200359	1.000000	-0.684423	-0.434937	-0.058184	0.224038	-0.187740
total contact area (mm2)	-0.070046	0.436910	-0.151079	-0.291663	-0.517483	0.282509	-0.316363	-0.684423	1.000000	0.383562	0.310887	-0.390094	0.417100
contact area without mucus (mm2)	0.338473	0.447339	-0.054654	-0.190505	-0.488566	-0.262914	-0.237184	-0.434937	0.383562	1.000000	-0.394482	0.026032	0.007942
contact area with mucus / contact area without mucus	-0.487218	-0.163551	-0.001410	-0.084925	0.194437	0.620117	-0.146312	-0.058184	0.310887	-0.394482	1.000000	-0.454396	0.506204
contact pressure (Pa)	0.105460	0.436010	0.088254	0.397020	-0.098643	-0.353474	-0.097282	0.224038	-0.390094	0.026032	-0.454396	1.000000	-0.600649
adhesive strength (Pa)	-0.228193	-0.087942	-0.010054	0.089783	0.336116	0.434311	-0.154410	-0.187740	0.417100	0.007942	0.506204	-0.600649	1.000000