Exercise 5.2: Split-Apply-Combine of the frog data set¶

We will continue working with the frog tongue adhesion data set.

You’ll now practice your split-apply-combine skills. First load in the data set. Then,

a) Compute standard deviation of the impact forces for each frog.

b) Compute the coefficient of variation of the impact forces and adhesive forces for each frog.

c) Compute a data frame that has the mean, median, standard deviation, and coefficient of variation of the impact forces and adhesive forces for each frog.

d) Now tidy this data frame. It might help to read the documentation about melting.

Solution¶

[1]:

import numpy as np
import pandas as pd

Of course, we start by loading in the data frame.

[2]:

df = pd.read_csv('data/frog_tongue_adhesion.csv', comment='#')

a) To compute the standard deviation of impact forces for each frog, we first group by the frog ID and then apply the std() method to the GroupBy object.

[3]:

grouped = df.groupby('ID')

grouped['impact force (mN)'].std()

[3]:

ID
I      630.207952
II     424.573256
III    124.273849
IV     234.864328
Name: impact force (mN), dtype: float64

We could also do this by method chaining.

[4]:

df.groupby('ID')['impact force (mN)'].std()

[4]:

ID
I      630.207952
II     424.573256
III    124.273849
IV     234.864328
Name: impact force (mN), dtype: float64

b) We first write a function to compute the coefficient of variation. We saw this in our lesson on tidy data and split-apply-combine.

[5]:

def coeff_of_var(data):
    """Coefficient of variation."""
    return np.std(data) / np.abs(np.mean(data))

Next, we group the data frame with by the ID column.

[6]:

grouped = df.groupby('ID')

Finally, we select the columns of interest and apply the coeff_of_var() aggregating function.

[7]:

df_cv = grouped[['impact force (mN)', 'adhesive force (mN)']].agg(coeff_of_var)

Finally, we reset the index to get our resulting data frame.

[8]:

df_cv.reset_index()

[8]:

	ID	impact force (mN)	adhesive force (mN)
0	I	0.401419	0.247435
1	II	0.585033	0.429701
2	III	0.220191	0.415435
3	IV	0.546212	0.308042

Again, we could do this in one line by method chaining.

[9]:

df.groupby("ID")[["impact force (mN)", "adhesive force (mN)"]].agg(
    coeff_of_var
).reset_index()

[9]:

	ID	impact force (mN)	adhesive force (mN)
0	I	0.401419	0.247435
1	II	0.585033	0.429701
2	III	0.220191	0.415435
3	IV	0.546212	0.308042

c) Now we will apply all of the statistical functions to the impact force and adhesive force. This is as simple as using a list of aggregating functions in the agg() method of the GroupBy object.

[10]:

df_result = (
    df.groupby("ID")[["impact force (mN)", "adhesive force (mN)"]]
    .agg([np.mean, np.median, np.std, coeff_of_var])
    .reset_index()
)

# Take a look
df_result

[10]:

	ID	impact force (mN)				adhesive force (mN)
		mean	median	std	coeff_of_var	mean	median	std	coeff_of_var
0	I	1530.20	1550.5	630.207952	0.401419	-658.40	-664.5	167.143619	0.247435
1	II	707.35	573.0	424.573256	0.585033	-462.30	-517.0	203.811600	0.429701
2	III	550.10	544.0	124.273849	0.220191	-206.75	-201.5	88.122448	0.415435
3	IV	419.10	460.5	234.864328	0.546212	-263.60	-233.5	83.309442	0.308042

d) We can index these things using the MultiIndex of the columns, but we much prefer tidy data frames, which we can generate again use pd.melt(). The var_name kwarg specifies what we name the levels of the MultiIndex, and the id_vars kwarg tells us which column to keep as it is.

[11]:

# Melt the DataFrame to make it tidy
pd.melt(df_result, var_name=['quantity', 'statistic'], id_vars='ID')

[11]:

	ID	quantity	statistic	value
0	I	impact force (mN)	mean	1530.200000
1	II	impact force (mN)	mean	707.350000
2	III	impact force (mN)	mean	550.100000
3	IV	impact force (mN)	mean	419.100000
4	I	impact force (mN)	median	1550.500000
5	II	impact force (mN)	median	573.000000
6	III	impact force (mN)	median	544.000000
7	IV	impact force (mN)	median	460.500000
8	I	impact force (mN)	std	630.207952
9	II	impact force (mN)	std	424.573256
10	III	impact force (mN)	std	124.273849
11	IV	impact force (mN)	std	234.864328
12	I	impact force (mN)	coeff_of_var	0.401419
13	II	impact force (mN)	coeff_of_var	0.585033
14	III	impact force (mN)	coeff_of_var	0.220191
15	IV	impact force (mN)	coeff_of_var	0.546212
16	I	adhesive force (mN)	mean	-658.400000
17	II	adhesive force (mN)	mean	-462.300000
18	III	adhesive force (mN)	mean	-206.750000
19	IV	adhesive force (mN)	mean	-263.600000
20	I	adhesive force (mN)	median	-664.500000
21	II	adhesive force (mN)	median	-517.000000
22	III	adhesive force (mN)	median	-201.500000
23	IV	adhesive force (mN)	median	-233.500000
24	I	adhesive force (mN)	std	167.143619
25	II	adhesive force (mN)	std	203.811600
26	III	adhesive force (mN)	std	88.122448
27	IV	adhesive force (mN)	std	83.309442
28	I	adhesive force (mN)	coeff_of_var	0.247435
29	II	adhesive force (mN)	coeff_of_var	0.429701
30	III	adhesive force (mN)	coeff_of_var	0.415435
31	IV	adhesive force (mN)	coeff_of_var	0.308042

Computing environment¶

[12]:

%load_ext watermark
%watermark -v -p numpy,pandas,jupyterlab

Python implementation: CPython
Python version       : 3.8.10
IPython version      : 7.22.0

numpy     : 1.20.2
pandas    : 1.2.4
jupyterlab: 3.0.14