Exercise 3.2: Split-Apply-Combine of the frog data set


We will continue working with the frog tongue adhesion data set.

You’ll now practice your split-apply-combine skills. First load in the data set. Then,

a) Compute standard deviation of the impact forces for each frog.

b) Compute the coefficient of variation of the impact forces and adhesive forces for each frog.

c) Compute a data frame that has the mean, median, standard deviation, and coefficient of variation of the impact forces and adhesive forces for each frog.

d) Now tidy this data frame. It might help to read the documentation about melting.

Solution


[1]:
import numpy as np
import pandas as pd

Of course, we start by loading in the data frame.

[2]:
df = pd.read_csv('data/frog_tongue_adhesion.csv', comment='#')

a) To compute the standard deviation of impact forces for each frog, we first group by the frog ID and then apply the std() method to the GroupBy object.

[3]:
grouped = df.groupby('ID')

grouped['impact force (mN)'].std()
[3]:
ID
I      630.207952
II     424.573256
III    124.273849
IV     234.864328
Name: impact force (mN), dtype: float64

We could also do this by method chaining.

[4]:
df.groupby('ID')['impact force (mN)'].std()
[4]:
ID
I      630.207952
II     424.573256
III    124.273849
IV     234.864328
Name: impact force (mN), dtype: float64

b) We first write a function to compute the coefficient of variation. We saw this in our lesson on tidy data and split-apply-combine.

[5]:
def coeff_of_var(data):
    """Coefficient of variation."""
    return np.std(data) / np.abs(np.mean(data))

Next, we group the data frame with by the ID column.

[6]:
grouped = df.groupby('ID')

Finally, we select the columns of interest and apply the coeff_of_var() aggregating function.

[7]:
df_cv = grouped[['impact force (mN)', 'adhesive force (mN)']].agg(coeff_of_var)

Finally, we reset the index to get our resulting data frame.

[8]:
df_cv.reset_index()
[8]:
ID impact force (mN) adhesive force (mN)
0 I 0.401419 0.247435
1 II 0.585033 0.429701
2 III 0.220191 0.415435
3 IV 0.546212 0.308042

Again, we could do this in one line by method chaining.

[9]:
df.groupby("ID")[["impact force (mN)", "adhesive force (mN)"]].agg(
    coeff_of_var
).reset_index()
[9]:
ID impact force (mN) adhesive force (mN)
0 I 0.401419 0.247435
1 II 0.585033 0.429701
2 III 0.220191 0.415435
3 IV 0.546212 0.308042

c) Now we will apply all of the statistical functions to the impact force and adhesive force. This is as simple as using a list of aggregating functions in the agg() method of the GroupBy object.

[10]:
df_result = (
    df.groupby("ID")[["impact force (mN)", "adhesive force (mN)"]]
    .agg(['mean', 'median', 'std', coeff_of_var])
    .reset_index()
)

# Take a look
df_result
[10]:
ID impact force (mN) adhesive force (mN)
mean median std coeff_of_var mean median std coeff_of_var
0 I 1530.20 1550.5 630.207952 0.401419 -658.40 -664.5 167.143619 0.247435
1 II 707.35 573.0 424.573256 0.585033 -462.30 -517.0 203.811600 0.429701
2 III 550.10 544.0 124.273849 0.220191 -206.75 -201.5 88.122448 0.415435
3 IV 419.10 460.5 234.864328 0.546212 -263.60 -233.5 83.309442 0.308042

d) We can index these things using the MultiIndex of the columns, but we much prefer tidy data frames, which we can generate again use pd.melt(). The var_name kwarg specifies what we name the levels of the MultiIndex, and the id_vars kwarg tells us which column to keep as it is.

[11]:
# Melt the DataFrame to make it tidy
pd.melt(df_result, id_vars=[("ID", "")]).rename(
    columns={("ID", ""): "ID", "variable_0": "quantity", "variable_1": "statistic"}
)
[11]:
ID quantity statistic value
0 I impact force (mN) mean 1530.200000
1 II impact force (mN) mean 707.350000
2 III impact force (mN) mean 550.100000
3 IV impact force (mN) mean 419.100000
4 I impact force (mN) median 1550.500000
5 II impact force (mN) median 573.000000
6 III impact force (mN) median 544.000000
7 IV impact force (mN) median 460.500000
8 I impact force (mN) std 630.207952
9 II impact force (mN) std 424.573256
10 III impact force (mN) std 124.273849
11 IV impact force (mN) std 234.864328
12 I impact force (mN) coeff_of_var 0.401419
13 II impact force (mN) coeff_of_var 0.585033
14 III impact force (mN) coeff_of_var 0.220191
15 IV impact force (mN) coeff_of_var 0.546212
16 I adhesive force (mN) mean -658.400000
17 II adhesive force (mN) mean -462.300000
18 III adhesive force (mN) mean -206.750000
19 IV adhesive force (mN) mean -263.600000
20 I adhesive force (mN) median -664.500000
21 II adhesive force (mN) median -517.000000
22 III adhesive force (mN) median -201.500000
23 IV adhesive force (mN) median -233.500000
24 I adhesive force (mN) std 167.143619
25 II adhesive force (mN) std 203.811600
26 III adhesive force (mN) std 88.122448
27 IV adhesive force (mN) std 83.309442
28 I adhesive force (mN) coeff_of_var 0.247435
29 II adhesive force (mN) coeff_of_var 0.429701
30 III adhesive force (mN) coeff_of_var 0.415435
31 IV adhesive force (mN) coeff_of_var 0.308042

Computing environment

[12]:
%load_ext watermark
%watermark -v -p numpy,pandas,jupyterlab
Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.20.0

numpy     : 1.26.4
pandas    : 2.2.1
jupyterlab: 4.0.13