Exercise 3.2: Split-Apply-Combine of the frog data set
We will continue working with the frog tongue adhesion data set.
You’ll now practice your split-apply-combine skills. First load in the data set. Then,
a) Compute standard deviation of the impact forces for each frog.
b) Compute the coefficient of variation of the impact forces and adhesive forces for each frog.
c) Compute a data frame that has the mean, median, standard deviation, and coefficient of variation of the impact forces and adhesive forces for each frog.
d) Now tidy this data frame. It might help to read the documentation about melting.
Solution
[1]:
import numpy as np
import pandas as pd
Of course, we start by loading in the data frame.
[2]:
df = pd.read_csv('data/frog_tongue_adhesion.csv', comment='#')
a) To compute the standard deviation of impact forces for each frog, we first group by the frog ID and then apply the std()
method to the GroupBy
object.
[3]:
grouped = df.groupby('ID')
grouped['impact force (mN)'].std()
[3]:
ID
I 630.207952
II 424.573256
III 124.273849
IV 234.864328
Name: impact force (mN), dtype: float64
We could also do this by method chaining.
[4]:
df.groupby('ID')['impact force (mN)'].std()
[4]:
ID
I 630.207952
II 424.573256
III 124.273849
IV 234.864328
Name: impact force (mN), dtype: float64
b) We first write a function to compute the coefficient of variation. We saw this in our lesson on tidy data and split-apply-combine.
[5]:
def coeff_of_var(data):
"""Coefficient of variation."""
return np.std(data) / np.abs(np.mean(data))
Next, we group the data frame with by the ID column.
[6]:
grouped = df.groupby('ID')
Finally, we select the columns of interest and apply the coeff_of_var()
aggregating function.
[7]:
df_cv = grouped[['impact force (mN)', 'adhesive force (mN)']].agg(coeff_of_var)
Finally, we reset the index to get our resulting data frame.
[8]:
df_cv.reset_index()
[8]:
ID | impact force (mN) | adhesive force (mN) | |
---|---|---|---|
0 | I | 0.401419 | 0.247435 |
1 | II | 0.585033 | 0.429701 |
2 | III | 0.220191 | 0.415435 |
3 | IV | 0.546212 | 0.308042 |
Again, we could do this in one line by method chaining.
[9]:
df.groupby("ID")[["impact force (mN)", "adhesive force (mN)"]].agg(
coeff_of_var
).reset_index()
[9]:
ID | impact force (mN) | adhesive force (mN) | |
---|---|---|---|
0 | I | 0.401419 | 0.247435 |
1 | II | 0.585033 | 0.429701 |
2 | III | 0.220191 | 0.415435 |
3 | IV | 0.546212 | 0.308042 |
c) Now we will apply all of the statistical functions to the impact force and adhesive force. This is as simple as using a list of aggregating functions in the agg()
method of the GroupBy
object.
[10]:
df_result = (
df.groupby("ID")[["impact force (mN)", "adhesive force (mN)"]]
.agg([np.mean, np.median, np.std, coeff_of_var])
.reset_index()
)
# Take a look
df_result
[10]:
ID | impact force (mN) | adhesive force (mN) | |||||||
---|---|---|---|---|---|---|---|---|---|
mean | median | std | coeff_of_var | mean | median | std | coeff_of_var | ||
0 | I | 1530.20 | 1550.5 | 630.207952 | 0.401419 | -658.40 | -664.5 | 167.143619 | 0.247435 |
1 | II | 707.35 | 573.0 | 424.573256 | 0.585033 | -462.30 | -517.0 | 203.811600 | 0.429701 |
2 | III | 550.10 | 544.0 | 124.273849 | 0.220191 | -206.75 | -201.5 | 88.122448 | 0.415435 |
3 | IV | 419.10 | 460.5 | 234.864328 | 0.546212 | -263.60 | -233.5 | 83.309442 | 0.308042 |
d) We can index these things using the MultiIndex
of the columns, but we much prefer tidy data frames, which we can generate again use pd.melt()
. The var_name
kwarg specifies what we name the levels of the MultiIndex
, and the id_vars
kwarg tells us which column to keep as it is.
[11]:
# Melt the DataFrame to make it tidy
pd.melt(df_result, var_name=['quantity', 'statistic'], id_vars='ID')
[11]:
ID | quantity | statistic | value | |
---|---|---|---|---|
0 | I | impact force (mN) | mean | 1530.200000 |
1 | II | impact force (mN) | mean | 707.350000 |
2 | III | impact force (mN) | mean | 550.100000 |
3 | IV | impact force (mN) | mean | 419.100000 |
4 | I | impact force (mN) | median | 1550.500000 |
5 | II | impact force (mN) | median | 573.000000 |
6 | III | impact force (mN) | median | 544.000000 |
7 | IV | impact force (mN) | median | 460.500000 |
8 | I | impact force (mN) | std | 630.207952 |
9 | II | impact force (mN) | std | 424.573256 |
10 | III | impact force (mN) | std | 124.273849 |
11 | IV | impact force (mN) | std | 234.864328 |
12 | I | impact force (mN) | coeff_of_var | 0.401419 |
13 | II | impact force (mN) | coeff_of_var | 0.585033 |
14 | III | impact force (mN) | coeff_of_var | 0.220191 |
15 | IV | impact force (mN) | coeff_of_var | 0.546212 |
16 | I | adhesive force (mN) | mean | -658.400000 |
17 | II | adhesive force (mN) | mean | -462.300000 |
18 | III | adhesive force (mN) | mean | -206.750000 |
19 | IV | adhesive force (mN) | mean | -263.600000 |
20 | I | adhesive force (mN) | median | -664.500000 |
21 | II | adhesive force (mN) | median | -517.000000 |
22 | III | adhesive force (mN) | median | -201.500000 |
23 | IV | adhesive force (mN) | median | -233.500000 |
24 | I | adhesive force (mN) | std | 167.143619 |
25 | II | adhesive force (mN) | std | 203.811600 |
26 | III | adhesive force (mN) | std | 88.122448 |
27 | IV | adhesive force (mN) | std | 83.309442 |
28 | I | adhesive force (mN) | coeff_of_var | 0.247435 |
29 | II | adhesive force (mN) | coeff_of_var | 0.429701 |
30 | III | adhesive force (mN) | coeff_of_var | 0.415435 |
31 | IV | adhesive force (mN) | coeff_of_var | 0.308042 |
Computing environment
[12]:
%load_ext watermark
%watermark -v -p numpy,pandas,jupyterlab
Python implementation: CPython
Python version : 3.11.3
IPython version : 8.12.0
numpy : 1.24.3
pandas : 1.5.3
jupyterlab: 3.6.3