Lesson 26: Hacker stats II


[1]:
%load_ext blackcellmagic

import numpy as np
import pandas as pd

import bokeh_catplot

import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()
Loading BokehJS ...

We have learned that we can compute confidence intervals of quantities calculated from data using bootstrapping. In the last lesson, the examples included bootstrap confidence intervals of an aggregating summary statistic (such as a mean, median, standard deviation, etc.) derived from a single set of measurements.

In this lesson, we explore how to compute confidence intervals for more complex summaries of data sets. We will again use the data set from Peter and Rosemary Grant, which we load in now, again using only G. scandens in 1975 and 2012. We will also use beak length in addition to beak depth in this lesson.

[2]:
df = pd.read_csv("data/grant_complete.csv", comment="#")
df = df.loc[(df["species"] == "scandens") & (df["year"].isin([1975, 2012])), :]

df.head()
[2]:
      band  beak depth (mm)  beak length (mm)   species  year
401    302              8.4              13.9  scandens  1975
402    304              8.8              14.0  scandens  1975
403    306              8.4              12.9  scandens  1975
404    310              8.0              13.5  scandens  1975
405    317              7.9              12.9  scandens  1975

Finally, before we begin our analysis, we will instantiate a random number generator.

[3]:
rg = np.random.default_rng()

The difference of means

In the last lesson, we plotted the mean beak depth with confidence intervals for G. scandens birds measured in 1975 and 2012. We then compared the respective values. But what if we were interested in the difference of the means? We can easily compute that difference.

[4]:
# Extract relevant beak depths as Numpy arrays
bd_1975 = df.loc[df['year']==1975, 'beak depth (mm)'].values
bd_2012 = df.loc[df['year']==2012, 'beak depth (mm)'].values

# Report difference of means
np.mean(bd_2012) - np.mean(bd_1975)
[4]:
0.2284920634920642

So, the beaks got deeper by about 230 microns. But how do we get a confidence interval on this difference? We use the same bootstrapping approach as before.

  1. Obtain a bootstrap replicate for the mean of the 2012 beak depths.

  2. Obtain a bootstrap replicate for the mean of the 1975 beak depths.

  3. Subtract the two to get a replicate for the difference of mean beak depths.

Let’s implement this. We can use the draw_bs_rep() function we wrote in the previous lesson.

[5]:
def draw_bs_rep(data, func, rg):
    """Compute a bootstrap replicate from data."""
    bs_sample = rg.choice(data, size=len(data))
    return func(bs_sample)


# Replicates of each year
n_reps = 2000
bs_reps_1975 = np.array(
    [draw_bs_rep(bd_1975, np.mean, rg) for _ in range(n_reps)]
)
bs_reps_2012 = np.array(
    [draw_bs_rep(bd_2012, np.mean, rg) for _ in range(n_reps)]
)

# Get replicates of the difference
bd_reps_diff = bs_reps_2012 - bs_reps_1975

# Report the confidence interval
np.percentile(bd_reps_diff, [2.5, 97.5])
[5]:
array([0.05954324, 0.40111022])
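
As an aside, because the size argument of rg.choice can be a tuple, the same replicates can be drawn without an explicit Python loop. This is just a sketch of a vectorized alternative; the looping approach above is the one we will keep using.

# Each row of the 2D sample array is one bootstrap sample; taking the mean
# across columns gives one replicate per row
bs_reps_1975_alt = rg.choice(bd_1975, size=(n_reps, len(bd_1975))).mean(axis=1)
bs_reps_2012_alt = rg.choice(bd_2012, size=(n_reps, len(bd_2012))).mean(axis=1)

# Confidence interval of the difference of means, as before
np.percentile(bs_reps_2012_alt - bs_reps_1975_alt, [2.5, 97.5])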

Pairs bootstrap

Now let’s consider beak length in addition to beak depth. We will only work with the 1975 data for this. First, we’ll look at beak length versus beak depth graphically.

[6]:
p = bokeh.plotting.figure(
    frame_width=250,
    frame_height=250,
    x_axis_label='beak depth (mm)',
    y_axis_label='beak length (mm)',
)

# Extract beak lengths
bl_1975 = df.loc[df['year']==1975, 'beak length (mm)'].values

p.circle(bd_1975, bl_1975)

bokeh.io.show(p)

We can think about the relationship between beak length and depth in several ways. There seems to be a strong correlation between length and depth; that is, the overall shape of the beak is roughly the same, and it just changes in scale. The length-to-depth ratio should therefore be similar for each bird. So, we seek both an estimate of the correlation coefficient between beak lengths and depths and an estimate of the length-to-depth ratio.

We will start with the ratio. Let’s compute and plot its ECDF first.

[7]:
p = bokeh_catplot.ecdf(
    bl_1975 / bd_1975,
    x_axis_label='length/depth',
)

bokeh.io.show(p)
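
As a quick point of reference before bootstrapping, here is a sketch of the plug-in (point) estimate of the mean ratio, computed directly from the measured data.

# Plug-in estimate of the mean length-to-depth ratio
np.mean(bl_1975 / bd_1975)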

To compute an estimate of the mean ratio and its bootstrap confidence interval, we need to do a pairs bootstrap. This means that we draw length and depth pairs together to get a bootstrap sample. To do this, we randomly sample the indices of the measurements with replacement, and then keep the entries of each of the two arrays that correspond to those bootstrap indices.

[8]:
def draw_bs_pairs(x, y, rg):
    """Draw pairs of data points out of `x` and `y`"""
    # Get indices of bootstrap sample
    bs_inds = rg.choice(np.arange(len(x)), len(x), replace=True)

    return x[bs_inds], y[bs_inds]

Now that we can draw a pairs sample, we can compute whatever statistic we like from the pairs to get a bootstrap replicate. In this case, we take the ratio of each pair and then compute the mean of the ratios. We could do this with a comprehension, but it would need an anonymous function, which we are not covering in the bootcamp. So, we will use a for loop.

[9]:
ratio_bs_reps = np.empty(n_reps)

for i in range(n_reps):
    bd, bl = draw_bs_pairs(bd_1975, bl_1975, rg)
    ratio_bs_reps[i] = np.mean(bl / bd)
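
For those curious, here is a sketch of what the comprehension version would look like. It needs an anonymous (lambda) function to turn a pair of bootstrap samples into a single replicate, which is why the for loop above is easier to read.

# Equivalent replicates using a comprehension and an anonymous function
ratio_bs_reps_alt = np.array(
    [
        (lambda bd, bl: np.mean(bl / bd))(*draw_bs_pairs(bd_1975, bl_1975, rg))
        for _ in range(n_reps)
    ]
)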

Now that we have our reps, we can compute the 95% confidence interval.

[10]:
np.percentile(ratio_bs_reps, [2.5, 97.5])
[10]:
array([1.56261803, 1.59580993])

We can do a similar calculation for the correlation. To compute the correlation, we need the covariance matrix,

\begin{align} \mathsf{\Sigma} = \begin{pmatrix} \sigma_d^2 & \sigma_{dl} \\ \sigma_{dl} & \sigma_l^2 \end{pmatrix}. \end{align}

The correlation is \(\rho = \sigma_{dl} / (\sigma_d \sigma_l)\). We can use np.cov to compute the covariance matrix, and then compute \(\rho\).
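
For reference, here is a sketch of the plug-in estimate of \(\rho\) for the measured 1975 data, using the same calculation we will use inside the bootstrap loop. (Numpy's np.corrcoef(bd_1975, bl_1975)[0, 1] gives the same number.)

# Plug-in estimate of the correlation between beak depth and length
cov = np.cov(bd_1975, bl_1975)
cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])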

[11]:
correlation_bs_reps = np.empty(n_reps)

for i in range(n_reps):
    bd, bl = draw_bs_pairs(bd_1975, bl_1975, rg)
    cov = np.cov(bd, bl)
    correlation_bs_reps[i] = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

And we can again compute the confidence interval by taking percentiles.

[12]:
np.percentile(correlation_bs_reps, [2.5, 97.5])
[12]:
array([0.46249291, 0.75098053])

Bootstrap confidence interval of an ECDF

We can compute confidence intervals for anything directly computed from the data. This includes the value of an ECDF for an arbitrary x. Here is how we can compute the confidence interval for the ECDF.

  1. Generate a bootstrap sample of the data.

  2. For each value x of your data set, evaluate the value of the ECDF at that point and record it. This is a bootstrap replicate of the ECDF at x.

  3. Do steps 1 and 2 over and over until you get your desired number of bootstrap replicates.

  4. For each value of x in your data set, compute the appropriate percentiles of your ECDF replicates. This gives you the ECDF confidence interval at x.
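
To make the procedure concrete, here is a minimal sketch of it for the 1975 beak depths. The helper function ecdf_at() and the variable names are my own and not part of the bootcamp code.

def ecdf_at(x, data):
    """Value of the ECDF of `data` evaluated at each point in `x`."""
    return np.array([np.mean(data <= x_val) for x_val in x])


# Points at which to evaluate the ECDF: the measured beak depths
x = np.sort(bd_1975)

# Each row is one bootstrap replicate of the ECDF evaluated at x
ecdf_reps = np.empty((n_reps, len(x)))
for i in range(n_reps):
    bs_sample = rg.choice(bd_1975, size=len(bd_1975))
    ecdf_reps[i] = ecdf_at(x, bs_sample)

# 95% confidence interval of the ECDF at each value of x
ecdf_low, ecdf_high = np.percentile(ecdf_reps, [2.5, 97.5], axis=0)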

Step 2 is kind of tricky, since not all of your measured data points are present in each bootstrap sample, so you have to use the formal definition of the ECDF to get a replicate of the ECDF at each point, as in the sketch above. Fortunately, this is done for you in the bokeh_catplot.ecdf() function via the conf_int kwarg. I will now use it to plot ECDFs of the G. scandens beak depths for 1975 and 2012 with confidence intervals.

[13]:
p = bokeh_catplot.ecdf(
    data=df,
    cats='year',
    val='beak depth (mm)',
    style='staircase',
    conf_int=True
)

bokeh.io.show(p)

There is some overlap in the confidence intervals, especially at the tails, probably because birds with extreme beak depths are rare. The middle ranges of the ECDFs show less overlap.

Conclusions

We have seen that we can use random number generation to directly simulate doing an experiment over and over again to get confidence intervals for a wide variety of statistical quantities. Note that in all of this, we never assumed a statistical model for any of our data. We never made assumptions about normality or anything else. This branch of statistical inference is called nonparametric statistics: there are no parameters; everything is computed directly from the data. This highlights the power of bootstrap methods; they are direct and avoid unnecessary assumptions.

That said, parametric statistics has a lot to offer, but it is a broad topic we will not cover in the bootcamp.

Computing environment

[14]:
%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,bokeh_catplot,jupyterlab
CPython 3.7.7
IPython 7.15.0

numpy 1.18.1
pandas 0.24.2
bokeh 2.1.0
bokeh_catplot 0.1.8
jupyterlab 2.1.4