Lesson 25: Hacker stats I


[1]:
import numpy as np
import pandas as pd

import bokeh_catplot

import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()
Loading BokehJS ...

When the field of statistics was in its early days, the practitioners did not have computers. They were therefore left to use pen and paper to compute things like confidence intervals. Despite their toils, you will soon see that with just a little bit of programming experience, you can perform lots of the statistical analyses that may seem baffling when done with pen and paper.

At the heart of this “hacker statistics” is the ability to draw random numbers. We will focus on bootstrap methods in particular.

To motivate this study, we will work with data measured by Peter and Rosemary Grant on the island of Daphne Major on the Galápagos. They have been going to the island every year for over forty years and have been taking a careful inventory of the finches there. We will look at the finch Geospiza scandens. The Grants measured the depths of the beaks (defined as the top-to-bottom thickness of the beak) of all of the finches of this species on the island. We will consider their measurements from 1975 and from 2012. We will investigate how the beaks got deeper over time.

The data are from the book Grants’ book 40 years of evolution: Darwin’s finches on Daphne Major Island. They were generous and made their data publicly available on the Dryad data repository. In general, it is a very good idea to put your published data in public data repositories, both to preserve the data and also to make your findings public.

Ok, let’s start by loading in the data. You converted the Grants’ data into a single DataFrame in the exercises. Let’s load the data, which are available in the file ~/git/bootcamp/data/grant_complete.csv.

[2]:
df = pd.read_csv('data/grant_complete.csv', comment='#')

df.head()
[2]:
band beak depth (mm) beak length (mm) species year
0 20123 8.05 9.25 fortis 1973
1 20126 10.45 11.35 fortis 1973
2 20128 9.55 10.15 fortis 1973
3 20129 8.75 9.95 fortis 1973
4 20133 10.15 11.55 fortis 1973

Let’s trim down the data frame to only include G. scandens from 1975 and 2012 and only include the columns we need.

[3]:
df = df.loc[
    (df["species"] == "scandens") & (df["year"].isin([1975, 2012])),
    ["year", "beak depth (mm)"],
]

Let’s take a look at the ECDFs for these two years.

[4]:
p = bokeh_catplot.ecdf(
    data=df,
    cats='year',
    val='beak depth (mm)',
)

bokeh.io.show(p)

Judging from the ECDFs, it seems as though beaks have gotten deeper over time. But now, we would like a statistic to compare. One statistic that comes to mind it the mean. So, let’s compare those. First, we’ll pull out the data sets as NumPy arrays for convenience (and speed later on when we start doing bootstrap replicates).

[5]:
bd_1975 = df.loc[df['year']==1975, 'beak depth (mm)'].values
bd_2012 = df.loc[df['year']==2012, 'beak depth (mm)'].values

np.mean(bd_1975), np.mean(bd_2012)
[5]:
(8.959999999999999, 9.188492063492063)

So, indeed, the mean beak depth is bigger in 2012 than in 1975. There is clearly some variability in beak depth among the birds measured each year, so it is possible that this observation is just due to random chance and the mean beak depth is really not that big. So, we would like to compute a confidence interval of the mean. We will compute the 95% confidence interval.

What is a 95% confidence interval? It can be thought of as follows. If we were to repeat the experiment over and over and over again, 95% of the time, the observed mean would lie in the 95% confidence interval. So, if the confidence intervals of the means of measurements from 1975 and from 2012 overlapped, we might not be so sure that the beaks got deeper due to some underlying selective pressure, but that we just happened to observe deeper beaks as a result of natural variability.

So, how do we compute a confidence interval? ….We use our computer!

Bootstrap confidence intervals

The notion of the bootstrap was first published by Brad Efron in 1979. The idea is simple, and we will take the fact that it works as a given; Efron proved it for us.

Here’s the idea: If we could somehow repeat the measurements of the beak depths on Daphne Major, we could do it many many times, and we could then just compute the 2.5th and 97.5th percentiles to get a 95% confidence interval. The problem is, we can’t repeat the experiments over and over again. 1975 only happened once, and all birds on the island were measured. We cannot have 1975 happen again under exactly the same conditions.

Instead, we will have our computer simulate doing the experiment over and over again. Hacker statistics! We have one set of measurements. We “repeat” the experiment by drawing measurements out of the ones we have again and again. Here’s what we do to compute a bootstrap estimate of the mean of a set of \(n\) data points.

  1. Draw n data points out of the original data set with replacement. This set of data points is called a bootstrap sample.

  2. Compute the mean of the bootstrap sample. This is called a bootstrap replicate of the mean.

  3. Do this over and over again, storing the results.

So, it is as if we did the experiment over and over again, obtaining a mean each time. Remember, our bootstrap sample has exactly the same number of “measurements” as the original data set. Let’s try it with the bd_1975 data (remember the mean beak depth was 8.96 mm). First we’ll generate a bootstrap sample. Remember, the function np.random.choice() allows us to sample out of an array with replacement, if we like.

[6]:
# Seed RNG for ease of commenting on values later;
# generally do not seed the RNG when doing hacker stats
rg = np.random.default_rng(3252)

bs_sample = rg.choice(bd_1975, replace=True, size=len(bd_1975))

Let’s take a quick look at this bootstrap sample by plotting its ECDF right next to that of the original data set.

[7]:
# Original data set
p = bokeh_catplot.ecdf(
    data=df.loc[df['year']==1975, :],
    cats=None,
    val='beak depth (mm)',
)

# Bootstrap data set
p = bokeh_catplot.ecdf(
    data=pd.DataFrame(data={'bootstrap': bs_sample}),
    cats=None,
    val='bootstrap',
    p=p,
    marker_kwargs=dict(
        fill_color=None,
        line_color='gray'
    ),
)

bokeh.io.show(p)

It is qualitatively similar, but obviously not exactly the same data set.

Now, let’s compute our bootstrap replicate. It’s as simple as computing the mean of the bootstrap sample.

[8]:
bs_replicate = np.mean(bs_sample)
bs_replicate
[8]:
8.849770114942531

So, the mean of the bootstrap replicate is 8.84 mm, which is less than the mean of 8.96 from the original data set.

Now, we can write a for loop to get lots and lots of bootstrap replicates. Note the since you are doing the replicates many many times, speed matters. For this reason, be sure you convert the data you are bootstrapping into a Numpy array. The calculations with them are much faster than with Pandas Series.

[9]:
# Number of replicatess
n_reps = 2000

# Initialize bootstrap replicas array
bs_reps_1975 = np.empty(n_reps)

# Compute replicates
for i in range(n_reps):
    bs_sample = rg.choice(bd_1975, size=len(bd_1975))
    bs_reps_1975[i] = np.mean(bs_sample)

Now that we have our replicas, 2,000 of them, we can plot an ECDF to see what we might expect of the mean if we were to do the experiment again.

[10]:
# Original data set
p = bokeh_catplot.ecdf(
    bs_reps_1975,
    x_axis_label='mean beak depth (mm)',
)

bokeh.io.show(p)

It looks Normally distributed, and in fact it must be approximately Normally distributed by the Central Limit Theorem (which we will not discuss here, but we didn’t really need to derive; hacker statistics brought us here!). The most probable mean (located at the inflection point on the CDF) we would get is 8.96 mm, which was what was measured, but upon repeating the experiment, we could get a mean as low as about 8.7 mm or as high as about 9.2 mm.

Let’s compute the 95% confidence interval.

[11]:
conf_int_1975 = np.percentile(bs_reps_1975, [2.5, 97.5])
conf_int_1975
[11]:
array([8.8466523 , 9.07979023])

A function for replicates

The construction we had for making our bootstrap replicates was a bit clunky:

# Initialize bootstrap replicas array
bs_reps_1975 = np.empty(n_reps)

# Compute replicates
for i in range(n_reps):
    bs_sample = rg.choice(bd_1975, size=len(bd_1975))
    bs_reps_1975[i] = np.mean(bs_sample)

We had to set up an empty array, and then loop through each index, draw a bootstrap sample, compute its mean to get the replicate, and then place it in the array. We could, instead, write a function to compute a bootstrap replicate from data.

[12]:
def draw_bs_rep(data, func, rg):
    """Compute a bootstrap replicate from data."""
    bs_sample = rg.choice(data, size=len(data))
    return func(bs_sample)

Note that this function is generic in that it can compute the replicate using any summary statistic function, such as np.median(), np.std(), or anything else. With this function in hand, our code starts to look a little cleaner.

# Initialize bootstrap replicas array
bs_reps_1975 = np.empty(n_reps)

# Compute replicates
for i in range(n_reps):
    bs_reps_1975 = draw_bs_rep(bd_1975, np.mean, rg)

We can use this function in a list comprehension to quickly get an array of replicates.

[13]:
bs_reps_1975 = np.array(
    [draw_bs_rep(bd_1975, np.mean, rg) for _ in range(n_reps)]
)

This is much more concise and perhaps cleaner syntax. Now let’s use this construction to make bootstrap replicates for the 2012 samples.

[14]:
# Compute replicates
bs_reps_2012 = np.array(
    [draw_bs_rep(bd_2012, np.mean, rg) for _ in range(n_reps)]
)

# Compute the confidence interval
conf_int_2012 = np.percentile(bs_reps_2012, [2.5, 97.5])

Now, let’s print the two confidence intervals next to each other for comparison.

[15]:
print(conf_int_1975)
print(conf_int_2012)
[8.8466523  9.07979023]
[9.07458333 9.30000992]

So, the 95% confidence intervals for the 2012 and 1975 juuust overlap. This implies that the inherent variation in beak depths is likely not responsible for the observed difference. There was likely some selective pressure toward deeper beaks.

Plotting and reporting confidence intervals

In the above code cell, we have shown numerical values for the edges of the 95% confidence interval for beak depth in 1975 and 2012. How would you display these results in a publication?

We will start by discussing how you would express confidence intervals in text. You may have seen confidence intervals expressed like this: 10.3 ± 2.1. I never write my confidence intervals this way because confidence intervals are in general not symmetric. Rather, would report the confidence intervals for the above calculations like this:

  • 1975 beak depth: \(8.96^{+0.11}_{-0.12}\) mm.

  • 2012 beak depth: \(9.19^{+0.11}_{-0.11}\) mm.

Better yet, we can show the confidence intervals graphically. I will make a plot and then discuss my design choices and some of the details about how I made the plot.

[16]:
years = ['2012', '1975']
p = bokeh.plotting.figure(
    frame_height=100,
    frame_width=250,
    x_axis_label='beak depth (mm)',
    y_range=years,
)

p.circle([bd_2012.mean(), bd_1975.mean()], years, size=5)
p.line(conf_int_1975, ['1975']*2, line_width=3)
p.line(conf_int_2012, ['2012']*2, line_width=3)

bokeh.io.show(p)

I prefer to make my plots of confidence intervals with error bars horizontal. Most plots I see have them vertical, but I find this counterintuitive. If we are thinking of the confidence interval as a summary of a probability density function or of a cumulative distribution function, the x-axis should have the measured value. Furthermore, the other axis is usually categorical, so it often contains text that is to be read. In this case, they are just years, but in many applications they are things like genotypes, treatments, etc. These are easier to read if they are on the vertical axis because they can easily be read horizontally without rotating text.

To make this plot, I made a categorical y-axis with Bokeh. To do this, when I set up the figure, I specified the y_range kwargs to be a list of strings. This tells Bokeh that the axis is categorical. Then, when I populated the glyphs, I simply enter the values of the strings as the y-axis values.

Equivalence of standard deviation bootstrap samples of the mo and standard error of the mean

The standard error of the mean, or SEM, is a measure of uncertainty of the estimate of the mean. In other words, if we did the set of measurements again, we would get a different mean. The variability in these measured means is described by the SEM. Specifically, it is the standard deviation of the Normal distribution describing the mean of repeated measurements. So, from bootstrap replicates, we can directly apply this formula.

[17]:
bs_sem = np.std(bs_reps_1975)
bs_sem
[17]:
0.06003142350273925

It can be shown analytically that the SEM can be computed directly from the measurements as the standard deviation of the measurements divided by the square root of the number of measurements.

[18]:
sem = np.std(bd_1975, ddof=1) / np.sqrt(len(bd_1975))
sem
[18]:
0.06074539219629801

Hey, we got the same result! You may think I’m a jerk for making you get a simple answer the hard way by bootstrapping. But remember that bootstrap replicates are easy to generate in general for any statistic, and the confidence intervals on those statistics might not be as simple as for the mean, as we will demonstrate now.

Bootstrap confidence interval of the standard deviation

We are not limited to computing bootstrap confidence intervals of the mean. We could compute bootstrap confidence intervals of any statistic, like the median, standard deviation, the standard deviation divided by the mean (coefficient of variation), whatever we like. Computing the confidence interval for the standard deviation is the same procedure as we have done; we just put np.std in for np.mean.

[19]:
# Compute replicates
bs_reps_1975 = np.array([draw_bs_rep(bd_1975, np.std, rg) for _ in range(n_reps)])

# Compute confidence interval
conf_int_1975 = np.percentile(bs_reps_1975, [2.5, 97.5])
print(conf_int_1975)

# Plot ECDF
p = bokeh_catplot.ecdf(
    bs_reps_1975,
    x_axis_label='std beak depth (mm)',
)

bokeh.io.show(p)
[0.47835178 0.63827714]

Note that the distribution for the standard deviation is not Normal; it has a heavier right tail. So, we now also have an estimate for the variability in beak depth. It could range from about 0.48 to 0.64 mm. We could report it like \(0.56_{-0.08}^{+0.07}\) mm.

Computing environment

[20]:
%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,bokeh_catplot,jupyterlab
CPython 3.7.7
IPython 7.15.0

numpy 1.18.1
pandas 0.24.2
bokeh 2.1.0
bokeh_catplot 0.1.8
jupyterlab 2.1.4