(c) 2019 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This tutorial was generated from a Jupyter notebook. You can download the notebook here.
import numpy as np
import pandas as pd
import bokeh_catplot
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
As you can imagine, it is quite useful to be able to generate, or "draw," bootstrap replicates. Now, you will write a function, draw_bs_reps(), to do this automatically. You will want to include this in a module so you can use it over and over again. (I will not be providing it in the bootcamp_utils module; I want you to write this yourself.)
- Define a function with call signature draw_bs_reps(data, func=np.mean, size=1), where func is a function that takes in an array and returns a statistic. Examples that could be passed in as func are np.mean, np.std, np.median, or a user-defined function. size is the number of replicates to generate.
- Write a good doc string.
- Define n to be the length of the input data array.
- Use a list comprehension to compute a list of bootstrap replicates.
- Return the replicates as a NumPy array.
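If it helps to see the core operation in isolation first, a single bootstrap replicate can be drawn like this (a minimal sketch with made-up toy data; the full function simply repeats this size times):

# Toy data set for illustration
data = np.array([1.2, 2.3, 3.1, 4.8, 5.0])

# One bootstrap sample: same length as the data, drawn with replacement
bs_sample = np.random.choice(data, replace=True, size=len(data))

# One bootstrap replicate: the statistic computed on the bootstrap sample
bs_rep = np.mean(bs_sample)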
Now that you have the function, feel free to play around with it and compute bootstrap replicates with the finch beak data or any other data set you like. This is a function you may use over and over again in your research, so you might want to keep it in a module you will use going forward. Or, you can use pip to install the dc_stat_think module that I wrote, which has this and many other useful functions for bootstrapping.
I show the function below.
def draw_bs_reps(data, func=np.mean, size=1):
    """Draw bootstrap replicates from a data set."""
    return np.array([func(np.random.choice(data, replace=True, size=len(data)))
                     for _ in range(size)])
Let's try this function out on the beak depth data from 1975 to get bootstrap replicates of the mean.
df = pd.read_csv('data/grant_1975.csv', comment='#')
bd_1975 = df['Beak depth, mm'].values
# Compute replicates
bs_reps = draw_bs_reps(bd_1975, np.mean, size=2000)
# 95% confidence interval
print(np.percentile(bs_reps, [2.5, 97.5]))
Nice!
We have used bootstrapping to get estimates of summary statistics about a single set of repeated measurements. But we are often interested in the difference between two data sets. As an example, in our lesson on hacker stats, we computed the confidence interval for the mean beak depth of G. scandens in 1975 and in 2012. We can easily compute the difference in means between these two years.
# Load data set
df = pd.read_csv("data/grant_complete.csv", comment="#")
# Pull out records of interest
df = df.loc[
(df["species"] == "scandens") & (df["year"].isin([1975, 2012])),
["year", "beak depth (mm)"],
]
# Convert to NumPy arrays
bd_1975 = df.loc[df['year']==1975, 'beak depth (mm)'].values
bd_2012 = df.loc[df['year']==2012, 'beak depth (mm)'].values
# Compute difference of means
print("Difference of means (2012 - 1975): ",
np.mean(bd_2012) - np.mean(bd_1975),
"mm")
We might like a confidence interval on this value. This can also be achieved by bootstrapping! Remember, the idea is to sample out of the actual measured data set, with replacement, and then compute the summary statistic. For this calculation, it would go as follows.

- Draw B bootstrap replicates of the mean out of bd_1975.
- Draw B bootstrap replicates of the mean out of bd_2012.
- Subtract the bd_1975 replicates from the bd_2012 replicates to get bootstrap replicates of the difference of means, and take percentiles of those to get the confidence interval.

Be sure you understand why this technique follows the general bootstrap prescription. Code this up, and give a confidence interval for the difference in mean beak depth between 2012 and 1975.
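Coded directly, following the steps above, it might look like this (a sketch; the choice of B = 2000 replicates is arbitrary):

# Draw B bootstrap replicates of the mean for each year
B = 2000
bs_reps_1975 = draw_bs_reps(bd_1975, np.mean, size=B)
bs_reps_2012 = draw_bs_reps(bd_2012, np.mean, size=B)

# Replicates of the difference of means (2012 - 1975)
bs_reps_diff = bs_reps_2012 - bs_reps_1975

# 95% confidence interval from the percentiles
print(np.percentile(bs_reps_diff, [2.5, 97.5]))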
We can write a function to compute bootstrap replicates for the difference of arbitrary summary statistics and use that.
def draw_bs_reps_diff(data_1, data_2, func=np.mean, size=1):
    """Draw bootstrap replicates for the difference in summary
    statistic given by `func`. E.g., func(data_1) - func(data_2)."""
    return draw_bs_reps(data_1, func, size) - draw_bs_reps(data_2, func, size)
# Draw the replicates
bs_reps = draw_bs_reps_diff(bd_2012, bd_1975, size=2000)
# Compute confidence interval
np.percentile(bs_reps, [2.5, 97.5])
The 95% confidence interval does not cross zero, suggesting that there is a real difference in mean beak depth; that is, it is not just due to random noise and small sample size.
In the previous exercise, we showed that we can compute confidence intervals for the difference of means. We can compute confidence intervals for anything directly computed from the data. This includes the value of the ECDF for an arbitrary x. Here is how we can compute the confidence interval for the ECDF.

1. Draw a bootstrap sample out of the measured data.
2. For each x of your data set, evaluate the value of the ECDF at that point and record it. This is a bootstrap replicate of the ECDF at x.
3. Repeat steps 1 and 2 many times to build up many ECDF replicates.
4. For each x in your data set, compute the appropriate percentiles of your ECDF replicates. This gives you the ECDF confidence interval at x.

Step 2 is kind of tricky, since not all of your measured data points are present in each bootstrap sample and you have to use the formal definition of the ECDF to get a replicate of the ECDF. Fortunately, this is done for you in the bokeh_catplot.ecdf() function. Use the conf_int kwarg in the bokeh_catplot.ecdf() function to plot ECDFs of the G. scandens beak depth for 1975 and 2012 with confidence intervals.
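To see what step 2 entails, here is a conceptual sketch (the helper name ecdf_at is made up for illustration and is not what bokeh_catplot uses internally) of evaluating the formal ECDF of a bootstrap sample at the measured 1975 beak depths:

def ecdf_at(x, sample):
    """Formal ECDF of `sample` evaluated at the points `x`."""
    return np.searchsorted(np.sort(sample), x, side='right') / len(sample)

# One bootstrap replicate of the ECDF of the 1975 beak depths,
# evaluated at every measured 1975 beak depth
bs_sample = np.random.choice(bd_1975, replace=True, size=len(bd_1975))
ecdf_rep = ecdf_at(bd_1975, bs_sample)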
p = bokeh_catplot.ecdf(
data=df,
cats='year',
val='beak depth (mm)',
formal=True,
conf_int=True
)
bokeh.io.show(p)
There is some overlap in the confidence intervals, especially at the tails, probably because birds with extreme beak depths are rare, but the middle ranges of the ECDFs have less overlap.
Say we have a data set with $n$ unique measurements. It can be shown that on average a fraction of $(1-1/n)^n$ of the measurements do not appear in a bootstrap sample. Note that for large samples, this is approximately $1/e \approx 1/2.7$, since
\begin{align}
\lim_{n\to\infty} (1-1/n)^n = 1/e.
\end{align}

Use hacker stats to show that this is, indeed, true. Hint: Think about a convenient "data set" to use for drawing samples.
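As a side note on where that expression comes from (a quick argument, not part of the exercise): in a single draw, a given measurement is missed with probability $1 - 1/n$, and a bootstrap sample consists of $n$ independent draws, so

\begin{align}
P(\text{a given measurement never appears in the sample}) = (1 - 1/n)^n.
\end{align}

By linearity of expectation, this is also the expected fraction of measurements omitted from a bootstrap sample.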
This is kind of fun; you're investigating some theory behind hacker stats with hacker stats!
We will generate a data set that is just $n$ consecutive integers. Our statistic from our bootstrap sample is then
\begin{align}
\text{fraction omitted} = 1 - \frac{\text{number of unique entries in the bootstrap sample}}{n}.
\end{align}

def frac_omitted(data):
    """Compute fraction omitted, assuming the original data set had all unique values."""
    return 1 - len(np.unique(data)) / len(data)
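For example (a toy check with n = 10):

# A "data set" of 10 consecutive integers
data = np.arange(10)

# Fraction of the original values missing from one bootstrap sample
bs_sample = np.random.choice(data, replace=True, size=len(data))
print(frac_omitted(bs_sample))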
We will draw a few thousand bootstrap replicates of this for several values of $n$ and make a plot of the mean.
def mean_frac_omitted(n, n_bs_reps=10000):
    """Mean fraction omitted over bootstrap samples of n unique data points."""
    data = np.arange(n)
    bs_reps = draw_bs_reps(data, func=frac_omitted, size=n_bs_reps)
    return np.mean(bs_reps)
n = np.unique(np.logspace(0, 3).astype(int))
mean_f = np.array([mean_frac_omitted(n_val) for n_val in n])
Now that we have the mean fraction omitted for each value of $n$, we can plot it, together with the theoretical curve.
# Compute theoretical curve
n_theor = np.logspace(0, 3, 400)
mean_f_theor = (1 - 1/n_theor)**n_theor
# Set up figure
p = bokeh.plotting.figure(
height=300,
width=400,
x_axis_label='n',
y_axis_label='mean fraction omitted',
x_axis_type='log',
x_range=[0.9, 1100],
)
# Theoretical curve
p.line(
x=n_theor,
y=mean_f_theor,
line_width=2
)
# Asymptotic result
p.line(
x=[0.9, 1100],
y=[1/np.exp(1)]*2,
line_dash='dashed',
line_color='gray',
)
# Result from hacker stats
p.circle(
x=n,
y=mean_f,
color='orange'
)
bokeh.io.show(p)
The result closely matches theory!
%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,bokeh_catplot,jupyterlab