Lessons 31 and 32: Practice with hacker stats solution

(c) 2019 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.


In [1]:
import numpy as np
import pandas as pd

Practice 1: Writing a function to draw bootstrap replicates

As you can imagine, it is quite useful to be able to generate, or "draw," bootstrap replicates. Now, you will write a function, draw_bs_reps(), to do this automatically. You will want to include this in a module so you can use it over and over again. (I will not be providing it in the bootcamp_utils module; I want you to write this yourself.)

  1. Define a function with call signature draw_bs_reps(data, func=np.mean, size=1), where func is a function that takes in an array and returns a statistic. Examples that could be passed in as func are np.mean, np.std, np.median, or a user-defined function. size is the number of replicates to generate.
  2. Write a good doc string.
  3. Define n to be the length of the input data array.
  4. Use a list comprehension to compute a list of bootstrap replicates.
  5. Return the replicates as a NumPy array.

Now that you have the function, feel free to play around with it and compute bootstrap replicates with the finch beak data or any other data set you like. This is a function you may use over and over again in your research, so you might want to keep it in a module you will use going forward. Or, you can install the dc_stat_think module that I wrote using pip, which has this and many other useful functions for bootstrapping.
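
One way the function might look, following the steps above (this is a sketch, not the only correct answer; the docstring is kept brief here):

import numpy as np


def draw_bs_reps(data, func=np.mean, size=1):
    """Draw `size` bootstrap replicates of `func` applied to `data`.

    Parameters
    ----------
    data : array_like
        One-dimensional array of repeated measurements.
    func : callable, default np.mean
        Function mapping an array to a summary statistic.
    size : int, default 1
        Number of bootstrap replicates to draw.

    Returns
    -------
    output : ndarray
        Bootstrap replicates of the statistic.
    """
    # Length of the input data array
    n = len(data)

    # Resample with replacement and compute the statistic for each sample
    return np.array(
        [func(np.random.choice(data, size=n)) for _ in range(size)]
    )

For example, draw_bs_reps(data, func=np.median, size=1000) would give 1000 bootstrap replicates of the median of data.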


Practice 2: Estimating the difference between two means

We have used bootstrapping to get estimates of summary statistics about a single set of repeated measurements. But we are often interested in the difference between two data sets. As an example, in our lesson on hacker stats, we computed the confidence interval for the mean beak depth of G. scandens in 1975 and in 2012. We can easily compute the difference in means between these two years.

In [2]:
# Load data set
df = pd.read_csv("data/grant_complete.csv", comment="#")

# Pull out records of interest
df = df.loc[
    (df["species"] == "scandens") & (df["year"].isin([1975, 2012])),
    ["year", "beak depth (mm)"],
]

# Convert to NumPy arrays
bd_1975 = df.loc[df["year"] == 1975, "beak depth (mm)"].values
bd_2012 = df.loc[df["year"] == 2012, "beak depth (mm)"].values

# Compute difference of means
print("Difference of means (2012 - 1975): ", 
      np.mean(bd_2012) - np.mean(bd_1975),
      "mm")
Difference of means (2012 - 1975):  0.2284920634920642 mm

We might like a confidence interval on this value. This can also be achieved by bootstrapping! Remember, the idea is to sample out of the actual measured data set, with replacement, and then compute the summary statistic. For this calculation, it would go as follows.

  1. Draw B bootstrap replicates of the mean out of bd_1975.
  2. Draw B bootstrap replicates of the mean out of bd_2012.
  3. Subtract the 1975 replicates from the 2012 replicates, and, voilà!, you have bootstrap replicates for the difference of means.

Be sure you understand why this technique follows the general bootstrap prescription. Code this up, and give a confidence interval for the difference in mean beak depth between 2012 and 1975.
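
A sketch of how this might look, assuming the draw_bs_reps() function from Practice 1 is available (the number of replicates is an arbitrary choice):

# Number of bootstrap replicates to draw (any large number works)
n_reps = 10000

# Bootstrap replicates of the mean for each year
bs_reps_1975 = draw_bs_reps(bd_1975, func=np.mean, size=n_reps)
bs_reps_2012 = draw_bs_reps(bd_2012, func=np.mean, size=n_reps)

# Replicates of the difference of means (2012 - 1975)
bs_reps_diff = bs_reps_2012 - bs_reps_1975

# 95% confidence interval from the 2.5th and 97.5th percentiles
conf_int = np.percentile(bs_reps_diff, [2.5, 97.5])

print("95% conf. int. of difference of means (mm):", conf_int)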


Practice 3: Visualizing ECDF confidence intervals

In the previous exercise, we showed that we can compute confidence intervals for the difference of means. We can compute confidence intervals for anything directly computed from the data. This includes the value of the ECDF for an arbitrary x. Here is how we can compute the confidence interval for the ECDF.

  1. Generate a bootstrap sample of the data.
  2. For each value x of your data set, evaluate the value of the ECDF at that point and record it. This is a bootstrap replicate of the ECDF at x.
  3. Do steps 1 and 2 over and over until you get your desired number of bootstrap replicates.
  4. For each value of x in your data set, compute the appropriate percentiles of your ECDF replicates. This gives you the ECDF confidence interval at x.

Step 2 is kind of tricky, since not all of your measured data points are present in each bootstrap sample, and you have to use the formal definition of the ECDF to get a replicate of the ECDF. Fortunately, this is done for you in the bokeh_catplot.ecdf() function. Use the conf_int kwarg in the bokeh_catplot.ecdf() function to plot ECDFs of the G. scandens beak depth for 1975 and 2012 with confidence intervals.
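
If you are curious what such a calculation looks like under the hood, here is a minimal sketch of the bootstrap procedure for a single data set (the function name and defaults are illustrative, not the bokeh_catplot API; the formal ECDF is evaluated with np.searchsorted()):

def ecdf_conf_int(data, n_reps=1000, ptiles=(2.5, 97.5)):
    """Bootstrap confidence interval of the ECDF evaluated at each data point."""
    x = np.sort(data)
    n = len(x)

    # Each row is one bootstrap replicate of the ECDF evaluated at the points x
    ecdf_reps = np.empty((n_reps, n))
    for i in range(n_reps):
        bs_sample = np.sort(np.random.choice(data, size=n))
        # Formal ECDF: fraction of the bootstrap sample that is <= each x
        ecdf_reps[i, :] = np.searchsorted(bs_sample, x, side="right") / n

    # Percentiles across replicates give the confidence bounds at each x
    low, high = np.percentile(ecdf_reps, ptiles, axis=0)

    return x, low, high

You could then plot low and high versus x as shaded bounds around the ECDF of the measured data.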


Practice 4: Bootstrapping "theory" with hacker stats

Say we have a data set with $n$ unique measurements. It can be shown that on average a fraction of $(1-1/n)^n$ of the measurements do not appear in a bootstrap sample. Note that for large samples, this is approximately $1/e \approx 1/2.7$, since

\begin{align} \lim_{n\to\infty} (1-1/n)^n = 1/e. \end{align}

Use hacker stats to show that this is, indeed, true. Hint: Think about a convenient "data set" to use for drawing samples.

This is kind of fun; you're investigating some theory behind hacker stats with hacker stats!
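
One way to carry out the check might look like the sketch below. The choice of np.arange(n) as the "data set" and the number of bootstrap samples are assumptions for illustration; any set of n unique values works.

# A convenient "data set": n unique, easily identifiable measurements
n = 100
data = np.arange(n)

# Draw many bootstrap samples and record the fraction of values missing from each
n_samples = 10000
frac_missing = np.empty(n_samples)
for i in range(n_samples):
    bs_sample = np.random.choice(data, size=n)
    frac_missing[i] = 1 - len(np.unique(bs_sample)) / n

print("Mean fraction missing:     ", np.mean(frac_missing))
print("Theoretical (1 - 1/n)^n:   ", (1 - 1 / n) ** n)
print("Large-n limit 1/e:         ", 1 / np.e)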

Computing environment

In [3]:
%load_ext watermark
%watermark -v -p numpy,pandas,jupyterlab
CPython 3.7.3
IPython 7.1.1

numpy 1.16.4
pandas 0.24.2
jupyterlab 0.35.5