Exercise 5.3: Bootstrapping “theory” with hacker stats

Say we have a data set with \(n\) unique measurements. It can be shown that, on average, a fraction \((1-1/n)^n\) of the measurements do not appear in a bootstrap sample: each of the \(n\) draws misses a given measurement with probability \(1-1/n\), and the draws are independent. Note that for large samples, this is approximately \(1/e \approx 1/2.7\), since

\begin{align} \lim_{n\to\infty} (1-1/n)^n = 1/e. \end{align}
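As a quick numerical sanity check of this limit (separate from the bootstrap exercise itself), we can evaluate \((1-1/n)^n\) for a few increasing values of \(n\). The names `n_check` and `frac_check` are introduced here only for illustration.

```python
import numpy as np

# Illustrative values of n; these particular choices are arbitrary
n_check = np.array([10, 100, 1000, 10000])

# Expected fraction of measurements missing from a bootstrap sample
frac_check = (1 - 1 / n_check) ** n_check

for n_val, f in zip(n_check, frac_check):
    print(f"n = {int(n_val):6d}: (1 - 1/n)^n = {f:.5f}")

print(f"1/e = {np.exp(-1):.5f}")
```

The values increase toward \(1/e\) as \(n\) grows.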

Use hacker stats to show that this is indeed true. Hint: Think about a convenient “data set” to use for drawing samples.

This is kind of fun; you’re investigating some theory behind hacker stats with hacker stats!

Solution

[1]:

import numpy as np

import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()


We will generate a data set that is just \(n\) consecutive integers. Our statistic from our bootstrap sample is then

\begin{align} \text{fraction omitted} = 1 - \frac{\text{number of unique entries in the bootstrap sample}}{n}. \end{align}

[2]:

def frac_omitted(data):
    """Fraction of the original measurements absent from a bootstrap sample."""
    return 1 - len(np.unique(data)) / len(data)
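To make the statistic concrete, here is a small hand-made example. The array `bs_sample` is a hypothetical bootstrap sample, written out by hand rather than drawn at random, from the data set 0, 1, 2, 3, 4. The function is repeated so the snippet runs on its own.

```python
import numpy as np


def frac_omitted(data):
    # Same statistic as defined above
    return 1 - len(np.unique(data)) / len(data)


# Hypothetical bootstrap sample from the data set [0, 1, 2, 3, 4];
# the values 1 and 4 were never drawn
bs_sample = np.array([0, 0, 2, 3, 3])

# Three unique entries out of five, so two of the five are omitted
f_example = frac_omitted(bs_sample)
print(f_example)
```

Note that this works only because the original measurements are all unique, so the number of unique entries in the sample counts how many of the original measurements were drawn.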

We will draw 10,000 bootstrap replicates of this for several values of \(n\) and make a plot of the mean. To facilitate this, we will use the draw_bs_reps() function we wrote in Exercise 8.1.

[3]:

rg = np.random.default_rng()

def draw_bs_reps(data, func, rg, size=1, args=()):
    """Draw `size` bootstrap replicates of `func` computed from `data`."""
    return np.array(
        [
            func(rg.choice(data, replace=True, size=len(data)), *args)
            for _ in range(size)
        ]
    )


def mean_frac_omitted(n, n_bs_reps=10000):
    """Mean fraction omitted over bootstrap samples of n consecutive integers."""
    data = np.arange(n)
    bs_reps = draw_bs_reps(data, frac_omitted, rg, size=n_bs_reps)

    return np.mean(bs_reps)


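# Log-spaced integer values of n from 1 to 1000; np.unique() drops the
# duplicates created by truncating to integers
n = np.unique(np.logspace(0, 3).astype(int))
mean_f = np.array([mean_frac_omitted(n_val) for n_val in n])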
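As an optional cross-check, the same quantity can be estimated without the Python-level loop over replicates. The following is a minimal sketch, not the approach used in this solution: because the data set is just the integers 0 through \(n-1\), all bootstrap samples can be drawn at once with `rg.integers()`. The names `rg_demo`, `mean_frac_omitted_fast`, `n_demo`, `est`, and `theory` are introduced here for illustration, and the seed is arbitrary.

```python
import numpy as np

# Seeded only so that this sketch is reproducible; the seed is arbitrary
rg_demo = np.random.default_rng(3252)


def mean_frac_omitted_fast(n, n_bs_reps=2000, rg=rg_demo):
    # Each row is one bootstrap sample of the integers 0, ..., n-1
    samples = rg.integers(0, n, size=(n_bs_reps, n))

    # Count unique entries per row: sort each row, then count the
    # positions where the value changes, plus one for the first entry
    samples.sort(axis=1)
    n_unique = 1 + np.count_nonzero(np.diff(samples, axis=1), axis=1)

    return np.mean(1 - n_unique / n)


n_demo = 100
est = mean_frac_omitted_fast(n_demo)
theory = (1 - 1 / n_demo) ** n_demo

print(f"hacker stats: {est:.4f}, theory: {theory:.4f}")
```

The trade-off is memory: this version stores an `n_bs_reps` by `n` array of integers, so for very large \(n\) the loop-based version may be preferable.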

Now that we have the mean fraction omitted for each value of \(n\), we can plot it, together with the theoretical curve.

[4]:

# Compute theoretical curve
n_theor = np.logspace(0, 3, 400)
mean_f_theor = (1 - 1/n_theor)**n_theor

# Set up figure
p = bokeh.plotting.figure(
    frame_height=250,
    frame_width=350,
    x_axis_label='n',
    y_axis_label='mean fraction omitted',
    x_axis_type='log',
    x_range=[0.9, 1100],
)

# Theoretical curve
p.line(
    x=n_theor,
    y=mean_f_theor,
    line_width=2
)

# Asymptotic result
p.line(
    x=[0.9, 1100],
    y=[1 / np.exp(1)] * 2,
    line_dash='dashed',
    line_color='gray',
)

# Result from hacker stats
p.circle(
    x=n,
    y=mean_f,
    color='orange'
)

bokeh.io.show(p)

The result closely matches theory!

Computing environment

[5]:

%load_ext watermark
%watermark -v -p numpy,bokeh,jupyterlab

Python implementation: CPython
Python version       : 3.9.12
IPython version      : 8.3.0

numpy     : 1.21.5
bokeh     : 2.4.2
jupyterlab: 3.3.2