Exercise 5.3: Bootstrapping “theory” with hacker stats
Say we have a data set with \(n\) unique measurements. It can be shown that on average a fraction of \((1-1/n)^n\) of the measurements do not appear in a bootstrap sample. Note that for large samples, this is approximately \(1/e \approx 1/2.7\), since
\begin{align} \lim_{n\to\infty} (1-1/n)^n = 1/e. \end{align}
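As a sketch of where this expression comes from: a bootstrap sample consists of \(n\) independent draws with replacement, and each draw misses a given measurement with probability \(1 - 1/n\). The probability that a given measurement \(i\) is absent from the whole sample is therefore
\begin{align} P(\text{measurement } i \text{ omitted}) = \left(1 - \frac{1}{n}\right)^n. \end{align}
By linearity of expectation, the expected number of omitted measurements is \(n\,(1-1/n)^n\), and dividing by \(n\) gives the expected fraction \((1-1/n)^n\).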
Use hacker stats to show that this is indeed true. Hint: Think about a convenient “data set” to use for drawing samples.
This is kind of fun; you’re investigating some theory behind hacker stats with hacker stats!
Solution
[1]:
import numpy as np
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
We will generate a data set that is just \(n\) consecutive integers. Our statistic from our bootstrap sample is then
\begin{align} \text{fraction omitted} = 1 - \frac{\text{number of unique entries in the bootstrap sample}}{n}. \end{align}
[2]:
def frac_omitted(data):
    """Fraction of the original measurements absent from a bootstrap sample."""
    return 1 - len(np.unique(data)) / len(data)
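As a quick sanity check of this statistic, the sketch below applies it to a small hand-built sample. The sample `[0, 0, 1, 3]` is an illustrative assumption, standing in for one bootstrap draw from the data set `[0, 1, 2, 3]`; the function body repeats the definition above so the snippet runs on its own.

```python
import numpy as np


def frac_omitted(data):
    """Fraction of the original measurements absent from a bootstrap sample."""
    return 1 - len(np.unique(data)) / len(data)


# Hypothetical bootstrap sample drawn from the data set [0, 1, 2, 3]:
# the value 0 appears twice and the value 2 never appears.
sample = np.array([0, 0, 1, 3])

# Three of the four original values are present, so one quarter is omitted.
print(frac_omitted(sample))
```

Note that the statistic relies on the original data set having unique entries and on the bootstrap sample having the same length as the data set.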
We will draw 10,000 bootstrap replicates of this statistic for several values of \(n\) and plot the mean. To do so, we will use the draw_bs_reps() function we wrote in Exercise 8.1.
[3]:
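rng = np.random.default_rng()


def draw_bs_reps(data, func, rng, size=1, args=()):
    """Draw `size` bootstrap replicates of the statistic `func` from `data`."""
    return np.array(
        [
            func(rng.choice(data, replace=True, size=len(data)), *args)
            for _ in range(size)
        ]
    )


def mean_frac_omitted(n, n_bs_reps=10000):
    """Mean fraction of n unique measurements omitted from a bootstrap sample."""
    data = np.arange(n)
    bs_reps = draw_bs_reps(data, frac_omitted, rng, size=n_bs_reps)
    return np.mean(bs_reps)


# Integer values of n from 1 to 1000, roughly log-spaced, duplicates removed
n = np.unique(np.logspace(0, 3).astype(int))
mean_f = np.array([mean_frac_omitted(n_val) for n_val in n])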
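The loop above calls `rng.choice()` once per replicate, which can take a while for larger \(n\). A possible vectorized alternative is sketched below. The function name `mean_frac_omitted_vectorized` and the seed are assumptions made for illustration and are not part of the original solution. The sketch draws all bootstrap samples at once as rows of an integer array and counts the unique entries per row by sorting.

```python
import numpy as np


def mean_frac_omitted_vectorized(n, n_bs_reps=10000, rng=None):
    """Mean fraction of n unique measurements missing from a bootstrap sample.

    Hypothetical vectorized variant, not part of the original solution.
    """
    if rng is None:
        rng = np.random.default_rng()

    # Each row is one bootstrap sample of the "data set" 0, 1, ..., n-1
    samples = rng.integers(0, n, size=(n_bs_reps, n))

    # After sorting a row, a new unique value starts wherever neighbors differ
    samples.sort(axis=1)
    n_unique = 1 + np.count_nonzero(np.diff(samples, axis=1), axis=1)

    return np.mean(1 - n_unique / n)


# Seeded only so that the printed numbers are reproducible
rng = np.random.default_rng(seed=3252)
for n_val in [1, 2, 10, 100]:
    estimate = mean_frac_omitted_vectorized(n_val, rng=rng)
    print(n_val, estimate, (1 - 1 / n_val) ** n_val)
```

The trade-off is memory: the array of samples has `n_bs_reps * n` entries, so for large \(n\) it may be preferable to process the replicates in batches.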
Now that we have the mean fraction omitted for each value of \(n\), we can plot it, together with the theoretical curve.
[4]:
# Compute theoretical curve
n_theor = np.logspace(0, 3, 400)
mean_f_theor = (1 - 1/n_theor)**n_theor

# Set up figure
p = bokeh.plotting.figure(
    frame_height=250,
    frame_width=350,
    x_axis_label='n',
    y_axis_label='mean fraction omitted',
    x_axis_type='log',
    x_range=[0.9, 1100],
)

# Theoretical curve
p.line(
    x=n_theor,
    y=mean_f_theor,
    line_width=2
)

# Asymptotic result
p.line(
    x=[0.9, 1100],
    y=[1 / np.exp(1)] * 2,
    line_dash='dashed',
    line_color='gray',
)

# Result from hacker stats
p.circle(
    x=n,
    y=mean_f,
    color='orange'
)

bokeh.io.show(p)
The result closely matches theory!
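To put numbers on how quickly the theoretical expression approaches the asymptote \(1/e\), the short sketch below evaluates \((1-1/n)^n\) for a few values of \(n\) and prints the gap to \(1/e\). The particular values of \(n\) are chosen for illustration only.

```python
import numpy as np

# Illustrative values of n
n_vals = np.array([1, 10, 100, 1000, 10**6])

# Theoretical mean fraction omitted and its distance from the limit 1/e
theory = (1 - 1 / n_vals) ** n_vals
gap = 1 / np.e - theory

for n_val, f, g in zip(n_vals, theory, gap):
    print(f"n = {int(n_val):>7d}: (1 - 1/n)^n = {f:.5f}, gap to 1/e = {g:.2e}")
```

The expression increases toward \(1/e\) from below, with the gap shrinking roughly in proportion to \(1/n\).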
Computing environment
[5]:
%load_ext watermark
%watermark -v -p numpy,bokeh,jupyterlab
Python implementation: CPython
Python version : 3.11.3
IPython version : 8.12.0
numpy : 1.24.3
bokeh : 3.1.1
jupyterlab: 3.6.3