Exercise 4.1: Confidence intervals for microtubule catastrophe
Refresh your memory of the microtubule catastrophe data we explored in previous exercises. We will work with this data set again here.
a) Remember that the confidence interval of the plug-in estimate of any statistical functional may be computed using bootstrapping. (This does not mean, however, that bootstrapping performs well for every statistical functional; some behave better than others.) This includes the ECDF itself. Computing and plotting confidence intervals for ECDFs is implemented in the iqplot.ecdf() function. Plot the ECDFs of the catastrophe times for microtubules with labeled tubulin and for those with unlabeled tubulin, including a confidence interval. Looking at the plot, do you think the two could be identically distributed?
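To make concrete what a bootstrap confidence interval of an ECDF is, here is a minimal sketch using only NumPy. It is not a description of how iqplot.ecdf() works internally. The data are synthetic Gamma-distributed stand-ins, not the actual catastrophe times, and the names `ecdf_at` and `bootstrap_ecdf_conf_int` are hypothetical helpers introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Synthetic stand-in for catastrophe times in seconds (NOT the real data set)
t = rng.gamma(shape=2.0, scale=200.0, size=150)


def ecdf_at(x, data):
    """Evaluate the ECDF of `data` at each point in `x`."""
    return np.searchsorted(np.sort(data), x, side="right") / len(data)


def bootstrap_ecdf_conf_int(data, x, n_reps=2000, ptiles=(2.5, 97.5), rng=rng):
    """Pointwise percentile bootstrap confidence interval of the ECDF at `x`."""
    reps = np.empty((n_reps, len(x)))
    for i in range(n_reps):
        # Resample the data with replacement and recompute the ECDF
        bs_sample = rng.choice(data, size=len(data), replace=True)
        reps[i] = ecdf_at(x, bs_sample)
    low, high = np.percentile(reps, ptiles, axis=0)
    return low, high


# Evaluate the ECDF and its confidence interval on a grid of times
x = np.linspace(0, t.max(), 200)
ecdf_vals = ecdf_at(x, t)
ecdf_low, ecdf_high = bootstrap_ecdf_conf_int(t, x)
```

The arrays `ecdf_low` and `ecdf_high` trace out the lower and upper edges of the confidence region at each grid point, which is the kind of band you would expect to see around each ECDF in the plot.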
b) Compute confidence intervals for the plug-in estimate for the mean time to catastrophe for each of the two conditions and comment on the result.
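As a reminder of the mechanics, a percentile bootstrap confidence interval for the mean can be sketched as follows. The two arrays are synthetic stand-ins for the labeled and unlabeled conditions, not the real measurements, and `bootstrap_mean_conf_int` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Synthetic stand-ins for the two conditions (NOT the real data set)
labeled = rng.gamma(shape=2.0, scale=220.0, size=200)
unlabeled = rng.gamma(shape=2.0, scale=200.0, size=95)


def bootstrap_mean_conf_int(data, n_reps=10000, ptiles=(2.5, 97.5), rng=rng):
    """Percentile bootstrap confidence interval of the plug-in estimate of the mean."""
    n = len(data)
    # Each row of idx indexes one bootstrap sample drawn with replacement
    idx = rng.integers(0, n, size=(n_reps, n))
    reps = data[idx].mean(axis=1)
    return np.percentile(reps, ptiles)


ci_labeled = bootstrap_mean_conf_int(labeled)
ci_unlabeled = bootstrap_mean_conf_int(unlabeled)

print("labeled mean conf. int. (s):  ", ci_labeled)
print("unlabeled mean conf. int. (s):", ci_unlabeled)
```

Whether the two intervals overlap, and by how much, is the kind of observation worth commenting on.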
c) In part (b), you used bootstrapping to compute a confidence interval for the plug-in estimate for the mean time to catastrophe. As is often (though definitely not always) the case, we can instead use a theoretical result to construct a confidence interval. The central limit theorem states that the mean, which is a scaled sum of many random variables, should be approximately Normally distributed. We will not derive it here, but the mean and variance of that Normal distribution can be estimated as
\begin{align} &\mu = \bar{x},\\[1em] &\sigma^2 = \frac{1}{n(n-1)}\sum_{i=1}^n (x_i - \bar{x})^2, \end{align}
where \(\bar{x}\) is the arithmetic mean of the data points and \(n\) is the number of data points. To compute a confidence interval of the mean, then, you can compute the interval that contains the central 95% of the probability mass of the Normal distribution described above. Compute this approximate confidence interval and compare it to the result you got in part (b). Hint: You can use the scipy.stats package to conveniently get intervals for named distributions.
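The calculation can be sketched as below. To keep the example dependent only on NumPy and the standard library, this sketch uses `statistics.NormalDist` rather than scipy.stats; with SciPy, `scipy.stats.norm.interval(0.95, loc=mu, scale=sigma)` should give the same interval. The data are synthetic, and `normal_mean_conf_int` is a hypothetical helper name.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(seed=3252)

# Synthetic stand-in for catastrophe times in seconds (NOT the real data set)
t = rng.gamma(shape=2.0, scale=200.0, size=150)


def normal_mean_conf_int(data, conf=0.95):
    """Central interval of the Normal approximation for the mean."""
    n = len(data)
    mu = np.mean(data)
    # sigma^2 = sum_i (x_i - xbar)^2 / (n (n - 1)), as in the expression above
    sigma = np.sqrt(np.sum((data - mu) ** 2) / (n * (n - 1)))
    dist = NormalDist(mu=float(mu), sigma=float(sigma))
    tail = (1 - conf) / 2
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)


low, high = normal_mean_conf_int(t)
print(f"95% conf. int. of the mean (s): [{low:.1f}, {high:.1f}]")
```

Note that \(\sigma\) here is the standard error of the mean, not the standard deviation of the data, so the interval narrows as \(n\) grows.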
d) We could alternatively use the Dvoretzky-Kiefer-Wolfowitz Inequality (DKW) to compute (bounds on) confidence intervals for an ECDF. The DKW inequality puts an upper bound on the maximum distance between the ECDF \(\hat{F}(x)\) and the generative CDF \(F(x)\). It states that, for any \(\epsilon > 0\),
\begin{align} P\left(\mathrm{sup}_x \left|F(x) - \hat{F}(x)\right| > \epsilon\right) \le 2\mathrm{e}^{-2 n \epsilon^2}, \end{align}
where \(n\) is the number of points in the data set. We could use this inequality to set up a bound for the confidence interval. To construct the bound on the \(100 \times (1-\alpha)\) percent confidence interval, we specify that
\begin{align} \alpha = 2\mathrm{e}^{-2 n \epsilon^2}, \end{align}
which gives
\begin{align} \epsilon = \sqrt{\frac{1}{2n}\,\log \frac{2}{\alpha}}. \end{align}
Then, the lower bound on the confidence interval is
\begin{align} L(x) = \max\left(0, \hat{F}(x) - \epsilon\right), \end{align}
and the upper bound is
\begin{align} U(x) = \min\left(1, \hat{F}(x) + \epsilon\right). \end{align}
Note that this is not strictly speaking a confidence interval, but rather a set of bounds for where the confidence interval can lie (it’s the DKW inequality after all).
Plot the upper and lower bounds for the 95% confidence interval as computed from the DKW inequality for the microtubule catastrophe data and comment on how the DKW bounds compare to bootstrap confidence intervals of the ECDF.
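The expressions for \(\epsilon\), \(L(x)\), and \(U(x)\) above translate almost directly into NumPy. The following is a minimal sketch evaluated at the sorted data points; the data are synthetic stand-ins, and `dkw_epsilon` and `dkw_bounds` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Synthetic stand-in for catastrophe times in seconds (NOT the real data set)
t = rng.gamma(shape=2.0, scale=200.0, size=150)


def dkw_epsilon(n, alpha=0.05):
    """Half-width epsilon = sqrt(log(2 / alpha) / (2 n)) from the DKW inequality."""
    return np.sqrt(np.log(2 / alpha) / (2 * n))


def dkw_bounds(data, alpha=0.05):
    """Return sorted data and the DKW lower and upper bounds at those points."""
    x = np.sort(data)
    n = len(x)
    ecdf = np.arange(1, n + 1) / n  # ECDF evaluated at the sorted data points
    eps = dkw_epsilon(n, alpha)
    lower = np.maximum(0, ecdf - eps)
    upper = np.minimum(1, ecdf + eps)
    return x, lower, upper


x_sorted, lower, upper = dkw_bounds(t)
```

As a sanity check on the scale, \(n = 100\) and \(\alpha = 0.05\) give \(\epsilon \approx 0.136\). Unlike a pointwise bootstrap interval, the band has the same half-width \(\epsilon\) at every \(x\) (before clipping to \([0, 1]\)), which is worth keeping in mind when comparing the two.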