Confidence interval of an ECDF
[1]:
import pandas as pd
import iqplot
import bokeh.io
bokeh.io.output_notebook()
We can compute confidence intervals for anything directly computed from the data. This includes the value of an ECDF for an arbitrary x. Here is how we can compute the confidence interval for the ECDF.
1. Generate a bootstrap sample of the data.
2. For each value x of your data set, evaluate the value of the ECDF at that point and record it. This is a bootstrap replicate of the ECDF at x.
3. Do steps 1 and 2 over and over until you get your desired number of bootstrap replicates.
4. For each value of x in your data set, compute the appropriate percentiles of your ECDF replicates. This gives you the ECDF confidence interval at x.
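The procedure above can be sketched directly with NumPy. This is only a minimal illustration, not how iqplot implements it: the function names `ecdf_at` and `ecdf_conf_int`, the default of 1000 replicates, and the synthetic Normal data are all assumptions made for the example. It uses the formal definition of the ECDF, namely that its value at x is the fraction of points in the sample that are less than or equal to x.

```python
import numpy as np


def ecdf_at(x, data):
    """Evaluate the ECDF of `data` at the points `x` (hypothetical helper).

    The ECDF at x is the fraction of data points less than or equal to x.
    """
    return np.searchsorted(np.sort(data), x, side="right") / len(data)


def ecdf_conf_int(data, n_reps=1000, ptiles=(2.5, 97.5), seed=None):
    """Bootstrap confidence interval of the ECDF (hypothetical helper).

    Returns the sorted data values and the lower and upper bounds of the
    confidence interval of the ECDF evaluated at each of those values.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    x = np.sort(data)

    reps = np.empty((n_reps, len(x)))
    for i in range(n_reps):
        # Step 1: draw a bootstrap sample (resample with replacement)
        bs_sample = rng.choice(data, size=len(data), replace=True)
        # Step 2: evaluate the ECDF of the bootstrap sample at every
        # measured x, including values absent from this bootstrap sample
        reps[i] = ecdf_at(x, bs_sample)

    # Step 4: percentiles of the replicates at each x
    low, high = np.percentile(reps, ptiles, axis=0)
    return x, low, high


# Example with synthetic data (assumed for illustration only)
rng = np.random.default_rng(3252)
data = rng.normal(size=50)
x, low, high = ecdf_conf_int(data, n_reps=500, seed=1)
```

Evaluating each bootstrap ECDF at the original data values, rather than at the values in the bootstrap sample, is what keeps the replicates aligned so that percentiles can be taken at each x.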
Step 2 is tricky because not all of your measured data points are present in each bootstrap sample, so you have to use the formal definition of the ECDF to get a replicate of the ECDF. Fortunately, the iqplot.ecdf() function does this for you when you pass the conf_int=True kwarg. I will now use it to plot ECDFs of the bee sperm data set.
[2]:
df = pd.read_csv("../data/bee_sperm.csv", comment="#")
bokeh.io.show(
    iqplot.ecdf(df, q="Alive Sperm Millions", cats="Treatment", conf_int=True)
)
There is a clear separation of the ECDFs, suggesting that alive sperm counts are indeed distributed differently in control and pesticide-treated bees, with the pesticide-treated counts shifted toward fewer alive sperm.
Computing environment
[3]:
%load_ext watermark
%watermark -v -p pandas,bokeh,iqplot,jupyterlab
Python implementation: CPython
Python version : 3.9.12
IPython version : 8.3.0
pandas : 1.4.2
bokeh : 2.4.2
iqplot : 0.2.5
jupyterlab: 3.3.2