Parallel bootstrap calculations
[2]:
try:
    import multiprocess
except ImportError:
    import multiprocessing as multiprocess

import os
import warnings

import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st

import tqdm

import bebi103

import bokeh.io
import bokeh.layouts
bokeh.io.output_notebook()
In the last part of this lesson, we saw that computing the MLE confidence intervals can take some time (several minutes). To speed up the calculation, we can compute the bootstrap replicates in parallel. That is, we split the sampling job into parts, have a different processing core work on each part, and then combine the results at the end.
To begin, we load in the data set (with data_path holding the path to the directory containing the data file).
[3]:
df = pd.read_csv(os.path.join(data_path, "singer_transcript_counts.csv"), comment="#")
And we also need the functions we used for performing the MLE calculation from earlier in the lesson.
[4]:
def log_like_iid_nbinom(params, n):
    """Log likelihood for i.i.d. NBinom measurements, parametrized
    by alpha, b=1/beta."""
    alpha, b = params

    if alpha <= 0 or b <= 0:
        return -np.inf

    return np.sum(st.nbinom.logpmf(n, alpha, 1/(1+b)))


def mle_iid_nbinom(n):
    """Perform maximum likelihood estimates for parameters for i.i.d.
    NBinom measurements, parametrized by alpha, b=1/beta"""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_iid_nbinom(params, n),
            x0=np.array([3, 3]),
            args=(n,),
            method='Powell'
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)
Explicit RNGs
When we parallelize the calculation, we have to be very careful about the random number generators we use. If we define a random number generator globally with rg = np.random.default_rng(), then each parallel process can inherit a copy of the same generator state and therefore draw the same sequence of pseudorandom numbers. We therefore need to make a new RNG for each of the parallel processes. As a result, we need to explicitly supply a random number generator to the function we use to generate samples. This changes the call signature to include the RNG, but requires no other changes.
[5]:
def gen_nbinom(alpha, b, size, rg):
    return rg.negative_binomial(alpha, 1 / (1 + b), size=size)
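As an aside, if you want parallel draws to be reproducible, one option (not used below, where each worker simply builds a freshly seeded generator) is to spawn independent child RNGs from a single NumPy SeedSequence and hand one to each worker. A minimal sketch, with an arbitrarily chosen seed:
# Spawn three statistically independent generators from one seed sequence
seed_seq = np.random.SeedSequence(3252)  # seed chosen arbitrarily for illustration
rgs = [np.random.default_rng(child) for child in seed_seq.spawn(3)]

# Each generator produces its own stream of Negative Binomial draws
samples = [rg.negative_binomial(3, 0.5, size=2) for rg in rgs]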
Next, we need to adjust the draw_parametric_bs_reps_mle() function to also have explicit specification of an RNG. If we do not supply one, a new one should be created within the function.
Finally, since the parallel version will supersede the draw_parametric_bs_reps_mle() function we have already written, we will prepend its name with an underscore, designating it as a private function (essentially a function an end user will not call directly), since we will not really call it again except for demonstration purposes.
[6]:
def _draw_parametric_bs_reps_mle(
    mle_fun, gen_fun, data, args=(), size=1, progress_bar=False, rg=None,
):
    """Draw parametric bootstrap replicates of maximum likelihood estimator.

    Parameters
    ----------
    mle_fun : function
        Function with call signature mle_fun(data, *args) that computes
        an MLE for the parameters
    gen_fun : function
        Function to randomly draw a new data set out of the model
        distribution parametrized by the MLE. Must have call
        signature `gen_fun(*params, size, *args, rg)`.
    data : one-dimensional Numpy array
        Array of measurements
    args : tuple, default ()
        Arguments to be passed to `mle_fun()`.
    size : int, default 1
        Number of bootstrap replicates to draw.
    progress_bar : bool, default False
        Whether or not to display progress bar.
    rg : numpy.random.Generator instance, default None
        RNG to be used in bootstrapping. If None, the default
        Numpy RNG is used with a fresh seed based on the clock.

    Returns
    -------
    output : numpy array
        Bootstrap replicates of MLEs.
    """
    if rg is None:
        rg = np.random.default_rng()

    params = mle_fun(data, *args)

    if progress_bar:
        iterator = tqdm.tqdm(range(size))
    else:
        iterator = range(size)

    return np.array(
        [mle_fun(gen_fun(*params, size=len(data), *args, rg=rg)) for _ in iterator]
    )
These functions still work as advertised. We can call _draw_parametric_bs_reps_mle() exactly as we did earlier in the lesson. To get 100 bootstrap replicates (only 100 so we don’t have to wait too long), we can do the following for Nanog.
[7]:
bs_reps = _draw_parametric_bs_reps_mle(
    mle_iid_nbinom, gen_nbinom, df["Nanog"].values, args=(), size=100, progress_bar=True
)
100%|████████████████████████████████████████████████████████████████████| 100/100 [00:02<00:00, 35.14it/s]
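A side benefit of the explicit RNG is reproducibility: if we pass in generators constructed from the same seed, we get the same replicates. A minimal sketch (seed chosen arbitrarily, and only a few replicates so it runs quickly):
# Identically seeded generators give identical bootstrap replicates
reps_a = _draw_parametric_bs_reps_mle(
    mle_iid_nbinom, gen_nbinom, df["Nanog"].values, size=3, rg=np.random.default_rng(3252)
)
reps_b = _draw_parametric_bs_reps_mle(
    mle_iid_nbinom, gen_nbinom, df["Nanog"].values, size=3, rg=np.random.default_rng(3252)
)

# Check that the two sets of replicates match
np.allclose(reps_a, reps_b)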
Enabling parallelization
To enable multiple cores to draw bootstrap replicates, we could use Python’s built-in multiprocessing module. This module has some problems running with Jupyter notebooks on some OSs, though. As a drop-in replacement, we could use Mike McKerns’s multiprocess package, which offers improvements on the built-in multiprocessing module. You can install multiprocess with
pip install --upgrade multiprocess
To enable seamless switching between these two multiprocessing packages, I included the following import statement.
try:
    import multiprocess
except ImportError:
    import multiprocessing as multiprocess
We can therefore use them interchangeably, referring to whichever package we end up importing as multiprocess. The syntax is as follows.
with multiprocess.Pool(n_jobs) as pool:
    result = pool.starmap(fun, arg_iterable)
Here, fun is a function and arg_iterable is an iterable that produces arguments to fun as tuples that will be splatted when passed to fun. This is best seen with a simple example before we get into the more complicated example of bootstrapping. We will choose a function that adds four arguments. Then, arg_iterable will be an iterable (in this case a list) of tuples, each of which has four elements.
[8]:
def fun(a, b, c, d):
    return a + b + c + d
arg_iterable = [(i, i+1, i+2, i+3) for i in range(8)]
# Display arg_iterable
arg_iterable
[8]:
[(0, 1, 2, 3),
(1, 2, 3, 4),
(2, 3, 4, 5),
(3, 4, 5, 6),
(4, 5, 6, 7),
(5, 6, 7, 8),
(6, 7, 8, 9),
(7, 8, 9, 10)]
If we call fun with one of the sets of arguments from arg_iterable using a splat operator, we get the appropriate sum.
[9]:
fun(*arg_iterable[0])
[9]:
6
To iterate through the arguments, evaluating fun for each argument tuple, we can work in parallel, for example using two cores.
[10]:
with multiprocess.Pool(2) as pool:
    result = pool.starmap(fun, arg_iterable)
# Take a look
result
[10]:
[6, 10, 14, 18, 22, 26, 30, 34]
This gives the same result as if we did the following.
[11]:
[fun(*args) for args in arg_iterable]
[11]:
[6, 10, 14, 18, 22, 26, 30, 34]
We see, then, that
with multiprocess.Pool(2) as pool:
    result = pool.starmap(fun, arg_iterable)
is the same as
result = [fun(*args) for args in arg_iterable]
We can use this same structure for our bootstrap replicates. Here, fun is _draw_parametric_bs_reps_mle, and we need to make a list of arguments to pass to that function. This means splitting size up into chunks of approximately size // n_jobs replicates each, while making sure the chunks sum to exactly size.
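For concreteness, with size=10000 and n_jobs=3, the chunking works out like this (just an illustration of the arithmetic used inside the function below):
size, n_jobs = 10000, 3

# Give each core size // n_jobs replicates...
sizes = [size // n_jobs for _ in range(n_jobs)]

# ...and add the remainder to the last chunk so the chunks sum to exactly `size`
sizes[-1] += size - sum(sizes)

sizes  # [3333, 3333, 3334]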
[12]:
def draw_parametric_bs_reps_mle(
    mle_fun, gen_fun, data, args=(), size=1, n_jobs=1, progress_bar=False
):
    """Draw parametric bootstrap replicates of maximum likelihood estimator.

    Parameters
    ----------
    mle_fun : function
        Function with call signature mle_fun(data, *args) that computes
        an MLE for the parameters
    gen_fun : function
        Function to randomly draw a new data set out of the model
        distribution parametrized by the MLE. Must have call
        signature `gen_fun(*params, size, *args, rg)`.
    data : one-dimensional Numpy array
        Array of measurements
    args : tuple, default ()
        Arguments to be passed to `mle_fun()`.
    size : int, default 1
        Number of bootstrap replicates to draw.
    n_jobs : int, default 1
        Number of cores to use in drawing bootstrap replicates.
    progress_bar : bool, default False
        Whether or not to display progress bar.

    Returns
    -------
    output : numpy array
        Bootstrap replicates of MLEs.
    """
    # Just call the original function if n_jobs is 1 (no parallelization)
    if n_jobs == 1:
        return _draw_parametric_bs_reps_mle(
            mle_fun, gen_fun, data, args=args, size=size, progress_bar=progress_bar
        )

    # Set up sizes of bootstrap replicates for each core, making sure we
    # get all of them, even if size % n_jobs != 0
    sizes = [size // n_jobs for _ in range(n_jobs)]
    sizes[-1] += size - sum(sizes)

    # Build arguments; rg is None so each process creates its own fresh RNG
    arg_iterable = [(mle_fun, gen_fun, data, args, s, progress_bar, None) for s in sizes]

    with multiprocess.Pool(n_jobs) as pool:
        result = pool.starmap(_draw_parametric_bs_reps_mle, arg_iterable)

    return np.concatenate(result)
Now, we can specify n_jobs
to be greater than one to take advantage of multiple cores. It is important to know how many cores you have on your machine. I usually keep one or two cores open for other processes on my computer so it doesn’t die on me. You can check how many cores you have using the multiprocess.cpu_count()
function.
[13]:
multiprocess.cpu_count()
[13]:
4
My computer has four cores, so I will use three of them. Now that I am unleashing three cores on the problem, I will again draw 10,000 replicates, and this time the calculation will take only about a third as long as it did the first time I did it.
[14]:
bs_reps = draw_parametric_bs_reps_mle(
    mle_iid_nbinom,
    gen_nbinom,
    df["Nanog"].values,
    args=(),
    size=10000,
    n_jobs=3,
    progress_bar=True,
)
100%|██████████████████████████████████████████████████████████████████| 3334/3334 [01:46<00:00, 31.23it/s]
100%|██████████████████████████████████████████████████████████████████| 3333/3333 [01:46<00:00, 31.21it/s]
100%|██████████████████████████████████████████████████████████████████| 3333/3333 [01:46<00:00, 31.18it/s]
From these replicates, we can compute the confidence intervals for \(\alpha\) and \(b\) for Nanog.
[15]:
np.percentile(bs_reps, [2.5, 97.5], axis=0)
[15]:
array([[ 1.09576658, 56.83344803],
[ 1.49960641, 82.32831776]])
Parameter estimation for all genes
We are now empowered to compute the MLE and confidence intervals for all four genes. We will do so with a parametric bootstrap. We can ignore the warnings, as they occur when the solver tries illegal (negative) parameter values; Powell’s method is immune to this issue.
[16]:
genes = ["Nanog", "Prdm14", "Rest", "Rex1"]

# Data frame to store results
df_mle = pd.DataFrame(
    columns=["gene", "parameter", "mle", "conf_int_low", "conf_int_high"]
)

# Perform MLE for each gene
for gene in tqdm.tqdm(genes):
    mle = mle_iid_nbinom(df[gene].values)

    bs_reps = draw_parametric_bs_reps_mle(
        mle_iid_nbinom,
        gen_nbinom,
        df[gene].values,
        args=(),
        size=10000,
        n_jobs=3,
        progress_bar=False,
    )
    conf_int = np.percentile(bs_reps, [2.5, 97.5], axis=0)

    sub_df = pd.DataFrame(
        {
            "gene": [gene] * 2,
            "parameter": ["alpha", "b"],
            "mle": mle,
            "conf_int_low": conf_int[0, :],
            "conf_int_high": conf_int[1, :],
        }
    )
    df_mle = pd.concat([df_mle, sub_df])
100%|███████████████████████████████████████████████████████████████████████| 4/4 [06:55<00:00, 103.89s/it]
We now have a data frame with the results from our statistical inference. Let’s take a look.
[17]:
df_mle
[17]:
|   | gene   | parameter | mle       | conf_int_low | conf_int_high |
|---|--------|-----------|-----------|--------------|---------------|
| 0 | Nanog  | alpha     | 1.263097  | 1.091002     | 1.496271      |
| 1 | Nanog  | b         | 69.347842 | 56.539883    | 82.969040     |
| 0 | Prdm14 | alpha     | 0.552886  | 0.453012     | 0.687179      |
| 1 | Prdm14 | b         | 8.200636  | 6.219361     | 10.512369     |
| 0 | Rest   | alpha     | 4.530335  | 3.871356     | 5.435219      |
| 1 | Rest   | b         | 16.543054 | 13.612535    | 19.339556     |
| 0 | Rex1   | alpha     | 1.634562  | 1.414209     | 1.929717      |
| 1 | Rex1   | b         | 84.680915 | 69.631135    | 99.831436     |
Finally, we can make a plot of the MLEs of the parameter values with confidence intervals.
[18]:
sub_df = df_mle.loc[df_mle["parameter"] == "alpha", :]

summaries = [
    {"estimate": est, "conf_int": conf, "label": label}
    for est, conf, label in zip(
        sub_df["mle"].values,
        sub_df[["conf_int_low", "conf_int_high"]].values,
        sub_df["gene"],
    )
]

p1 = bebi103.viz.confints(
    summaries,
    x_axis_label="α*",
    frame_height=75,
)

sub_df = df_mle.loc[df_mle["parameter"] == "b", :]

summaries = [
    {"estimate": est, "conf_int": conf, "label": label}
    for est, conf, label in zip(
        sub_df["mle"].values,
        sub_df[["conf_int_low", "conf_int_high"]].values,
        sub_df["gene"],
    )
]

p2 = bebi103.viz.confints(
    summaries,
    x_axis_label="b*",
    frame_height=75,
)

bokeh.io.show(bokeh.layouts.gridplot([p1, p2], ncols=1))
Implementation of MLE bootstrapping in the bebi103 package
The functionality we have developed in this lesson for bootstrapping MLE estimates is available in the function bebi103.bootstrap.draw_bs_reps_mle(). It has call signature
bebi103.bootstrap.draw_bs_reps_mle(
    mle_fun,
    gen_fun,
    data,
    mle_args=(),
    gen_args=(),
    size=1,
    n_jobs=1,
    progress_bar=False,
    rg=None,
)
Here, mle_fun is a function with call signature mle_fun(data, *mle_args) that computes the maximum likelihood estimate from the data. gen_fun is a function with call signature gen_fun(params, *gen_args, size, rg) that generates bootstrap samples to be used to make bootstrap replicates of the maximum likelihood estimates. Here, params is a tuple of the parameters of the model. Note that not all of the inputs for gen_fun are necessarily used, but having them allows for flexibility. The bebi103.bootstrap.draw_bs_reps_mle() function allows for both parametric and nonparametric bootstrapping.
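For reference, a parametric gen_fun for our Negative Binomial model would look something like the sketch below, given the call signature described above (the function name is ours, not part of the bebi103 package; note that params arrives as a single tuple here, rather than being splatted as in our home-grown functions).
def gen_nbinom_params_tuple(params, size, rg):
    # Parametric generator matching gen_fun(params, *gen_args, size, rg)
    # with no extra gen_args; params arrives as a single tuple (alpha, b)
    alpha, b = params
    return rg.negative_binomial(alpha, 1 / (1 + b), size=size)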
As an example, we will repeat the bootstrapped MLE confidence region calculation for Nanog, except we will do it nonparametrically this time.
[19]:
# This is gen_fun; for nonparametric bootstrapping it is just a resampling.
# params = (alpha, b) is ignored; gen_args = (n,) supplies the data to resample.
def resample_fun(params, n, size, rg):
    return rg.choice(n, size=size)
bs_reps = bebi103.bootstrap.draw_bs_reps_mle(
    mle_iid_nbinom,
    resample_fun,
    df['Nanog'].values,
    gen_args=(df['Nanog'].values, ),
    size=10000,
    n_jobs=3,
    progress_bar=True,
)
# Show the confidence interval
np.percentile(bs_reps, [2.5, 97.5], axis=0)
100%|██████████████████████████████████████████████████████████████████| 3333/3333 [01:35<00:00, 35.00it/s]
100%|██████████████████████████████████████████████████████████████████| 3333/3333 [01:35<00:00, 34.94it/s]
100%|██████████████████████████████████████████████████████████████████| 3334/3334 [01:35<00:00, 34.91it/s]
[19]:
array([[ 1.05638896, 57.23461435],
[ 1.5402667 , 82.63108233]])
The confidence intervals differ slightly from those of the parametric calculation, but are quite similar.
Computing environment
[20]:
%load_ext watermark
%watermark -v -p multiprocess,numpy,scipy,pandas,tqdm,bokeh,bebi103,jupyterlab
Python implementation: CPython
Python version : 3.8.12
IPython version : 7.27.0
multiprocess: 0.70.12.2
numpy : 1.21.2
scipy : 1.7.1
pandas : 1.3.3
tqdm : 4.62.2
bokeh : 2.3.3
bebi103 : 0.1.8
jupyterlab : 3.1.7