Confidence intervals for an MLE

[1]:
import warnings

import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st

import tqdm

import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
Loading BokehJS ...

In the previous part of this lesson, we obtained a maximum likelihood estimate for the parameters of a Negative Binomial model for mRNA counts for Nanog using the Singer et al. dataset. Here, we will compute the MLEs for the other three genes as well, and further compute confidence intervals for these MLEs.

MLE for all genes

We start by writing a function to compute the maximum likelihood estimate for our model. We have a set of Negative-Binomially distributed i.i.d. counts, parametrized by \(\alpha\) and \(b=1/\beta\), and we wish to compute the MLE. We worked out the code in the previous part of the lesson, so we simply reproduce the functions here.

[2]:
def log_like_iid_nbinom(params, n):
    """Log likelihood for i.i.d. NBinom measurements, parametrized
    by alpha, b=1/beta."""
    alpha, b = params

    if alpha <= 0 or b <= 0:
        return -np.inf

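    # scipy's nbinom is parametrized by (alpha, p) with success
    # probability p = 1/(1+b), so that the mean is alpha * b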
    return np.sum(st.nbinom.logpmf(n, alpha, 1/(1+b)))


def mle_iid_nbinom(n):
    """Perform maximum likelihood estimates for parameters for i.i.d.
    NBinom measurements, parametrized by alpha, b=1/beta"""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_iid_nbinom(params, n),
            x0=np.array([3, 3]),
            args=(n,),
            method='Powell'
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)

Now, we just have to load in the data and compute the MLEs.

[3]:
# Load DataFrame
df = pd.read_csv('../data/singer_transcript_counts.csv', comment='#')
genes = ["Nanog", "Prdm14", "Rest", "Rex1"]

# Data frame to store results
df_mle = pd.DataFrame(columns=['gene', 'parameter', 'mle'])

# Perform MLE for each gene
for gene in genes:
    mle = mle_iid_nbinom(df[gene].values)
    sub_df = pd.DataFrame({'gene': [gene]*2, 'parameter': ['alpha', 'b'], 'mle': mle})
    df_mle = pd.concat((df_mle, sub_df))

df_mle = df_mle.reset_index(drop=True)

Let’s take a look at the results.

[4]:
df_mle
[4]:
  gene parameter mle
0 Nanog alpha 1.263097
1 Nanog b 69.347842
2 Prdm14 alpha 0.552886
3 Prdm14 b 8.200636
4 Rest alpha 4.530335
5 Rest b 16.543054
6 Rex1 alpha 1.634562
7 Rex1 b 84.680915

Bootstrap confidence intervals

Remember that the MLE is a random variable, so we can compute confidence intervals for it. There are several approaches to computing confidence intervals for maximum likelihood estimates. We will explore two methods. They differ in how we use our model to generate bootstrap samples.

  1. In the case of mRNA counts, the data were generated out of an unknown distribution \(F(n)\). We do not know what it is, though we model it as Negative Binomial to compute an MLE. (We could choose other models; some will be closer to \(F(n)\) than others.) To generate a bootstrap sample, we sample out of the empirical distribution \(\hat{F}(n;\mathbf{n})\). We then use the Negative Binomial model to obtain MLEs for the burst frequency \(\alpha\) and burst size \(b\). These MLEs constitute our bootstrap replicates for the MLE, and we can compute confidence intervals from the replicates.

  2. Instead of approximating the generative distribution \(F(n)\) with \(\hat{F}(n;\mathbf{n})\), we approximate the generative distribution with the model distribution, in this case a Negative Binomial parametrized by the MLE. In other words, \(F(n) \approx F_\mathrm{NBinom}(n; \alpha^*, b^*)\). We then sample out of the model distribution to generate our bootstrap samples and compute the replicates from these. This is called the parametric bootstrap, and it seems to be the more widely used approach. It operates under the assumption that our model is indeed a good approximation of the generative distribution.

The first method is nonparametric in that we use the plug-in principle to approximate the generative distribution (and only use the model to compute the statistical functional), so we will refer to it as the nonparametric bootstrap for the MLE. I have not seen that terminology widely used, though, probably because of confusion with the fact that a parametric model is used to compute the MLE.
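
Schematically, the two approaches differ only in how each bootstrap sample is drawn before it is handed to the likelihood maximizer. Here is a minimal sketch; the helper function names are hypothetical and for illustration only, and full, reusable implementations follow below.

rg = np.random.default_rng()


def one_nonparametric_rep(n):
    """One replicate: resample counts from the empirical distribution,
    then find the MLE for the resampled data."""
    return mle_iid_nbinom(rg.choice(n, size=len(n)))


def one_parametric_rep(n, alpha_mle, b_mle):
    """One replicate: draw a new data set from the Negative Binomial
    model at the MLE, then find the MLE for the simulated data."""
    n_sim = rg.negative_binomial(alpha_mle, 1 / (1 + b_mle), size=len(n))
    return mle_iid_nbinom(n_sim)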

Nonparametric MLE confidence interval

For the nonparametric MLE confidence interval, we simply draw a bootstrap sample out of the original data set, as we have been doing, and then compute the MLE from it. Let's do that. We will use tqdm to give us a progress bar, since this calculation takes a few minutes.

[5]:
rg = np.random.default_rng(3252)

def draw_bs_sample(data):
    """Draw a bootstrap sample from a 1D data set."""
    return rg.choice(data, size=len(data))


def draw_bs_reps_mle(mle_fun, data, args=(), size=1, progress_bar=False):
    """Draw nonparametric bootstrap replicates of maximum likelihood estimator.

    Parameters
    ----------
    mle_fun : function
        Function with call signature mle_fun(data, *args) that computes
        an MLE for the parameters.
    data : one-dimensional Numpy array
        Array of measurements
    args : tuple, default ()
        Arguments to be passed to `mle_fun()`.
    size : int, default 1
        Number of bootstrap replicates to draw.
    progress_bar : bool, default False
        Whether or not to display progress bar.

    Returns
    -------
    output : numpy array
        Bootstrap replicates of MLEs.
    """
    if progress_bar:
        iterator = tqdm.tqdm(range(size))
    else:
        iterator = range(size)

    return np.array([mle_fun(draw_bs_sample(data), *args) for _ in iterator])

With these functions in hand, we can draw our bootstrap replicates. Note that the replicates will be returned as a two-dimensional array, with the first column being samples of \(\alpha^*\) and the second being samples of \(b^*\). We will do this for Nanog first. Normally we would compute 100,000 bootstrap samples, but to limit the runtime of the notebook, we will draw only 1,000.

[6]:
bs_reps = draw_bs_reps_mle(
    mle_iid_nbinom, df["Nanog"].values, size=1000, progress_bar=True
)
100%|███████████████████████████████████████████████████████████| 1000/1000 [03:24<00:00,  4.89it/s]

To get the confidence intervals, we can take the percentiles as we have done before. Note that we can use the axis kwarg to enable computing confidence intervals for both parameters in one statement.

[7]:
conf_int = np.percentile(bs_reps, [2.5, 97.5], axis=0)

# Take a look
conf_int
[7]:
array([[ 1.07137593, 57.60221936],
       [ 1.538369  , 81.86819681]])

So, the confidence interval for \(\alpha\) lies between 1.07 and 1.54, and that for \(b\) lies between 57.6 and 81.9.

Confidence regions

The confidence intervals we have just reported are the confidence intervals for the respective parameters, but we really should define a confidence region, which is the multidimensional extension of a confidence interval. (A confidence interval is the special case of a confidence region with only one parameter.)

Indeed, the confidence interval for \(\alpha\) comes from percentiles computed from the probability density function \(f(\alpha^*;\mathbf{n})\). (Strictly speaking, this is a probability mass function in this case because there are only finitely many unique bootstrap samples, but we do not need to worry about that distinction because there are a lot(!) of unique bootstrap samples.) This PDF is computed from the joint PDF of the MLEs as

\begin{align} f(\alpha^*;\mathbf{n}) = \int \mathrm{d}b^*\,f(\alpha^*, b^*;\mathbf{n}). \end{align}

The marginalization was done implicitly by just considering the bootstrap samples of the individual parameters, but bootstrap replicate \(i\) of \(\alpha^*\) is related to bootstrap replicate \(i\) of \(b^*\) because they were found together in a maximization of the likelihood function. We might want to plot the joint distribution. One way to visualize this is to just plot all of the bootstrap replicates.

[8]:
p = bokeh.plotting.figure(
    frame_width=300,
    frame_height=300,
    x_axis_label="α*",
    y_axis_label="b*"
)

p.circle(
    bs_reps[:,0],
    bs_reps[:,1],
    size=2,
    alpha=0.3
)

bokeh.io.show(p)
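
To put a number on what we see in the scatter plot, we can compute the Pearson correlation coefficient of the replicates. This is a quick check, not part of the bootstrap procedure itself.

# Pearson correlation between the α* and b* bootstrap replicates
np.corrcoef(bs_reps[:, 0], bs_reps[:, 1])[0, 1]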

There is a correlation between our MLE estimates for \(\alpha\) and \(b\). We can also visualize the joint distribution \(f(\alpha^*, b^*;\mathbf{n})\) using the bebi103.viz.corner() function.

[9]:
# Package replicates in data frame for plotting
df_res = pd.DataFrame(data=bs_reps, columns=["α*", "b*"])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    p_corner = bebi103.viz.corner(
        samples=df_res,
        parameters=["α*", "b*"],
        show_contours=True,
        levels = [0.95],
    )

bokeh.io.show(p_corner)

The plots on the diagonal contain the histograms of the bootstrap replicates for the MLE estimates. The off-diagonal plot shows all of the bootstrap replicates of the MLE with a contour plot overlaid. Note that the contour looks rough because we only have 1,000 samples; we would normally have 100 times that many. By selecting levels=[0.95], we have asked for a 95% confidence region; approximately 95% of all bootstrap replicates lie within the contour. This is not exact, though, as some smoothing is done to generate the contour, which is necessary for confidence regions in two or more dimensions because we have a finite number of samples.

An important note: These are not distributions of the true parameter values. In a frequentist setting, that has no meaning. These are distributions of the maximum likelihood estimates for the parameters, where we have estimated the generative distribution using the empirical distribution.

It is also useful to note that we can access the contour lines directly using the bebi103.viz.contour_lines_from_samples() function and can overlay them on the samples. We may also find such contours useful in other custom visualizations.

[10]:
# Get contour line
xs, ys = bebi103.viz.contour_lines_from_samples(bs_reps[:,0], bs_reps[:,1], levels=0.95)

# Overlay with sample
p.line(xs[0], ys[0], color='black', line_width=2)

bokeh.io.show(p)
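
As a quick sanity check of the claim that roughly 95% of the replicates lie within the 95% contour, we can count the replicates falling inside the extracted contour line. This sketch assumes Matplotlib is installed (it is not otherwise used in this lesson); we borrow its point-in-polygon test.

import matplotlib.path

# Treat the extracted contour line as a closed polygon and compute the
# fraction of bootstrap replicates lying inside it (should be ≈ 0.95)
contour = matplotlib.path.Path(np.column_stack((xs[0], ys[0])))
np.mean(contour.contains_points(bs_reps))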

Parametric confidence intervals

We repeat the calculation for a parametric confidence interval. As we write a function for this, we again need an mle_fun argument, as in draw_bs_reps_mle(), but we also need to provide a function that draws new data sets out of the parametric model.

[11]:
def draw_parametric_bs_reps_mle(
    mle_fun, gen_fun, data, args=(), size=1, progress_bar=False
):
    """Draw parametric bootstrap replicates of maximum likelihood estimator.

    Parameters
    ----------
    mle_fun : function
        Function with call signature mle_fun(data, *args) that computes
        an MLE for the parameters.
    gen_fun : function
        Function to randomly draw a new data set out of the model
        distribution parametrized by the MLE. Must have call
        signature `gen_fun(*params, size)`.
    data : one-dimensional Numpy array
        Array of measurements
    args : tuple, default ()
        Arguments to be passed to `mle_fun()`.
    size : int, default 1
        Number of bootstrap replicates to draw.
    progress_bar : bool, default False
        Whether or not to display progress bar.

    Returns
    -------
    output : numpy array
        Bootstrap replicates of MLEs.
    """
    params = mle_fun(data, *args)

    if progress_bar:
        iterator = tqdm.tqdm(range(size))
    else:
        iterator = range(size)

    return np.array(
        [mle_fun(gen_fun(*params, size=len(data)), *args) for _ in iterator]
    )

To draw parametric bootstrap replicates, we need only provide a function to generate samples from the model distribution. We directly use the Negative Binomial functionality of NumPy's random module.

[12]:
def gen_nbinom(alpha, b, size):
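    # NumPy's negative_binomial is parametrized by (alpha, p) with
    # p = 1/(1+b), matching the parametrization of the log likelihood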
    return rg.negative_binomial(alpha, 1 / (1 + b), size=size)

Let’s draw them for Nanog.

[13]:
bs_reps_parametric = draw_parametric_bs_reps_mle(
    mle_iid_nbinom,
    gen_nbinom,
    df["Nanog"].values,
    args=(),
    size=1000,
    progress_bar=True,
)
100%|███████████████████████████████████████████████████████████| 1000/1000 [02:48<00:00,  5.95it/s]

Now that we have our samples, we can compute the confidence intervals, as we did in the nonparametric case.

[14]:
np.percentile(bs_reps_parametric, [2.5, 97.5], axis=0)
[14]:
array([[ 1.10219859, 56.63858263],
       [ 1.49805102, 83.00888014]])
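
For a compact side-by-side view of the two approaches, we can tabulate both sets of intervals, reusing the conf_int array from the nonparametric calculation.

# Percentile confidence intervals from the parametric replicates
conf_int_parametric = np.percentile(bs_reps_parametric, [2.5, 97.5], axis=0)

# Tabulate nonparametric and parametric intervals side by side
pd.DataFrame(
    {
        "parameter": ["alpha", "b"],
        "nonparametric low": conf_int[0],
        "nonparametric high": conf_int[1],
        "parametric low": conf_int_parametric[0],
        "parametric high": conf_int_parametric[1],
    }
)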

The confidence intervals are indeed similar to those from the nonparametric case. We can also look at the corner plot.

[15]:
# Package replicates in data frame for plotting
df_res = pd.DataFrame(data=bs_reps_parametric, columns=["α*", "b*"])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    p = bebi103.viz.corner(
        samples=df_res,
        parameters=["α*", "b*"],
        show_contours=True,
        levels = [0.95],
    )

bokeh.io.show(p)

The corner plot is indeed similar to the nonparametric case.

We could compute confidence intervals for the other genes, but we will not do that here to save time, since each computation takes about three minutes. In the next lesson, we will parallelize the calculations to make them faster.

Computing environment

[16]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,tqdm,bokeh,bebi103,jupyterlab
Python implementation: CPython
Python version       : 3.9.12
IPython version      : 8.2.0

numpy     : 1.21.5
scipy     : 1.7.3
pandas    : 1.4.2
tqdm      : 4.64.0
bokeh     : 2.4.2
bebi103   : 0.1.11
jupyterlab: 3.3.2