Graphical model assessment for univariate models

[1]:
import warnings

import pandas as pd
import numpy as np
import scipy.optimize
import scipy.special
import scipy.stats as st

import numba

import bebi103

import iqplot

import bokeh.io
bokeh.io.output_notebook()
Loading BokehJS ...

Graphical model assessment

We have already discussed model assessment in general terms. In this lesson, we turn to implementing graphical model assessment.

We will use the transcript counts from Singer et al., this time focusing on the Rex1 gene, which shows clear bimodality. As a reminder, here is the ECDF of the transcript counts.

[2]:
df = pd.read_csv("../data/singer_transcript_counts.csv", comment="#")

p = iqplot.ecdf(
    data=df, q="Rex1", x_axis_label="Rex1 mRNA count", conf_int=True
)

bokeh.io.show(p)

Before we can proceed to the model assessments, we need to write functions to calculate the MLE.

Likelihood calculators

In this and the following parts of this lesson, we will need the functions we have written to perform maximum likelihood estimation for a Negative Binomial distribution and for a mixture of two Negative Binomials. These are copied directly from the lesson on mixture models, so you do not need to focus on this rather long code cell.

[3]:
def log_like_iid_nbinom(params, n):
    """Log likelihood for i.i.d. NBinom measurements, parametrized
    by alpha, b=1/beta."""
    alpha, b = params

    if alpha <= 0 or b <= 0:
        return -np.inf

    return np.sum(st.nbinom.logpmf(n, alpha, 1 / (1 + b)))


def mle_iid_nbinom(n):
    """Perform maximum likelihood estimates for parameters for i.i.d.
    NBinom measurements, parametrized by alpha, b=1/beta"""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_iid_nbinom(params, n),
            x0=np.array([3, 3]),
            args=(n,),
            method="Powell",
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError("Convergence failed with message", res.message)


def initial_guess_mix(n, w_guess):
    """Generate initial guess for mixture model."""
    n_low = n[n < np.percentile(n, 100 * w_guess)]
    n_high = n[n >= np.percentile(n, 100 * w_guess)]

    alpha1, b1 = mle_iid_nbinom(n_low)
    alpha2, b2 = mle_iid_nbinom(n_high)

    return alpha1, b1, alpha2, b2


def log_like_mix(alpha1, b1, alpha2, b2, w, n):
    """Log-likeihood of binary Negative Binomial mixture model."""
    # Fix nonidentifieability be enforcing values of w
    if w < 0 or w > 1:
        return -np.inf

    # Physical bounds on parameters
    if alpha1 < 0 or alpha2 < 0 or b1 < 0 or b2 < 0:
        return -np.inf

    logx1 = st.nbinom.logpmf(n, alpha1, 1 / (1 + b1))
    logx2 = st.nbinom.logpmf(n, alpha2, 1 / (1 + b2))

    # Multipliers for log-sum-exp
    lse_coeffs = np.tile([w, 1 - w], [len(n), 1]).transpose()

    # log-likelihood for each measurement
    log_likes = scipy.special.logsumexp(np.vstack([logx1, logx2]), axis=0, b=lse_coeffs)

    return np.sum(log_likes)


def mle_mix(n, w_guess):
    """Obtain MLE estimate for parameters for binary mixture
    of Negative Binomials."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_mix(*params, n),
            x0=[*initial_guess_mix(n, w_guess), w_guess],
            args=(n,),
            method="Powell",
            tol=1e-6,
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError("Convergence failed with message", res.message)

We can now carry out the MLE for both models.

[4]:
# Load in data
df = pd.read_csv("../data/singer_transcript_counts.csv", comment="#")
n = df['Rex1'].values

# Single Negative Binomial MLE
alpha, b = mle_iid_nbinom(n)

# Mixture model MLE
alpha1, b1, alpha2, b2, w = mle_mix(n, 0.2)

Constructing a Q-Q plot

We will start by making a Q-Q plot for the single Negative Binomial case. The function bebi103.viz.qqplot() generates Q-Q plots. In order to provide the function with samples out of the generative distribution, let’s draw out of a single Negative Binomial with our \(\alpha\), \(b\) parametrization. (Recall that in this parametrization, the probability of success passed to the NumPy and SciPy Negative Binomial functions is \(1/(1+b)\).)

[5]:
rg = np.random.default_rng()

single_samples = np.array(
    [rg.negative_binomial(alpha, 1 / (1 + b), size=len(n)) for _ in range(100000)]
)

With that in place, we can call bebi103.viz.qqplot() to generate the plot.

[6]:
p = bebi103.viz.qqplot(
    data=n,
    samples=single_samples,
    x_axis_label="transcript counts",
    y_axis_label="transcript counts",
)

bokeh.io.show(p)

Clearly, the generative model cannot produce the observed transcript counts, since the Q-Q plot shows strong separation of the generative quantiles from the observed quantiles.
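
To make explicit what the Q-Q plot compares, here is a minimal hand-rolled sketch (an illustration of the idea, not how bebi103.viz.qqplot() is implemented internally): sort the observed data, sort each generated data set, and compare the two quantile-for-quantile, with a percentile envelope taken across generated data sets.

[ ]:
# Hand-rolled Q-Q sketch; illustrative only, not the bebi103 implementation
data_sorted = np.sort(n)
samples_sorted = np.sort(single_samples, axis=1)

# Middle 95% of each quantile across the generated data sets
q_low, q_high = np.percentile(samples_sorted, [2.5, 97.5], axis=0)

p_qq = bebi103.viz.fill_between(
    x1=data_sorted,
    y1=q_high,
    x2=data_sorted,
    y2=q_low,
    patch_kwargs={"fill_alpha": 0.5},
    x_axis_label="transcript counts (measured)",
    y_axis_label="transcript counts (generated)",
)

# The line y = x indicates perfect agreement between model and data
p_qq.line(data_sorted, data_sorted, color="orange", line_width=2)

bokeh.io.show(p_qq)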

Predictive ECDFs

If we wish to accept a model, with the MLE for its parameters, as a good representation of the true generative process, we would like to see how data generated from that model would look. We can do a parametric bootstrap to generate data sets and then plot confidence intervals of the resulting ECDFs. We call ECDFs that come from the generative model predictive ECDFs. We can try that for the single Negative Binomial model.

Given the samples (which we already computed and stored in single_samples), we can compute a predictive ECDF for each one. We want the value of each predictive ECDF at every possible count \(n\), which requires evaluating an ECDF at arbitrary points. You wrote a function, ecdf(x, data), to do this in a homework problem. Referring to the homework solutions, we can use that function here.

[7]:
@numba.njit
def ecdf(x, data):
    """Give the value of an ECDF at arbitrary points x."""
    y = np.arange(len(data) + 1) / len(data)
    return y[np.searchsorted(np.sort(data), x, side="right")]
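
As a quick sanity check on this function, consider a toy data set (values chosen arbitrarily for illustration).

[ ]:
# ECDF of the toy data set [1, 1, 2] evaluated at x = 0, 1, 2;
# expected result: [0, 2/3, 1]
ecdf(np.array([0, 1, 2]), np.array([1, 1, 2]))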

Now we can compute the value of the ECDF for the respective values of \(n\).

[8]:
n_theor = np.arange(0, single_samples.max() + 1)

ecdfs = np.array([ecdf(n_theor, sample) for sample in single_samples])

Now that we have the ECDF values, we can compute the bounds of the middle 95% of the ECDF values at each value of \(n\).

[9]:
ecdf_low, ecdf_high = np.percentile(ecdfs, [2.5, 97.5], axis=0)

And now, we can make a plot.

[10]:
p = bebi103.viz.fill_between(
    x1=n_theor,
    y1=ecdf_high,
    x2=n_theor,
    y2=ecdf_low,
    patch_kwargs={"fill_alpha": 0.5},
    x_axis_label="n",
    y_axis_label="ECDF",
)

bokeh.io.show(p)

This is a predictive ECDF. It is the envelope in which we would expect 95% of the data points to lie if the generative model were true.

Comparing with data

We can overlay the ECDF of the data to see if it falls within this envelope.

[11]:
p = iqplot.ecdf(data=df, q='Rex1', palette=['orange'], p=p)

bokeh.io.show(p)

The Negative Binomial model generated some counts that were quite large, which is why the plot extends past counts of 1000. If you zoom in on the region where the data lie, the deviation of the measured ECDF from the ECDF generated by the model is clear.

We can also make this comparison with the mixture model, for which we need samples drawn from the mixture.
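
One way to generate them is sketched below (draw_mix() is a helper we introduce here for illustration; it is not part of the lesson's code): for each measurement, choose the low-copy component with probability w and the high-copy component otherwise, then draw from the chosen Negative Binomial.

[ ]:
def draw_mix(alpha1, b1, alpha2, b2, w, size, rg):
    """Draw `size` counts from the binary Negative Binomial mixture.
    Illustrative helper, not part of the lesson's codebase."""
    # Bernoulli trial deciding which component each draw comes from
    use_first = rg.uniform(size=size) < w

    draws_first = rg.negative_binomial(alpha1, 1 / (1 + b1), size=size)
    draws_second = rg.negative_binomial(alpha2, 1 / (1 + b2), size=size)

    return np.where(use_first, draws_first, draws_second)


mix_samples = np.array(
    [draw_mix(alpha1, b1, alpha2, b2, w, len(n), rg) for _ in range(100000)]
)

The resulting mix_samples array could then be passed to the same plotting functions we use for single_samples below.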

Using predictive ECDFs in the bebi103 package

The bebi103 package enables automated generation of predictive ECDFs.

[12]:
# Use discrete=True for discrete data
p = bebi103.viz.predictive_ecdf(
    samples=single_samples, data=n, discrete=True, x_axis_label="n"
)

bokeh.io.show(p)

Here we see that the predictive ECDF does not capture the measured data. If you zoom into the plot, you will notice that the bebi103.viz.predictive_ecdf() function plots quantiles in various shades of blue. By default, the outermost region (lightest blue) contains the middle 95% of the generated ECDFs, and the next band inward contains the middle 68%. The median ECDF from the generative model is the dark blue line in the middle.
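
Those shaded regions come from the same kind of percentile computation we did by hand above; for instance, the middle 68% envelope could be computed as in this sketch.

[ ]:
# Middle 68% envelope of the predictive ECDF values at each n,
# analogous to the 95% envelope computed earlier
ecdf_low_68, ecdf_high_68 = np.percentile(ecdfs, [16, 84], axis=0)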

ECDF difference

While the predictive ECDF plot is useful, it is perhaps more illuminating to plot the difference between the predictive ECDF and the measured ECDF. We can do this with the diff='ecdf' kwarg.

[13]:
p = bebi103.viz.predictive_ecdf(
    samples=single_samples, data=n, diff='ecdf', discrete=True, x_axis_label="n"
)

bokeh.io.show(p)

It is now clearer that most of the 279 data points lie outside the 95% envelope of the predictive ECDF.
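
We can make this claim quantitative with a rough check against the 95% envelope we computed by hand earlier (a sketch; it assumes every observed count is covered by n_theor, which runs from zero to the maximum generated count).

[ ]:
# Value of the data's ECDF at each measured count
data_ecdf_vals = ecdf(n, n)

# Since n_theor = 0, 1, 2, ..., the envelope value at count k is entry k
outside = (data_ecdf_vals < ecdf_low[n]) | (data_ecdf_vals > ecdf_high[n])

print(f"{outside.sum()} of {len(n)} data points lie outside the 95% envelope")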

Computing environment

[14]:
%load_ext watermark
%watermark -v -p numpy,pandas,scipy,bokeh,iqplot,bebi103,jupyterlab
Python implementation: CPython
Python version       : 3.9.12
IPython version      : 8.2.0

numpy     : 1.21.5
pandas    : 1.4.2
scipy     : 1.7.3
bokeh     : 2.4.2
iqplot    : 0.2.5
bebi103   : 0.1.11
jupyterlab: 3.3.2