Graphical model assessment: Predictive regression
[2]:
import warnings
import pandas as pd
import numpy as np
import scipy.optimize
import scipy.stats as st
import bebi103
import holoviews as hv
hv.extension('bokeh')
bebi103.hv.set_defaults()
import bokeh.io
bokeh.io.output_notebook()
/Users/bois/opt/anaconda3/lib/python3.8/site-packages/arviz/__init__.py:317: UserWarning: Trying to register the cmap 'cet_gray' which already exists.
register_cmap("cet_" + name, cmap=cmap)
/Users/bois/opt/anaconda3/lib/python3.8/site-packages/arviz/__init__.py:317: UserWarning: Trying to register the cmap 'cet_gray_r' which already exists.
register_cmap("cet_" + name, cmap=cmap)
We have just seen how a predictive ECDF can be used to graphically assess a model for the case of repeated measurements drawn out of a generative distribution. Now, we turn to a model that involves a regression.
The data set and models
We again use the data set from Good and coworkers that we used in our lessons on model building and regression. In the experiment, Good and coworkers measured the length of mitotic spindles in droplets of varying diameter.
Let’s take a quick look at the data.
[3]:
hv.extension("bokeh")
df = pd.read_csv(os.path.join(data_path, "good_invitro_droplet_data.csv"), comment="#")
scatter = hv.Scatter(
data=df, kdims=["Droplet Diameter (um)"], vdims=["Spindle Length (um)"]
)
scatter
[3]:
Good and coworkers proposed two models for the relationship between spindle length and droplet diameter. In the first model, the spindle length is independent of the droplet diameter and drawn out of a Normal distribution.
\begin{align} l_i \sim \text{Norm}(\phi, \sigma)\;\forall i. \end{align}
In the second model, the spindle length depends on the droplet diameter according to the equation below, and the spindle lengths vary from the theoretical model in a Normal fashion.
\begin{align} &\mu_i = \frac{\gamma d_i}{\left(1 + (\gamma d_i/\phi)^3\right)^{1/3}}\\[1em] &l_i \sim \text{Norm}(\mu_i, \sigma)\;\forall i. \end{align}
These are the two models we will graphically assess, starting with the second model, which establishes a relationship between spindle length and droplet diameter.
Assessing the model for spindle length dependent on droplet diameter
As was the case with our studies of the Singer, et al. data on mRNA counts, we do also need functions from previous lessons here to compute the MLE. You can skip this, since it is copied directly from previous lessons. The important result is that we get the MLE for the three parameters, \(\gamma\), \(\phi\), and \(\sigma\).
[4]:
def theor_spindle_length(gamma, phi, d):
"""Compute spindle length using mathematical model"""
return gamma * d / np.cbrt(1 + (gamma * d / phi)**3)
def log_likelihood(params, d, ell):
"""Log likelihood of spindle length model."""
gamma, phi, sigma = params
if gamma <= 0 or gamma > 1 or phi <= 0:
return -np.inf
mu = theor_spindle_length(gamma, phi, d)
return np.sum(st.norm.logpdf(ell, mu, sigma))
def spindle_mle(d, ell):
"""Compute MLE for parameters in spindle length model."""
with warnings.catch_warnings():
warnings.simplefilter("ignore")
res = scipy.optimize.minimize(
fun=lambda params, d, ell: -log_likelihood(params, d, ell),
x0=np.array([0.5, 35, 5]),
args=(d, ell),
method='Powell'
)
if res.success:
return res.x
else:
raise RuntimeError('Convergence failed with message', res.message)
mle_params = spindle_mle(df['Droplet Diameter (um)'].values, df['Spindle Length (um)'].values)
# Take a look
print("γ* = {0:f}\nφ* = {1:f}\nσ* = {2:f}".format(*mle_params))
γ* = 0.860475
φ* = 38.231250
σ* = 3.753422
Sampling out of the generative model
As before, we want to see what kind of data sets are predicted by the generative model when parametrized by the MLEs of the parameters. We therefore need to write a function to obtain the samples. The extra wrinkle here is that we need to also provide values for the droplet diameter for which we want the samples. In this model, we assume that the droplet diameters are measured exactly and that they determine the spindle length (of course neglecting the variation we model in the residuals).
[5]:
rg = np.random.default_rng()
def sample_spindle(gamma, phi, sigma, d, size=1):
"""Generate samples of spindle length vs droplet diameter."""
samples = np.empty((size, len(d)))
for i in range(size):
mu = theor_spindle_length(gamma, phi, d)
samples[i] = np.maximum(0, rg.normal(mu, sigma))
return samples
Note that this function returns a array of shape (size, len(d))
. That is, each row of the outputted array of samples corresponds to one set of measurements for prescribed droplet diameters d
.
We will generate samples for the droplet diameters going from zero to 250 microns.
[6]:
d_theor = np.linspace(0, 250, 200)
samples = sample_spindle(*mle_params, d_theor, size=10000)
Making the predictive regression plot
Now that we have our samples, we compute the percentiles of the spindle length for each value of \(d\). We then generate the plot from these percentiles, with the data overlaid. The bebi103.viz.predictive_regression()
function accomplishes this for us. We need to provide it with the samples of spindle length (with the shape we have already prescribed), as well as the \(d\)-values for which the samples were generated. We also need to provide the data as an array with two columns, the
first being the diameter and the second the spindle length.
[7]:
p = bebi103.viz.predictive_regression(
samples=samples,
samples_x=d_theor,
data=df[['Droplet Diameter (um)', 'Spindle Length (um)']].values,
x_axis_label='droplet diameter (µm)',
y_axis_label='spindle length (µm)',
)
bokeh.io.show(p)
In this plot, we see that roughly 5% of the data points (there are 670 total points) lie outside the middle 95%, so the model is consistent with the measured data.
In my view, this is how regression results should be plotted, not simple with a “best fit line”, which does not reveal the full generative model.
Viewing differences
We may also wish to get a plot of how different the data are from the model. In this case, the median spindle length is subtracted off of all samples such that the median is zero. We also need to subtract it from the data to make a comparison. That means that we have to draw our samples from the generative distribution for values of the droplet diameter that were actually measured.
[8]:
samples = sample_spindle(*mle_params, df['Droplet Diameter (um)'].values, size=10000)
p = bebi103.viz.predictive_regression(
samples=samples,
samples_x=df['Droplet Diameter (um)'].values,
data=df[['Droplet Diameter (um)', 'Spindle Length (um)']].values,
diff=True,
x_axis_label='droplet diameter (µm)',
y_axis_label='diff. spindle length (µm)',
)
bokeh.io.show(p)
Importantly, there does not seem to be any systematic way in which the measurements deviate from the model. In other words, whether a data point lies outside the 80% envelope and whether it lies above or below looks independent of droplet diameter.
Graphical assessment of the Normal model
As we have learned, the MLE for the parameters for the Normal model are the same as the plug-in estimates.
[9]:
phi = np.mean(df['Spindle Length (um)'])
sigma = np.std(df['Spindle Length (um)'])
We can then draw our samples directly out of a Normal distribution with location parameter \(\phi^*\) and scale parameter \(\sigma^*\).
[10]:
samples = rg.normal(phi, sigma, size=(5000, len(df)))
Given the samples, we can again make a predictive regression plot.
[11]:
p = bebi103.viz.predictive_regression(
samples=samples,
samples_x=df['Droplet Diameter (um)'].values,
percentiles=[68, 80, 95],
data=df[['Droplet Diameter (um)', 'Spindle Length (um)']].values,
x_axis_label='droplet diameter (µm)',
y_axis_label='spindle length (µm)',
)
bokeh.io.show(p)
Here, I have also included the 80% envelope. Again, only about 5% of the data points lie outside the 95% envelope. However, the data points tend to fall below the 80% envelope only for small droplet diameters and above only for large droplet diameters, suggesting that the model is missing the droplet-diameter dependence.
General method of graphical predictive model assessment
When using both predictive ECDFs and predictive regression curves, we did the same procedure, which you can imagine generalizing.
Compute the MLE of the parameters for the generative model in question.
Use the MLE to parametrize the model and generate many data sets out of the model.
Make a plot showing percentile regions of the generated data sets.
Overlay the measured data set.
Evaluate how many of the measured data points lie outside the percentile regions of the generated data sets.
Predictive ECDFs and regression curves are just two (often used) examples of this prescription, and you can develop your own as is useful for the model and data set you are working with.
Computing environment
[12]:
%load_ext watermark
%watermark -v -p numpy,pandas,scipy,bokeh,bebi103,jupyterlab
Python implementation: CPython
Python version : 3.8.12
IPython version : 7.29.0
numpy : 1.21.2
pandas : 1.3.4
scipy : 1.7.1
bokeh : 2.3.3
bebi103 : 0.1.8
jupyterlab: 3.2.1