Regression
[1]:
import warnings
import tqdm
import numpy as np
import pandas as pd
import scipy.optimize
import scipy.stats as st
import iqplot
import bebi103
import bokeh.io
bokeh.io.output_notebook()
We now proceed to do maximum likelihood estimation for the parameters of the second model for spindle length.
\begin{align} &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &l_i \sim \text{Norm}(\mu_i, \sigma) \;\forall i, \end{align}
The parameters for this model are \(\gamma\), \(\phi\), and \(\sigma\).
In this model, we have an independent variable, the droplet diameter, and a dependent variable, the spindle length. We refer to the independent variable(s) as a variate and the dependent variable(s) as a covariate. Models of this type are often called variate-covariate models. The procedure of parameter estimation for these models is commonly called regression, which I think is a bit of a misnomer, as regression has other meaning. I prefer to use “variate-covariate modeling,” a preference I developed after discussion with Michael Betancourt.
Finally, before we being doing MLE, as a reminder, we will load in and plot the data.
[2]:
df = pd.read_csv('../data/good_invitro_droplet_data.csv', comment='#')
p = bokeh.plotting.figure(
frame_width=350,
frame_height=250,
x_axis_label='droplet diameter (µm)',
y_axis_label='spindle length (µm)',
)
p.circle(df['Droplet Diameter (um)'], df['Spindle Length (um)'], alpha=0.7)
bokeh.io.show(p)
It is also useful to extract the data sets as Numpy arrays for speed in the MLE and bootstrap calculations.
[3]:
d = df['Droplet Diameter (um)'].values
ell = df['Spindle Length (um)'].values
Performing MLE by direct numerical optimization
We can take the same approach as we did with the mRNA counts from the smFISH experiments. We write a function for the log likelihood and then use numerical optimization to find the parameters.
The joint probability density function for the data set is the product of the probability densities of Normals. Therefore, the log likelihood is the sum of log probability densities of Normals. We can code that up using the convenient functions in the scipy.stats
module.
[4]:
def theor_spindle_length(gamma, phi, d):
"""Compute spindle length using mathematical model"""
return gamma * d / np.cbrt(1 + (gamma * d / phi)**3)
def log_likelihood(params, d, ell):
"""Log likelihood of spindle length model."""
gamma, phi, sigma = params
if gamma <= 0 or gamma > 1 or phi <= 0:
return -np.inf
mu = theor_spindle_length(gamma, phi, d)
return np.sum(st.norm.logpdf(ell, mu, sigma))
In looking at the code for the log likelihood, the convenience of the scipy.stats
module is clear. We more or less code up the model as we would say it!
We can now code up the MLE calculation. For our initial guesses, we can look at the data an make estimates. The initial slope of the \(l\) vs. \(d\) curve is about one-half, so we will guess an initial value of \(\gamma\) of 0.5. The plateau seems to be around 40 µm, so we will guess \(\phi = 40\) µm. Eyeballing the standard deviation in the data points, I would guess it is about 5 µm.
[5]:
def spindle_mle(d, ell):
"""Compute MLE for parameters in spindle length model."""
with warnings.catch_warnings():
warnings.simplefilter("ignore")
res = scipy.optimize.minimize(
fun=lambda params, d, ell: -log_likelihood(params, d, ell),
x0=np.array([0.5, 35, 5]),
args=(d, ell),
method='Powell'
)
if res.success:
return res.x
else:
raise RuntimeError('Convergence failed with message', res.message)
Let’s run the calculation to get MLEs for the parameters.
[6]:
mle_params = spindle_mle(d, ell)
mle_params
[6]:
array([ 0.86047455, 38.23124995, 3.75342168])
We can take a quick look as to how the theoretical curve might look with respect to the data.
[7]:
# Compute theoretical curve using MLE parameters
d_theor = np.linspace(0, 250, 200)
ell_theor = theor_spindle_length(*mle_params[:-1], d_theor)
p.line(d_theor, ell_theor, color='orange', line_width=2)
bokeh.io.show(p)
Of course, we are not using the full generative model here, since we are not using the parameter \(\sigma\). We will discuss in much more detail in the lesson on model assessment how to visualize the entire generative model as it relates to the measured data.
Least squares
The likelihood for this model is
\begin{align} L( \gamma, \phi, \sigma; \mathbf{l}, \mathbf{d}) = \prod_{i} f(l_i ; d_i, \gamma, \phi, \sigma) = \frac{1}{(2\pi \sigma^2)^{n/2}}\,\exp\left[-\frac{1}{2\sigma^2}\sum_{i}(l_i - \mu_i)^2\right], \end{align}
with
\begin{align} \mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}} \end{align}.
Notice that the parameters \(\gamma\) and \(\phi\) only appear in the parameter \(\mu\). Therefore, to find the parameter values of \(\gamma\) and \(\phi\) that maximize the likelihood, we need only to find the values that minimize
\begin{align} \text{RSS} \equiv \sum_{i}(l_i - \mu_i)^2, \end{align}
a quantity known as the residual sum of squares. This is true for whatever value \(\sigma\) takes. So, assume for a moment we find the MLEs \(\gamma^*\) and \(\phi^*\), which give us a residual sum of squares of \(\text{RSS}^*\). Then, the log-likelihood is
\begin{align} \ell(\gamma^*, \phi^*, \sigma) = -\frac{n}{2}\,\ln 2\pi - n\ln\sigma - \frac{\text{RSS}^*}{2\sigma^2}. \end{align}
The value of \(\sigma\) that maximizes the log-likelihood is then found by differentiating and setting to zero.
\begin{align} \frac{\partial \ell}{\partial \sigma} = -\frac{n}{\sigma} + \frac{\text{RSS}^*}{\sigma^3} = 0. \end{align}
This is solved to give \(\sigma^* = \sqrt{\text{RSS}^*/n}\).
So, we can split the optimization problem in two parts. First, find the values of parameters \(\gamma\) and \(\phi\) that minimize the residual sum of squares. Then, using these values, compute the MLE for \(\sigma\) using the analytical result we have just derived.
It turns out that this is quite powerful because the optimization problem for minimizing the residual sum of squares has nice properties. Importantly, we can use the powerful Levenberg-Marquardt algorithm to perform the optimization. A variant of this is implemented in scipy.optimization.least_squares()
(docs). To use that algorithm
you need to define a function for the residual; that is a function that returns the quantity \(l_i - \mu_i\). The function has call signature f(params, *args)
, where params
is a Numpy array containing the parameters you will use in the optimization (in our case, \(\gamma\) and \(\phi\)), and args
is a tuple of any other arguments passed to the function, which almost always include the data (in our case (d, ell)
). We can code up that function.
[8]:
def resid(params, d, ell):
"""Residual for spindle length model."""
return ell - theor_spindle_length(*params, d)
We can call scipy.optimize.least_squares()
to perform the MLE. We can also specify bounds as a tuple containing a list of the lower bounds of the respective parameters and a list of the upper bounds of the respective parameters. We bound \(\gamma\) between zero and one, and \(\phi\) just has to be positive.
[9]:
res = scipy.optimize.least_squares(
resid, np.array([0.5, 35]), args=(d, ell), bounds=([0, 0], [1, np.inf])
)
# Compute residual sum of squares from optimal params
rss_mle = np.sum(resid(res.x, d, ell)**2)
# Compute MLE for sigma
sigma_mle = np.sqrt(rss_mle / len(d))
# Take a look
res.x, sigma_mle
[9]:
(array([ 0.8601068 , 38.24676913]), 3.754999871544056)
Indeed, we got the same results as before. Let’s code this procedure up into a function. Looking ahead to using bebi103.bootstrap.draw_bs_reps_mle()
, we need the function to take the data set as a single argument.
[10]:
def spindle_mle_lstq(data):
"""Compute MLE for parameters in spindle length model."""
# Unpack data
d = data[:, 0]
ell = data[:, 1]
res = scipy.optimize.least_squares(
resid, np.array([0.5, 35]), args=(d, ell), bounds=([0, 0], [1, np.inf])
)
# Compute residual sum of squares from optimal params
rss_mle = np.sum(resid(res.x, d, ell)**2)
# Compute MLE for sigma
sigma_mle = np.sqrt(rss_mle / len(d))
return tuple([x for x in res.x] + [sigma_mle])
The big advantages of doing this are the robustness and speed of the Levenberg-Marquardt algorithm. Let’s do a speed test.
[11]:
print("Timing for MLE by Powell's method:")
%timeit spindle_mle(d, ell)
print("\nTiming for MLE by least_squares:")
data = df[['Droplet Diameter (um)', 'Spindle Length (um)']].values
%timeit spindle_mle_lstq(data)
Timing for MLE by Powell's method:
10.3 ms ± 53.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Timing for MLE by least_squares:
2.77 ms ± 1.83 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That’s a substantial speed boost of nearly 4×.
Very important caveat
We just did something you may have heard of. We minimized the sum of the squares of the residuals. Notice that this depended on the model for the residuals being Normal. This does not in general work for all models! We will also find that this does not in general work in a Bayesian setting with non-Uniform priors, which is almost always the case.
Confidence intervals
We can now proceed to compute confidence intervals for our parameters. There are many ways to do this, including pairs, residual, and wild bootstrap, which we will omit here. We will instead do a parametric bootstrap here. We use the generative model parametrized by the MLE to regenerate data sets. For each value of \(d\) (which are fixed), we draw values of \(l\) using the generative distribution). Then we perform MLE of those to get confidence intervals. To use
bebi103.draw_bs_reps_mle()
, then, we just are left to supply the bootstrap sample generating function.
[12]:
def gen_spindle_data(params, d, size, rng):
"""Generate a new spindle data set."""
mu = theor_spindle_length(*params[:-1], d)
sigma = params[-1]
return np.vstack((d, rng.normal(mu, sigma))).transpose()
Note that we were careful to make sure the data set has the appropriate dimensions; the row indexes the trial. We can now compute our replicates and confidence intervals.
[13]:
bs_reps = bebi103.bootstrap.draw_bs_reps_mle(
spindle_mle_lstq,
gen_spindle_data,
data=df[["Droplet Diameter (um)", "Spindle Length (um)"]].values,
mle_args=(),
gen_args=(d,),
size=10000,
n_jobs=2,
progress_bar=False,
)
# Compute confidence intervals
conf_ints = np.percentile(bs_reps, [2.5, 97.5], axis=0)
# Print the results
print(
"""95% Confidence intervals
γ: [ {0:.2f}, {1:.2f}]
φ: [{2:.2f}, {3:.2f}]
σ: [ {4:.2f}, {5:.2f}]
""".format(
*conf_ints.T.ravel()
)
)
95% Confidence intervals
γ: [ 0.83, 0.89]
φ: [37.53, 39.01]
σ: [ 3.55, 3.95]
The confidence intervals are given in each column for \(\gamma\), \(\phi\), and \(\sigma\), respectively, and are quite tight. This is because we have lots of data points. This does not mean that the data are tightly distributed about the theoretical curve, only that the MLE estimates for the parameters will not vary much if we repeated the experiment and analysis.
Computing environment
[14]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,bokeh,iqplot,bebi103,jupyterlab
Python implementation: CPython
Python version : 3.11.4
IPython version : 8.12.2
numpy : 1.24.3
scipy : 1.11.1
pandas : 2.0.3
bokeh : 3.2.1
iqplot : 0.3.4
bebi103 : 0.1.15
jupyterlab: 4.0.5