Variate-covariate models with MCMC
[2]:
import os

import polars as pl

import cmdstanpy
import arviz as az

import iqplot
import bebi103

import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()
In Part 1, we considered models for spindle size as a function of the size of the droplets encapsulating the spindles, using data from Good, et al., Science, 342, 856-860, 2013. Here, we will use MCMC to obtain parameter estimates. We start by loading in the data set and making a quick plot.
[3]:
# Directory containing the data set; adjust as needed for your setup
data_path = "data"

# Load in data frame
df = pl.read_csv(
    os.path.join(data_path, "good_invitro_droplet_data.csv"), comment_prefix="#"
)
# Pull out numpy arrays
ell = df["Spindle Length (um)"].to_numpy()
d = df["Droplet Diameter (um)"].to_numpy()
# Make a plot
p = bokeh.plotting.figure(
frame_height=200,
frame_width=300,
x_axis_label="droplet diameter (µm)",
y_axis_label="spindle length (µm)",
x_range=[0, 250],
y_range=[0, 50],
)
p.scatter(
source=df.to_dict(),
x="Droplet Diameter (um)",
y="Spindle Length (um)",
alpha=0.3,
)
bokeh.io.show(p)
Updated generative model
We will estimate parameters for the conserved tubulin model. In this model, the spindle length \(l\) is a function of droplet diameter \(d\) according to
\begin{align} l(d;\gamma, \phi) = \frac{\gamma d}{\left(1+(\gamma d/\phi)^3\right)^{\frac{1}{3}}}. \end{align}
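To build a little intuition for this function, here is a minimal numpy sketch (the helper name ell_theor mirrors the function we will define in Stan below; the parameter values are arbitrary). For small droplets, \(l \approx \gamma d\), and for large droplets, \(l \to \phi\).

import numpy as np

def ell_theor(d, gamma, phi):
    """Theoretical spindle length (µm) for droplet diameter d (µm)."""
    return gamma * d / np.cbrt(1 + (gamma * d / phi) ** 3)

# Small-droplet limit: l ≈ γd
print(ell_theor(1.0, 0.8, 40.0))     # ≈ 0.8 µm

# Large-droplet limit: l → φ
print(ell_theor(1000.0, 0.8, 40.0))  # ≈ 40 µm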
We will model the variation off of this curve as Normally distributed (assuming homoscedasticity), such that our likelihood is defined as
\begin{align} l_i \mid d_i, \gamma, \phi, \sigma \sim \text{Norm}(l(d_i;\gamma, \phi), \sigma) \;\forall i. \end{align}
We will choose weakly informative priors. This is a departure from our approach the previous time we studied this data set, in which we used Jeffreys priors. The Jeffreys priors allowed for an analytical treatment, but they did not codify our prior knowledge. If we wish to properly encode our prior knowledge in the prior, we usually have to sacrifice analytical tractability. But that’s ok, since we have MCMC!
We will build the prior for the three parameters \(\phi\), \(\gamma\), and \(\sigma\), taking each to be independent of the others. Let’s start with \(\sigma\). It is possible that the spindle size is very carefully controlled. It is also possible that it could be highly variable. So, we choose a Half-Normal prior for \(\sigma\) with a large scale parameter of 10 µm; \(\sigma \sim \text{HalfNorm}(10)\). For \(\gamma\), we know that physically it must lie between zero and one (as we have worked out in previous lessons), so we will assume a prior that is close to uniform on that domain, but with the probability density dropping right at zero and one; \(\gamma \sim \text{Beta}(1.1, 1.1)\). For \(\phi\), the typical largest spindle size, we assume the smallest spindle to be about a micron (right where I would start to feel uncomfortable betting the farm) and the largest to be about a millimeter (the size of a very large cell, like a Xenopus egg). We therefore take \(\log_{10}\phi \sim \text{Norm}(1.5, 0.75)\), where \(\phi\) is in units of microns.
We thus have our complete generative model.
\begin{align} &\log_{10} \phi \sim \text{Norm}(1.5, 0.75),\\[1em] &\gamma \sim \text{Beta}(1.1, 1.1), \\[1em] &\sigma \sim \text{HalfNorm}(10),\\[1em] &\mu_i = \frac{\gamma d_i}{\left(1+(\gamma d_i/\phi)^3\right)^{\frac{1}{3}}}, \\[1em] &l_i \mid d_i, \gamma, \phi, \sigma \sim \text{Norm}(\mu_i, \sigma) \;\forall i. \end{align}
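As a quick sanity check on these priors (a sketch; the seed and sample size are arbitrary), we can draw parameter values from them with numpy and confirm that \(\phi\) spans roughly a micron to a millimeter.

import numpy as np

rng = np.random.default_rng(3252)

# Draw parameters from the priors
phi = 10 ** rng.normal(1.5, 0.75, size=10_000)
gamma = rng.beta(1.1, 1.1, size=10_000)
sigma = np.abs(rng.normal(0.0, 10.0, size=10_000))  # Half-Normal via folding

# Middle 95% of the prior for φ, in µm
print(np.percentile(phi, [2.5, 97.5]))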
Using Stan to sample
The Stan code we will use is below; it is saved in the file spindle.stan.
functions {
  // Theoretical spindle length as a function of droplet diameter
  real ell_theor(real d, real phi, real gamma_) {
    real denom_ratio = (gamma_ * d / phi)^3;
    return gamma_ * d / cbrt(1 + denom_ratio);
  }
}


data {
  int N;
  array[N] real d;
  array[N] real ell;
}


parameters {
  real log10_phi;
  real<lower=0, upper=1> gamma_;
  real<lower=0> sigma;
}


transformed parameters {
  real phi = 10^log10_phi;
}


model {
  // Priors
  log10_phi ~ normal(1.5, 0.75);
  gamma_ ~ beta(1.1, 1.1);
  sigma ~ normal(0.0, 10.0);

  // Likelihood
  for (i in 1:N) {
    ell[i] ~ normal(ell_theor(d[i], phi, gamma_), sigma);
  }
}
I pause to point out a few syntactical elements.
- Note how I used `gamma_` as the variable name for `gamma`. This is because `gamma` is used as a distribution name in Stan.
- I have used the `functions` block in this Stan program. The definition of the function is `real ell_theor(real d, real phi, real gamma_) { }`, where the code to be executed in the function goes between the braces. This is how you declare a function in Stan. The first word, `real`, specifies the data type the function returns. Then comes the name of the function, followed by its arguments in parentheses, each preceded by its data type.
- Within the function, I had to declare `denom_ratio` to be `real`. Remember, Stan is statically typed, so every variable needs a declaration of its type.
- The `return` statement works like Python’s: `return`, followed by a space and then an expression for the value to be returned.
- There is no Half-Normal distribution in Stan. It is implemented by putting a bound on the variable (in this case `real<lower=0> sigma;`) and then modeling `sigma` as being Normally distributed with location parameter zero, as verified in the quick check after this list.
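To see why the bound-plus-Normal construction gives a Half-Normal, note that a zero-location Normal constrained to positive values simply folds its density over the origin. A quick numerical check with scipy (a sketch):

import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3252)

# Folding Normal(0, 10) draws over the origin gives Half-Normal(10) draws
folded = np.abs(rng.normal(0.0, 10.0, size=100_000))

print(np.mean(folded))               # ≈ 7.98
print(st.halfnorm.mean(scale=10.0))  # 10·√(2/π) ≈ 7.98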
All right! Let’s do some sampling!
[4]:
data = dict(
N=len(df),
d=df["Droplet Diameter (um)"].to_numpy(),
ell=df["Spindle Length (um)"].to_numpy(),
)
with bebi103.stan.disable_logging():
sm = cmdstanpy.CmdStanModel(stan_file='spindle.stan')
samples = sm.sample(
data=data,
chains=4,
iter_sampling=1000
)
samples = az.from_cmdstanpy(samples)
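Before plotting, it is worth a quick glance at the sampler diagnostics, e.g. with ArviZ’s summary (a sketch; for this simple model, R-hat and the effective sample sizes should look healthy):

# Tabulate posterior summaries, including R-hat and effective sample size
az.summary(samples, var_names=["phi", "gamma_", "sigma"])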
With samples in hand, let’s make a plot of the posterior samples. Remember that when we plotted the posterior by brute force, we had to analytically marginalize out \(\sigma\) (which we could do because we used a Jeffreys prior). Here, we can just plot our samples of \(\phi\) and \(\gamma\) and ignore the \(\sigma\) samples. So, we can look at a corner plot and check out the \(\phi\)-\(\gamma\) plane.
[5]:
bokeh.io.show(
bebi103.viz.corner(
samples,
parameters=[("phi", "ϕ"), ("gamma_", "γ"), ("sigma", "σ")],
plot_ecdf=True,
)
)
The ECDFs of the samples depicted in the corner plot enable quick reference to the marginalized credible intervals.
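If we want numbers rather than a plot, the marginal credible intervals can also be computed directly from the posterior draws, e.g. with numpy percentiles (a sketch):

import numpy as np

# Flatten the chains and compute a 95% credible interval for φ
phi_samples = samples.posterior["phi"].values.ravel()
print(np.percentile(phi_samples, [2.5, 97.5]))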
We now have a complete picture of the posterior, computed directly from our samples. We will eschew plotting best-fit lines, since that is not a preferred method of visualization, and defer displaying the model together with the data until we do posterior predictive checks.
Vectorized models
In the above Stan code, we explicitly looped over the data points in calculating the theoretical spindle length and in assigning a Normal likelihood to each measured spindle length. While this works just fine, Stan also supports vectorization, which allows us to proceed without looping.
Consider the Stan code below, saved in the file spindle_vectorized.stan.
functions {
  // Theoretical spindle length, computed elementwise for a vector of diameters
  vector ell_theor(vector d, real phi, real gamma_, int N) {
    vector[N] denom_ratio = (gamma_ * d / phi).^3;
    return gamma_ * d ./ cbrt(1 + denom_ratio);
  }
}


data {
  int N;
  vector[N] d;
  vector[N] ell;
}


parameters {
  real log10_phi;
  real<lower=0, upper=1> gamma_;
  real<lower=0> sigma;
}


transformed parameters {
  real phi = 10^log10_phi;
}


model {
  // Priors
  log10_phi ~ normal(1.5, 0.75);
  gamma_ ~ beta(1.1, 1.1);
  sigma ~ normal(0.0, 10.0);

  // Vectorized likelihood
  ell ~ normal(ell_theor(d, phi, gamma_, N), sigma);
}
Here, we have defined `ell` and `d` as `vector` data types. Further, the function `ell_theor()` returns a `vector`. A vector is a one-dimensional array of real numbers (strictly speaking, represented as a column vector in a linear algebraic sense). Multiplication of a vector by a scalar, as in `gamma_ * d`, results in elementwise multiplication of the entries of the vector by the scalar value. However, if we want to do elementwise operations on two vectors, or raise a vector to a power, we need to put a `.` in front of the operator. For example, in the above code, we want to do elementwise division of two vectors, so we use the `./` operator. Similarly, to cube every entry of a vector, we use the `.^` operator.
Also notice that we write
ell ~ normal(ell_theor(d, phi, gamma_, N), sigma);
Stan automatically takes each entry of the vector `ell` to be Normally distributed about the corresponding entry of `ell_theor(d, phi, gamma_, N)` in an i.i.d. manner.
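As a rough Python analogy (not Stan’s actual implementation), this vectorized statement contributes the same log likelihood as summing i.i.d. Normal log densities about the theoretical curve:

import numpy as np
import scipy.stats as st

def log_likelihood(ell, d, phi, gamma, sigma):
    """Sum of i.i.d. Normal log densities about the theoretical curve."""
    mu = gamma * d / np.cbrt(1 + (gamma * d / phi) ** 3)
    return np.sum(st.norm.logpdf(ell, loc=mu, scale=sigma))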
Vectorized Stan code can be more concise and can provide a speed boost. The speed boost is modest in this case, and is often only modest, so there is not much lost in writing out `for` loops if that is more to your liking.
Let’s sample using the vectorized code to show that it gives the same results.
[6]:
with bebi103.stan.disable_logging():
sm = cmdstanpy.CmdStanModel(stan_file='spindle_vectorized.stan')
samples = sm.sample(
data=data,
chains=4,
iter_sampling=1000
)
samples = az.from_cmdstanpy(samples)
And the visualization…
[7]:
bokeh.io.show(
bebi103.viz.corner(
samples,
parameters=[("phi", "ϕ"), ("gamma_", "γ"), ("sigma", "σ")],
plot_ecdf=True,
)
)
[8]:
bebi103.stan.clean_cmdstan()
Computing environment
[9]:
%load_ext watermark
%watermark -v -p numpy,polars,cmdstanpy,arviz,bokeh,bebi103,jupyterlab
print("cmdstan :", bebi103.stan.cmdstan_version())
Python implementation: CPython
Python version : 3.12.4
IPython version : 8.25.0
numpy : 1.26.4
polars : 1.3.0
cmdstanpy : 1.2.4
arviz : 0.19.0
bokeh : 3.4.1
bebi103 : 0.1.22
jupyterlab: 4.0.13
cmdstan : 2.35.0