Building a generative model

The process of model building usually involves starting with a cartoon model, mathematizing it, and then forming that mathematical model into a statistical model that accounts for measurement noise. Sometimes this process is very simple, and sometimes it involves careful and difficult modeling.

Probability distributions: The building blocks

The building blocks of a generative model are probability distributions. A given probability distribution has a story, a description of how data are generated from it. In many cases, matching a theoretical model for your system of interest to the story of a named distribution is sufficient to build a generative model. In other cases, you may need to convert the story of your theoretical model into a mathematical representation of a probability distribution, usually as a probability density or probability mass function.

In describing stories of distributions, the concepts of a Bernoulli trial and of a Poisson process are useful.

Bernoulli trial

A Bernoulli trial is an experiment with two outcomes, which can be encoded as success (\(y=1\)) or failure (\(y = 0\)). The words “success” and “failure” do not necessarily carry the positive or negative connotations they have in everyday usage. They are just names for the encodings of the outcomes.

Poisson process

Rare events occur with a rate \(\lambda\) per unit time. There is no “memory” of previous events; i.e., that rate is independent of time. A process that generates such events is called a Poisson process. The occurrence of a rare event in this context is referred to as an arrival.

Importantly, (bio)chemical processes are often well-described as Poisson processes.
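To make the story concrete, here is a minimal sketch of simulating arrivals of a Poisson process in Python, assuming NumPy is available; the rate and number of arrivals are arbitrary values chosen for illustration. The key feature is that the waiting times between arrivals are i.i.d. and Exponentially distributed.

```python
import numpy as np

rng = np.random.default_rng()

# Illustrative rate of arrivals per unit time (not from any real data set)
lam = 0.5

# Interarrival times of a Poisson process are i.i.d. and Exponentially
# distributed with mean 1/lam; their cumulative sums are the arrival times.
interarrival_times = rng.exponential(scale=1 / lam, size=10)
arrival_times = np.cumsum(interarrival_times)
```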

Check out the stories, PDFs, and PMFs in the Distribution Explorer

In addition to syntax for sampling out of distributions, the Distribution Explorer gives stories, PDFs, PMFs, and other useful information about many named distributions. I encourage you to read through it to help build a toolbox for generative modeling.
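As a taste of what such syntax might look like, here is a minimal sketch of drawing samples out of a Normal distribution and evaluating its PDF and CDF using NumPy and SciPy; the parameter values are purely illustrative.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng()

# Draw samples out of a Normal distribution (illustrative parameters)
samples = rng.normal(loc=1.0, scale=0.1, size=1000)

# Evaluate the PDF and CDF of the same distribution
y = np.linspace(0.5, 1.5, 200)
pdf = st.norm.pdf(y, loc=1.0, scale=0.1)
cdf = st.norm.cdf(y, loc=1.0, scale=0.1)
```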

Examples of generative models

It is often easiest to learn from example. I present here some examples of how we might come up with generative models.

The size of eggs laid by C. elegans

The experiment here is repeated measurement of the lengths of eggs laid by C. elegans worms. We do not pretend to know much about how the process of egg generation sets egg length. Surely many processes are involved, and we choose to model the egg length as being Normally distributed, since its story (the net result of many additive processes) roughly matches what we would expect. We further assume that the length of each egg we measure is independent of all of the other eggs we measure, and that the distribution we use to describe the length of any given egg is the same as for any other. That is to say that the egg lengths are independent and identically distributed, abbreviated i.i.d.

We can then write down the probability density function for the length of egg \(i\), \(y_i\), as

\[\begin{align} f(y_i; \mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}}\,\mathrm{e}^{-(y_i-\mu)^2/2\sigma^2}, \end{align}\]

which is the PDF for a Normal distribution. Since each measurement is independent, the PDF for the joint distribution of all measurements, \(\mathbf{y} = \{y_1, y_2, \ldots, y_n\}\), is given by the product of the PDFs of the individual measurements.

\[\begin{split}\begin{align} f(\mathbf{y}; \mu, \sigma) &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}}\,\mathrm{e}^{-(y_i-\mu)^2/2\sigma^2} \\[1em] &= \left(\frac{1}{2\pi \sigma^2}\right)^{n/2}\,\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\mu)^2\right]. \end{align}\end{split}\]

The PDF has all of the information for the generative model. Importantly, the statistical model dictates what parameters you are trying to estimate. In this case, there are two parameters, \(\mu\) and \(\sigma\). The generative model tells us that we can infer the characteristic egg length \(\mu\) and the variance, \(\sigma^2\).
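When we do parameter estimation, it is often more convenient to work with the logarithm of this joint PDF. As a minimal sketch (the function name is hypothetical, and the code assumes NumPy is available), we can code it up directly from the expression above.

```python
import numpy as np


def log_joint_pdf(y, mu, sigma):
    """Log of the joint PDF of i.i.d. Normally distributed measurements y."""
    n = len(y)
    return -n / 2 * np.log(2 * np.pi * sigma**2) - np.sum((y - mu) ** 2) / (2 * sigma**2)
```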

In this model, we skipped over the cartoon model and the mathematization and went directly to the generative statistical model, since the former two models are trivial.

Short-hand model definition

Writing out the mathematical expression for the PDF can be cumbersome, even for a relatively simple model like we have here. In English, the model is, “The egg lengths are i.i.d. and are Normally distributed with mean \(\mu\) and standard deviation \(\sigma\).” A shorthand for this is

\[\begin{align} y_i \sim \text{Norm}(\mu, \sigma) \;\forall i. \end{align}\]

This is read just like the English sentence describing the model. The tilde symbol means “is distributed as.”
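Because the model is generative, we can draw simulated data sets directly from it. A minimal sketch, with purely illustrative parameter values (not real measurements):

```python
import numpy as np

rng = np.random.default_rng()

# Illustrative parameter values; in practice these are inferred from data
mu = 50.0
sigma = 3.0
n = 50

# y_i ~ Norm(mu, sigma) for all i
y = rng.normal(mu, sigma, size=n)
```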

The amount of time before microtubule catastrophe

We have already discussed the experiment of Gardner, et al., in which they measured the time to catastrophe of microtubules. Let us consider here three models for that experiment.

Catastrophe as a Poisson process

We may consider the case where catastrophe is itself a Poisson process (or triggered by the arrival of a single Poisson process). Written in shorthand, this is

\[\begin{align} &t_i \sim \text{Expon}(\beta) \;\forall i. \end{align}\]

Implicit in this expression is that we are modeling catastrophe events as i.i.d. There is a single parameter here, \(\beta\), which is the rate of arrivals of catastrophe. The PDF may be written as

\[\begin{align} f(\mathbf{t};\beta) = \beta^n\,\prod_{i=1}^n\mathrm{e}^{-\beta t_i} = \beta^n\exp\left[-\beta \sum_{i=1}^nt_i\right] = \beta^n\,\mathrm{e}^{-n\beta\bar{t}}, \end{align}\]

where \(\mathbf{t} = \{t_1, t_2, \ldots, t_n\}\), and \(\bar{t}\) is the arithmetic mean of the measured times to catastrophe.
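As a sketch of how this model generates data and how its log PDF can be evaluated (the rate and sample size are hypothetical, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical rate of arrival of catastrophe (in inverse units of time)
beta = 0.005
n = 150

# t_i ~ Expon(beta) for all i; NumPy parametrizes by the scale, 1/beta
t = rng.exponential(scale=1 / beta, size=n)

# Log of the joint PDF, following the expression above
log_f = n * np.log(beta) - n * beta * np.mean(t)
```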

Catastrophe as a Gamma distribution

Gardner and coworkers modeled catastrophe times as being Gamma distributed. The related story is that catastrophe occurs upon the arrival of the last of \(\alpha\) Poisson processes, all arriving at the same rate. That is to say that several chemical events must happen in succession to trigger catastrophe. Again denoting the rate of arrival of a Poisson process as \(\beta\), the model is then

\[\begin{align} t_i \sim \text{Gamma}(\alpha, \beta) \;\forall i. \end{align}\]

The PDF is

\[\begin{align} f(\mathbf{t};\alpha,\beta) = \left(\frac{1}{\Gamma(\alpha)}\right)^n\prod_{i=1}^n\frac{(\beta t_i)^\alpha}{t_i}\,\mathrm{e}^{-\beta t_i}. \end{align}\]

Note that the \(\alpha = 1\) special case yields the previously defined Exponential model. In general, it is a good idea to have your candidate models be limits or special cases of one another, so that complexity is added systematically as you do your modeling.
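A sketch of sampling from the Gamma model and checking the \(\alpha = 1\) special case against the Exponential distribution, with hypothetical parameter values (note that NumPy and SciPy parametrize the Gamma distribution with a scale parameter, \(1/\beta\)):

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng()

# Hypothetical parameters
alpha = 3.0    # number of successive arrivals required for catastrophe
beta = 0.005   # rate of each arrival

# t_i ~ Gamma(alpha, beta) for all i
t = rng.gamma(alpha, scale=1 / beta, size=150)

# The alpha = 1 special case reduces to the Exponential distribution
t_vals = np.linspace(0.0, 1000.0, 400)
print(np.allclose(st.gamma.pdf(t_vals, 1.0, scale=1 / beta),
                  st.expon.pdf(t_vals, scale=1 / beta)))    # True
```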

Catastrophe as a two-step process

In the Gamma model, we stipulate that each of the \(\alpha\) steps comes from a Poisson process with the same rate. As we will see when we do maximum likelihood estimation, we can fudge this a bit by stipulating that \(\alpha\) need not be an integer. We may instead want to choose a perhaps more realistic model in which the events leading to catastrophe come from Poisson processes with different rates. Though we could derive a probability density function for an arbitrary number of arrivals, we will derive it for two arrivals for simplicity.

Assume the first process arrives at time \(t_1\). Only after that arrival can the second process arrive, so for catastrophe to occur at time \(t\), we must wait an additional time \(t-t_1\) for the second arrival. Because the arrival times of the respective events are independent, the joint probability density function is

\[\begin{align} f(t, t_1) = f(t-t_1)\,f(t_1). \end{align}\]

The waiting times are Exponentially distributed, so we have

\[\begin{align} f(t, t_1) = \beta_2\mathrm{e}^{-\beta_2(t-t_1)}\,\beta_1\mathrm{e}^{-\beta_1 t_1}. \end{align}\]

To get the PDF \(f(t)\), then, we marginalize over all of the values that \(t_1\) can take, which is all values from zero to \(t\).

\[\begin{split}\begin{align} f(t) &= \int_0^t\mathrm{d}t_1\,f(t, t_1) \\[1em] &= \int_0^t \mathrm{d}t_1\,\beta_2\mathrm{e}^{-\beta_2(t-t_1)}\,\beta_1\mathrm{e}^{-\beta_1 t_1} \\[1em] &= \beta_1\beta_2\mathrm{e}^{-\beta_2 t} \int_0^t \mathrm{d}t_1\,\mathrm{e}^{-(\beta_1 - \beta_2)t_1} \\[1em] &= \frac{\beta_1\beta_2\,\mathrm{e}^{-\beta_2 t}}{\beta_1-\beta_2}\left(1 - \mathrm{e}^{-(\beta_1 - \beta_2)t}\right) \\[1em] &= \frac{\beta_1\beta_2}{\beta_2-\beta_1}\left(\mathrm{e}^{-\beta_1 t} - \mathrm{e}^{-\beta_2 t}\right). \end{align}\end{split}\]

Note that in the limit of \(\beta_1 \to \beta_2 \equiv \beta\), this becomes

\[\begin{align} f(t) &= \beta^2\,t\,\mathrm{e}^{-\beta t}, \end{align}\]

which is the PDF of a Gamma distribution with parameter \(\alpha = 2\).
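It is a good habit to check a derivation like this numerically. A quick sketch, with hypothetical rates, comparing the closed-form result to direct numerical marginalization over \(t_1\):

```python
import numpy as np
import scipy.integrate

# Hypothetical rates of the two Poisson processes
beta_1 = 0.01
beta_2 = 0.003

# Time at which to evaluate the PDF
t = 250.0

# Closed-form marginal PDF derived above
f_closed = beta_1 * beta_2 / (beta_2 - beta_1) * (
    np.exp(-beta_1 * t) - np.exp(-beta_2 * t)
)

# Direct numerical marginalization of f(t, t1) over t1 from 0 to t
integrand = lambda t1: beta_2 * np.exp(-beta_2 * (t - t1)) * beta_1 * np.exp(-beta_1 * t1)
f_numerical, _ = scipy.integrate.quad(integrand, 0, t)

print(np.isclose(f_closed, f_numerical))    # True
```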

This model is more difficult to write in shorthand, but it can be done.

\[\begin{split}\begin{align} &t'_i \sim \text{Expon}(\beta_1) \;\forall i,\\[1em] &t_i - t'_i \sim \text{Expon}(\beta_2) \;\forall i. \end{align}\end{split}\]

Note that this construction of the model has a latent variable, \(t_i'\), a random variable that we can define in the model, but we cannot measure.
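Simulating from this model makes the role of the latent variable clear: we draw \(t_i'\) even though we could never measure it, and only the sum is observed. A minimal sketch with hypothetical rates:

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical rates of the two successive Poisson processes
beta_1 = 0.01
beta_2 = 0.003
n = 150

# Latent time of the first arrival (never measured in the experiment)
t_prime = rng.exponential(scale=1 / beta_1, size=n)

# Additional waiting time until the second arrival
dt = rng.exponential(scale=1 / beta_2, size=n)

# Only the total time to catastrophe is observed
t = t_prime + dt
```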

The change in bacterial mass over time

You may be familiar with exponential microbial growth. When you put a single cell in growth media, it divides, and then you have two. Those two cells then grow and divide, giving four cells. This continues, and the number of cells grows exponentially with time.

In an interesting paper (PNAS, 2014), Iyer-Biswas and coworkers addressed the question of whether or not a single cell exhibits exponential growth (not to be confused with the Exponential distribution). That is, right after a division, does the total mass of a cell grow exponentially before it divides again? Even if individual cells grow linearly, growth in bulk will still appear exponential, so we cannot really tell from a bulk growth experiment.

Their clever experimental set-up allows imaging of single dividing cells under conditions that are identical through time. This is accomplished by taking advantage of a unique morphological feature of Caulobacter. The mother cell adheres to a surface through its stalk. Upon division, one of the daughter cells does not have a stalk and is mobile. The system is part of a microfluidic device that provides constant flow, so every time a mother cell divides, the un-stalked daughter cell gets washed away. In this way, the dividing cells are never in a crowded environment and the buffer is always fresh. Using microscopy and image processing, the authors obtained many growth curves, each measuring the area of a single cell in images (assumed to be proportional to cell mass) from just after one division until the next, which they used to assess growth models. The data look like this:

[Interactive Bokeh plot: measured cell area versus time for single growing Caulobacter cells]

We can consider two models for growth of an individual cell, linear growth and exponential growth.

Linear growth

We will start with linear growth; stating that the growth is linear is the cartoon model. More precisely, we model each bacterium as growing at a constant rate, regardless of its mass. We can mathematize our model as

\[\begin{align} a(t) = a^0 + b t, \end{align}\]

where \(a(t)\) is the area of the bacterium over time, and \(t\) is the time since the last cell division. So, we now have our mathematical model. The growth rate is \(b\), and the area immediately after the last cell division is \(a^0\).

For the statistical model, we need to model error in measurement. The idea is that the cell grows according to the above equation, but there will be some natural stochastic variation away from that curve. Furthermore, there are errors in measurement for the area at each time point. (We assume that we can measure the time exactly without error.) Thus, the measured area \(a_i\) for a bacterium at time point \(t_i\) is

\[\begin{align} a_i = a^0 + b t_i + e_i, \end{align}\]

where \(e_i\) is the deviation of the measurement from the mathematical model, called a residual. To complete the statistical model, we need to specify how \(e_i\) is distributed, and also how measurements at different time points are related. We first consider the latter. In time series analysis, the value (in this case the area) at time point \(t_{i+1}\) may be influenced, through some memory process, by the value at time point \(t_i\). Nonetheless, we often model measurements at different time points as i.i.d., connected to those at previous times only through the explicit time dependence in the mathematical model. This is typically a reasonable assumption, as many processes are memoryless.

Given that the measurements are i.i.d., we can model the residual, \(e_i\). This is commonly modeled as Normal with mean zero and some finite variance. If that variance is the same for all time points, the residuals are said to be homoscedastic. If the variance changes over time, we have heteroscedasticity. So, if we assume homoscedastic error, we could write

\[\begin{align} f(e_i;\sigma) = \frac{1}{\sqrt{2\pi \sigma^2}} \mathrm{e}^{-e_i^2/2\sigma^2}\;\forall i. \end{align}\]

We can then write the PDF for the joint distribution of all of the measured data, \((\mathbf{t}, \mathbf{a})\),

\[\begin{align} f(\mathbf{a};\mathbf{t},a^0, b, \sigma) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n(a_i - a^0-bt_i)^2\right]. \end{align}\]

It is convenient to write this in shorthand.

\[\begin{split}\begin{align} &a_i = a^0 + b t_i + e_i \;\forall i,\\[1em] &e_i \sim \text{Norm}(0, \sigma)\;\forall i, \end{align}\end{split}\]

or, equivalently,

\[\begin{align} a_i \sim \text{Norm}(a^0 + b t_i, \sigma)\;\forall i. \end{align}\]
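As with the other models, we can generate synthetic growth curves directly from this shorthand. A minimal sketch with made-up parameter values and time points:

```python
import numpy as np

rng = np.random.default_rng()

# Made-up parameter values for illustration only
a0 = 1.0      # area immediately after division
b = 0.05      # linear growth rate
sigma = 0.02  # standard deviation of the homoscedastic measurement error

# Time points since the last division (assumed measured without error)
t = np.arange(0.0, 40.0, 1.0)

# a_i ~ Norm(a0 + b * t_i, sigma) for all i
a = rng.normal(a0 + b * t, sigma)
```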

Exponential growth

We can use exactly the same logic as above to write the model for exponential growth.

\[\begin{split}\begin{align} &a_i = a^0 \mathrm{e}^{kt_i} + e_i \;\forall i,\\[1em] &e_i \sim \text{Norm}(0, \sigma)\;\forall i, \end{align}\end{split}\]

or, equivalently,

\[\begin{align} a_i \sim \text{Norm}(a^0 \mathrm{e}^{kt_i}, \sigma)\;\forall i. \end{align}\]
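The corresponding simulation sketch differs from the linear-growth one only in the theoretical curve; again, the parameter values are made up.

```python
import numpy as np

rng = np.random.default_rng()

# Made-up parameter values for illustration only
a0 = 1.0      # area immediately after division
k = 0.02      # exponential growth rate
sigma = 0.02  # standard deviation of the homoscedastic measurement error

# Time points since the last division (assumed measured without error)
t = np.arange(0.0, 40.0, 1.0)

# a_i ~ Norm(a0 * exp(k * t_i), sigma) for all i
a = rng.normal(a0 * np.exp(k * t), sigma)
```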

Important notes on generative modeling

In the examples presented here, we used our best scientific and statistical insight to put forward generative models. The model for linear growth of bacteria is in some sense “standard,” in that it leads to linear regression, a widely used statistical tool. Nonetheless, your modeling should be bespoke: you should choose models that are appropriate for the experiment and data you are analyzing.