Exercise 2.2: Conjugacy and the Beta-Binomial conjugate pair
[2]:
import numpy as np
import scipy.stats as st
import bokeh.plotting
import bokeh.io
bokeh.io.output_notebook()
We have talked about Bayes's theorem as a model for learning. The idea there was that we know something before (a priori) acquiring data, and then we update our knowledge after (a posteriori). So, we come in with the prior and go out with the posterior after acquiring data. It might make sense, then, that the prior and the posterior distributions have the same functional form. That is, the prior and the posterior are distributions from the same family, and the parameters of the distribution are updated from the prior to the posterior by the likelihood. When the prior and posterior have the same functional form, the prior is said to be conjugate to the likelihood. This seems pleasing: the likelihood serves to update the prior into the posterior, so it should determine the functional form of the prior/posterior such that they are the same.
Conjugate priors are especially useful because they enable analytical calculation of the posterior, typically without having to do any mathematics yourself. This is perhaps best seen through example.
As a motivating example of the use of a conjugate prior, we will use data from one of the experiments in a paper by Mossman et al., 2019, in which the authors investigated the effect of the age of Drosophila parents on the viability of their offspring's eggs. In one vial, they mated young (≤ 3 days old) males with old (≥ 45 days old) females. Of the 176 eggs laid by their offspring over a 10-day period, 94 hatched, while the remainder failed to hatch. In another vial, they mated young males with young females. Of the 190 eggs laid by their offspring over a 10-day period, 154 hatched. (They took many more data than these, but we’re using this experiment as a demonstration of conjugacy.)
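For convenience in the parts that follow, we can store the hatch counts in the notebook. The dictionary below (the variable names are just one choice) maps each vial to its (hatched, total) counts.
[ ]:
# Hatch counts from the two vials: (n hatched, N total eggs)
data = {
    "old mothers": (94, 176),     # young fathers × old mothers
    "young mothers": (154, 190),  # young fathers × young mothers
}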
a) I contend that the number of eggs that hatched, \(n\), out of \(N\) total eggs can be modeled with a Binomial distribution;
\begin{align} n \mid N, \theta \sim \text{Binom}(N, \theta). \end{align}
Provide an argument as to why this is a good model.
b) Writing out the PMF of the Binomial distribution, we have
\begin{align} f(n\mid N, \theta) = \begin{pmatrix}N\\n\end{pmatrix} \theta^n(1-\theta)^{N-n}. \end{align}
This identifies a single parameter to estimate, \(\theta\), which is the probability of hatching. Our goal, then, is to understand the posterior distribution \(g(\theta \mid n, N)\). Naturally, we need to specify a prior to complete our model. In this part of the problem, your task is to prove that the Beta distribution is conjugate to the Binomial likelihood. Its PDF is
\begin{align} g(\theta \mid \alpha, \beta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}, \end{align}
where \(B(\alpha,\beta)\) is the Beta function,
\begin{align} B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}. \end{align}
Note that there are two parameters, \(\alpha\) and \(\beta\), that parametrize the prior. If the Beta distribution is in fact conjugate to the Binomial, it is these two parameters that get updated by the data.
You should explore the Beta distribution in the Distribution Explorer to get a feel for its shape and how the parameters \(\alpha\) and \(\beta\) affect its shape. Importantly, if \(\alpha = \beta = 1\), we get a Uniform distribution on the interval [0, 1].
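If you prefer to explore in the notebook instead, a quick sketch like the one below (the \((\alpha, \beta)\) pairs and plot styling are arbitrary choices) overlays the Beta PDF for a few parameter values; note that \(\alpha = \beta = 1\) reproduces the Uniform distribution.
[ ]:
# Overlay the Beta PDF for a few (α, β) pairs
theta = np.linspace(0, 1, 400)

p = bokeh.plotting.figure(
    width=450,
    height=300,
    x_axis_label="θ",
    y_axis_label="g(θ | α, β)",
)

for a, b, color in [(1, 1, "gray"), (2, 5, "orange"), (10, 10, "steelblue")]:
    p.line(
        theta,
        st.beta.pdf(theta, a, b),
        line_width=2,
        color=color,
        legend_label=f"α = {a}, β = {b}",
    )

bokeh.io.show(p)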
Now, prove that if we have
\begin{align} \theta \mid \alpha, \beta \sim \text{Beta}(\alpha, \beta), \\[1em] n \mid N, \theta \sim \text{Binom}(N, \theta), \end{align}
that the posterior is also Beta distributed;
\begin{align} \theta \mid n, N, \alpha, \beta \sim \text{Beta}(\alpha_\mathrm{post}, \beta_\mathrm{post}). \end{align}
In so doing, derive expressions for the updated parameters of the Beta distribution, \(\alpha_\mathrm{post}\) and \(\beta_\mathrm{post}\).
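If you want a check on your algebra, the heart of the calculation is the proportionality (dropping the normalization constant, which is fixed by the requirement that the posterior integrate to one)
\begin{align} g(\theta \mid n, N, \alpha, \beta) \propto f(n\mid N, \theta)\,g(\theta\mid \alpha, \beta) \propto \theta^{n+\alpha-1}(1-\theta)^{N-n+\beta-1}, \end{align}
from which \(\alpha_\mathrm{post}\) and \(\beta_\mathrm{post}\) can be read off by comparison with the Beta PDF above.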
c) You now have the whole posterior analytically, and you can plot it! Plot the posterior distributions for \(\theta\) for both young mothers and old mothers.
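Below is a minimal plotting sketch. It assumes the update rule you derive in part (b) comes out to \(\alpha_\mathrm{post} = \alpha + n\) and \(\beta_\mathrm{post} = \beta + N - n\), takes a uniform prior (\(\alpha = \beta = 1\)), and reuses the data dictionary defined above; the colors and figure dimensions are arbitrary.
[ ]:
# Posterior for θ under a Uniform (α = β = 1) prior, one curve per vial
theta = np.linspace(0, 1, 400)
alpha, beta = 1, 1

p = bokeh.plotting.figure(
    width=450,
    height=300,
    x_axis_label="θ",
    y_axis_label="g(θ | n, N)",
)

for (label, (n, N)), color in zip(data.items(), ("orange", "steelblue")):
    # Conjugate update: α_post = α + n, β_post = β + N - n
    p.line(
        theta,
        st.beta.pdf(theta, alpha + n, beta + N - n),
        line_width=2,
        color=color,
        legend_label=label,
    )

bokeh.io.show(p)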
d) The maximum likelihood estimate for \(\theta\), as we worked out in part 1 of the workshop, is \(\theta_{\mathrm{MLE}} = n / N\). Show that the maximum a posteriori estimate for \(\theta\), \(\theta_\mathrm{MAP}\), is also \(n/N\) for a uniform prior (\(\alpha = \beta = 1\)), but is not in general given by \(n/N\); any prior information you encode does influence the posterior.
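As a numerical sanity check (the non-uniform prior below, \(\alpha = \beta = 10\), is an arbitrary choice), you can compare \(n/N\) with the mode of the posterior, using the fact that a Beta distribution with both parameters greater than one has its mode at \((\alpha - 1)/(\alpha + \beta - 2)\).
[ ]:
# Compare the MLE n/N with the MAP for a uniform and a non-uniform prior
n, N = 94, 176  # old-mother vial

for alpha, beta in [(1, 1), (10, 10)]:
    alpha_post = alpha + n
    beta_post = beta + N - n

    # Mode of a Beta(α_post, β_post) distribution (valid for parameters > 1)
    theta_map = (alpha_post - 1) / (alpha_post + beta_post - 2)

    print(f"α = {alpha}, β = {beta}:   MLE = {n/N:.4f},   MAP = {theta_map:.4f}")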