Plug-in estimates

When we make measurements, we are observing samples out of an underlying, typically unknown, generative distribution. Take as an example measurements of the count of RNA transcripts of a given gene in individual cells. We are likely to observe counts where the probability mass function (PMF) of the generative distribution is high and unlikely to observe counts where the PMF of the generative distribution is low.

If we have a full understanding of the generative distribution, we have learned how the data were generated, and thereby have an understanding of the physical, chemical, or biological phenomena we are studying. Statistical inference involves deducing properties of these (unknown) generative distributions.

In this lecture, we will start with nonparametric inference, which is statistical inference where no model is assumed; conclusions are drawn from the data alone. The approach we will take is heavily inspired by Allen Downey’s wonderful book Think Stats and by Larry Wasserman’s book All of Statistics.

The plug-in principle

Let’s first think about how to get an estimate for a parameter value, given the data. While what we are about to do is general, for now it is useful to have in your mind a concrete example. Imagine we have a data set that is a set of repeated measurements, such as the repeated measurements of lengths of eggs laid by C. elegans worms of a given genotype.

We could have a generative model in mind, and we will do so in coming lessons. For now, though, we will assume we only know that there is a generative distribution, but nothing about what it may be. Let \(F(x)\) be the cumulative distribution function (CDF) for the generative distribution.

A statistical functional is a functional of the CDF, \(T(F)\). A parameter \(\theta\) of a probability distribution can be defined from a functional, \(\theta = T(F)\). For example, the mean, variance, and median are all statistical functionals.

\[\begin{split}\begin{aligned} &\text{mean} \equiv \mu = \int_{-\infty}^\infty \mathrm{d}x\,x\,f(x) = \int_{-\infty}^\infty \mathrm{d}F(x)\,x, \\[1em] &\text{variance} \equiv \sigma^2 = \int_{-\infty}^\infty \mathrm{d}x\,(x-\mu)^2\,f(x) = \int_{-\infty}^\infty \mathrm{d}F(x)\,(x-\mu)^2, \\[1em] &\text{median} \equiv m = F^{-1}(1/2).\end{aligned}\end{split}\]
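As a quick illustrative sketch (not part of the lecture’s required code), we can evaluate these functionals numerically for a known distribution and compare to the known values. The choice of an Exponential distribution and the use of SciPy here are assumptions made purely for illustration.

```python
import numpy as np
import scipy.integrate
import scipy.stats

# Arbitrary known distribution for illustration: Exponential with scale β
beta = 2.0
dist = scipy.stats.expon(scale=beta)

# Mean as a functional of the distribution: ∫ dx x f(x)
mean, _ = scipy.integrate.quad(lambda x: x * dist.pdf(x), 0, np.inf)

# Variance as a functional: ∫ dx (x - μ)² f(x)
var, _ = scipy.integrate.quad(lambda x: (x - mean) ** 2 * dist.pdf(x), 0, np.inf)

# Median as the inverse CDF evaluated at 1/2
median = dist.ppf(0.5)

print(mean, var, median)  # ≈ 2.0, 4.0, 2 ln 2 ≈ 1.386
```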

Now, say we made a set of \(n\) measurements, \(\{x_1, x_2, \ldots x_n\}\). You can think of this as a set of C. elegans egg lengths if you want to have an example in your mind. We define the empirical cumulative distribution function, \(\hat{F}(x)\) from our data as

\[\begin{aligned} \hat{F}(x) = \frac{1}{n}\sum_{i=1}^n I(x_i \le x), \end{aligned}\]

with

\[\begin{split}\begin{aligned} I(x_i \le x) = \left\{ \begin{array}{ccl} 1 && x_i \le x \\[0.5em] 0 && x_i > x. \end{array} \right. \end{aligned}\end{split}\]
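As a minimal sketch of this definition in code (the function name `ecdf` and the data values are made up for illustration), the ECDF evaluated at \(x\) is just the fraction of data points less than or equal to \(x\):

```python
import numpy as np

def ecdf(x, data):
    """Value of the ECDF built from `data`, evaluated at point(s) `x`."""
    data = np.sort(data)
    # For each x, count how many data points are ≤ x and divide by n
    return np.searchsorted(data, x, side="right") / len(data)

# Example with made-up measurements
data = np.array([1.2, 3.4, 2.2, 2.9, 1.8])
ecdf(2.5, data)  # 0.6, since 3 of the 5 points are ≤ 2.5
```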

We have already seen this form of the ECDF when we were studying exploratory data analysis. Remember that the probability density function (PDF), \(f(x)\), is related to the CDF by

\[\begin{aligned} f(x) = \frac{\mathrm{d}F}{\mathrm{d}x}. \end{aligned}\]

We can then differentiate the ECDF to get the empirical density function, \(\hat{f}(x)\) as

\[\begin{aligned} \hat{f}(x) = \frac{1}{n}\sum_{i=1}^n \delta(x - x_i), \end{aligned}\]

where \(\delta(x)\) is the Dirac delta function.

With the ECDF (and empirical density function), we have now defined an empirical distribution that is dependent only on the data. We now define a plug-in estimate of a parameter \(\theta\) as

\[\begin{aligned} \hat{\theta} = T(\hat{F}). \end{aligned}\]

In other words, to get a plug-in estimate of a parameter \(\theta\), we need only to compute the functional using the empirical distribution. That is, we simply “plug in” the empirical CDF for the actual CDF.

The plug-in estimate for the median is easy to calculate.

\[\begin{aligned} \hat{m} = \hat{F}^{-1}(1/2), \end{aligned}\]

or the middle-ranked data point. The plug-in estimates for the mean and variance seem at face value to be a bit more difficult to calculate, but the following general result will help. Consider a functional of the form of an expectation value of a function \(r(x)\).

\[\begin{split}\begin{aligned} \int\mathrm{d}\hat{F}(x)\,r(x) &= \int \mathrm{d}x \,r(x)\, \hat{f}(x) = \int \mathrm{d}x\, r(x) \left[\frac{1}{n}\sum_{i=1}^n\delta(x - x_i)\right] \nonumber \\[1em] &= \frac{1}{n}\sum_{i=1}^n\int \mathrm{d}x \,r(x) \delta(x-x_i) = \frac{1}{n}\sum_{i=1}^n r(x_i). \end{aligned}\end{split}\]

A functional of this form is called a linear statistical functional. The result above means that the plug-in estimate for a linear statistical functional is the arithmetic mean of \(r(x)\) evaluated at the measured data points. The plug-in estimate of the mean, which has \(r(x) = x\), is

\[\begin{aligned} \hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i \equiv \bar{x}, \end{aligned}\]

where we have defined \(\bar{x}\) as the traditional sample mean (the arithmetic mean of the measured data), which we have just shown is the plug-in estimate. This plug-in estimate is implemented in the np.mean() function. The plug-in estimate for the variance is

\[\begin{aligned} \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^n x_i^2 - \bar{x}^2. \end{aligned}\]

This plug-in estimate is implemented in the np.var() function.
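As a brief sketch of computing these plug-in estimates with NumPy (the data below are made up for illustration), note that np.var() divides by \(n\) by default (ddof=0), which is exactly the plug-in estimate above. The median is also available via np.median(), which for an even number of data points averages the two middle-ranked values.

```python
import numpy as np

# Made-up repeated measurements (e.g., egg lengths in µm)
x = np.array([52.1, 55.3, 50.8, 53.9, 54.2, 51.7])

mean_hat = np.mean(x)      # plug-in estimate of the mean
var_hat = np.var(x)        # plug-in estimate of the variance (ddof=0 by default)
median_hat = np.median(x)  # plug-in estimate of the median

# Check against the explicit formulas
np.isclose(mean_hat, np.sum(x) / len(x))
np.isclose(var_hat, np.sum((x - mean_hat) ** 2) / len(x))
```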

Note that we are denoting the mean and variance as \(\mu\) and \(\sigma^2\), but these are not in general the parameters with the same common name and symbols from a Normal distribution. Any distribution has a first moment (called a mean) and a second central moment (called a variance), unless they do not exist, as is the case, e.g., with a Cauchy distribution. In this context, we denote by \(\mu\) and \(\sigma^2\) the mean and variance of the unknown underlying univariate generative distribution.

We can compute plug-in estimates for more complicated parameters as well. For example, for a bivariate distribution, the correlation between the two variables \(x\) and \(y\) is defined as

\[\begin{aligned} \rho = \frac{\left\langle (x-\mu_x)(y-\mu_y)\right\rangle}{\sigma_x \sigma_y}, \end{aligned}\]

where the expectation in the numerator is called the covariance between \(x\) and \(y\). It is of large magnitude if \(x\) and \(y\) vary together and close to zero if they are nearly independent of each other. The plug-in estimate for the correlation is

\[\begin{aligned} \hat{\rho} = \frac{\sum_i(x_i - \bar{x})(y_i-\bar{y})}{\sqrt{\left(\sum_i(x_i-\bar{x})^2\right)\left(\sum_i(y_i-\bar{y})^2\right)}}. \end{aligned}\]
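As a sketch (the paired data below are made up), this plug-in estimate can be computed directly from the formula, or with np.corrcoef(), which returns the matrix of pairwise correlation coefficients; the two give the same value since the factors of \(n\) cancel.

```python
import numpy as np

# Made-up paired measurements
x = np.array([1.1, 2.3, 2.9, 4.2, 5.0])
y = np.array([0.9, 2.0, 3.5, 3.9, 5.2])

# Plug-in estimate computed directly from the formula
rho_hat = np.sum((x - np.mean(x)) * (y - np.mean(y))) / np.sqrt(
    np.sum((x - np.mean(x)) ** 2) * np.sum((y - np.mean(y)) ** 2)
)

# Same value from NumPy's built-in, which returns the 2×2 correlation matrix
rho_hat_np = np.corrcoef(x, y)[0, 1]
```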