Lesson 39: Performing regressions

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.



In [1]:
import numpy as np
import pandas as pd

# We'll use scipy.optimize.curve_fit to do the nonlinear regression
import scipy.optimize

import altair as alt

Bicoid and its gradient

In this tutorial, we will perform a nonlinear regression on data from a classic paper by Driever and Nüsslein-Volhard (Cell, 54, 95–104, 1988). In this paper, the authors identified Bicoid as the first morphogen, a chemical substance that controls cell fate in a concentration-dependent manner. In the context of Drosophila embryogenesis, Bicoid (Bcd) exhibits a concentration gradient, with a high concentration in the anterior region of the embryo and a low concentration in the posterior region. Below is an image of the Bicoid gradient from an immunostaining experiment in their paper.

Driever and Nüsslein-Volhard quantified this gradient by measuring the darkness of the immunostain as a function of position along the anterior-posterior (AP) axis of the embryo. Our goal in this lesson is to perform a regression to get the characteristic length of the Bcd gradient. Specifically, based on physical modeling that we will not go into here, we expect the Bcd gradient to be exponential, or

\begin{align} c(x) = c_0 \mathrm{e}^{-x/\lambda}, \end{align}

where $c(x)$ is the Bcd concentration at position $x$, with $x = 0$ at the anterior and $x = 1$ at the posterior. I.e., $x$ is in units of the total AP axis length. The Bcd concentration at the anterior is $c_0$, and the decay length, or characteristic length, of the gradient is $\lambda$. We assume that the intensity of the immunostain is

\begin{align} I = a + b c, \end{align}

where $a$ is the background signal and $b$ is a constant of proportionality; that is, the immunostain intensity varies linearly with Bcd concentration. Thus, the Bcd gradient, as visualized by immunostaining, is

\begin{align} I(x) = a + I_0\mathrm{e}^{-x/\lambda}, \end{align}

where $I_0 = b c_0$ is the immunostain intensity above background at the anterior.

The data set

We will use the original data from Driever and Nüsslein-Volhard, available in data/bcd_gradient.csv. We will first look at the file to see how we should import it.

In [2]:
!head data/bcd_gradient.csv
# Data taken from Fig. 3A of Driever and Nuesslein-Volhard, Cell,
# 54, 95-104, 1988.  This is the normalized immunostain intensity
# as a function of distance along the A-P axis in a wild type
# embryo.
fractional distance from anterior,[bcd] (a.u.)
0.0026446672455988007,0.86309692805632854
0.034222783610957611,0.86309692805632854
0.068445567221915221,0.74572710237919493
0.10262887818741613,0.65487479416273942
0.13420699455277493,0.54511384929873385

We see that this is a CSV file with two columns, comments preceded by #, and a header row. So, we know how to use pd.read_csv to import the data.

In [3]:
# Import data set
df = pd.read_csv('data/bcd_gradient.csv', comment='#')

# Inspect DataFrame
df.head()
Out[3]:
   fractional distance from anterior  [bcd] (a.u.)
0                           0.002645      0.863097
1                           0.034223      0.863097
2                           0.068446      0.745727
3                           0.102629      0.654875
4                           0.134207      0.545114

Because Altair does not like brackets and periods in column names, let's rename the Bcd concentration column.

In [4]:
df = df.rename(columns={'[bcd] (a.u.)': 'bcd'})

Let's plot the data to see what we are dealing with.

In [5]:
dots = alt.Chart(df
    ).mark_point(
    ).encode(
        x='fractional distance from anterior:Q',
        y=alt.Y('bcd:Q', title='[Bcd] (a.u.)', scale=alt.Scale(domain=[0, 1])))

dots
Out[5]:

Using scipy.optimize.curve_fit() to perform nonlinear regression

We will use scipy.optimize.curve_fit() to perform the regression. Since this is our first time using it, let's read the docstring to figure out how to use it.

In [6]:
scipy.optimize.curve_fit?
Signature: scipy.optimize.curve_fit(f, xdata, ydata, p0=None, sigma=None, absolute_sigma=False, check_finite=True, bounds=(-inf, inf), method=None, jac=None, **kwargs)
Docstring:
Use non-linear least squares to fit a function, f, to data.

Assumes ``ydata = f(xdata, *params) + eps``

Parameters
----------
f : callable
    The model function, f(x, ...).  It must take the independent
    variable as the first argument and the parameters to fit as
    separate remaining arguments.
xdata : An M-length sequence or an (k,M)-shaped array for functions with k predictors
    The independent variable where the data is measured.
ydata : M-length sequence
    The dependent data --- nominally f(xdata, ...)
p0 : None, scalar, or N-length sequence, optional
    Initial guess for the parameters.  If None, then the initial
    values will all be 1 (if the number of parameters for the function
    can be determined using introspection, otherwise a ValueError
    is raised).
sigma : None or M-length sequence or MxM array, optional
    Determines the uncertainty in `ydata`. If we define residuals as
    ``r = ydata - f(xdata, *popt)``, then the interpretation of `sigma`
    depends on its number of dimensions:

        - A 1-d `sigma` should contain values of standard deviations of
          errors in `ydata`. In this case, the optimized function is
          ``chisq = sum((r / sigma) ** 2)``.

        - A 2-d `sigma` should contain the covariance matrix of
          errors in `ydata`. In this case, the optimized function is
          ``chisq = r.T @ inv(sigma) @ r``.

          .. versionadded:: 0.19

    None (default) is equivalent of 1-d `sigma` filled with ones.
absolute_sigma : bool, optional
    If True, `sigma` is used in an absolute sense and the estimated parameter
    covariance `pcov` reflects these absolute values.

    If False, only the relative magnitudes of the `sigma` values matter.
    The returned parameter covariance matrix `pcov` is based on scaling
    `sigma` by a constant factor. This constant is set by demanding that the
    reduced `chisq` for the optimal parameters `popt` when using the
    *scaled* `sigma` equals unity. In other words, `sigma` is scaled to
    match the sample variance of the residuals after the fit.
    Mathematically,
    ``pcov(absolute_sigma=False) = pcov(absolute_sigma=True) * chisq(popt)/(M-N)``
check_finite : bool, optional
    If True, check that the input arrays do not contain nans of infs,
    and raise a ValueError if they do. Setting this parameter to
    False may silently produce nonsensical results if the input arrays
    do contain nans. Default is True.
bounds : 2-tuple of array_like, optional
    Lower and upper bounds on parameters. Defaults to no bounds.
    Each element of the tuple must be either an array with the length equal
    to the number of parameters, or a scalar (in which case the bound is
    taken to be the same for all parameters.) Use ``np.inf`` with an
    appropriate sign to disable bounds on all or some parameters.

    .. versionadded:: 0.17
method : {'lm', 'trf', 'dogbox'}, optional
    Method to use for optimization.  See `least_squares` for more details.
    Default is 'lm' for unconstrained problems and 'trf' if `bounds` are
    provided. The method 'lm' won't work when the number of observations
    is less than the number of variables, use 'trf' or 'dogbox' in this
    case.

    .. versionadded:: 0.17
jac : callable, string or None, optional
    Function with signature ``jac(x, ...)`` which computes the Jacobian
    matrix of the model function with respect to parameters as a dense
    array_like structure. It will be scaled according to provided `sigma`.
    If None (default), the Jacobian will be estimated numerically.
    String keywords for 'trf' and 'dogbox' methods can be used to select
    a finite difference scheme, see `least_squares`.

    .. versionadded:: 0.18
kwargs
    Keyword arguments passed to `leastsq` for ``method='lm'`` or
    `least_squares` otherwise.

Returns
-------
popt : array
    Optimal values for the parameters so that the sum of the squared
    residuals of ``f(xdata, *popt) - ydata`` is minimized
pcov : 2d array
    The estimated covariance of popt. The diagonals provide the variance
    of the parameter estimate. To compute one standard deviation errors
    on the parameters use ``perr = np.sqrt(np.diag(pcov))``.

    How the `sigma` parameter affects the estimated covariance
    depends on `absolute_sigma` argument, as described above.

    If the Jacobian matrix at the solution doesn't have a full rank, then
    'lm' method returns a matrix filled with ``np.inf``, on the other hand
    'trf'  and 'dogbox' methods use Moore-Penrose pseudoinverse to compute
    the covariance matrix.

Raises
------
ValueError
    if either `ydata` or `xdata` contain NaNs, or if incompatible options
    are used.

RuntimeError
    if the least-squares minimization fails.

OptimizeWarning
    if covariance of the parameters can not be estimated.

See Also
--------
least_squares : Minimize the sum of squares of nonlinear functions.
scipy.stats.linregress : Calculate a linear least squares regression for
                         two sets of measurements.

Notes
-----
With ``method='lm'``, the algorithm uses the Levenberg-Marquardt algorithm
through `leastsq`. Note that this algorithm can only deal with
unconstrained problems.

Box constraints can be handled by methods 'trf' and 'dogbox'. Refer to
the docstring of `least_squares` for more information.

Examples
--------
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from scipy.optimize import curve_fit

>>> def func(x, a, b, c):
...     return a * np.exp(-b * x) + c

Define the data to be fit with some noise:

>>> xdata = np.linspace(0, 4, 50)
>>> y = func(xdata, 2.5, 1.3, 0.5)
>>> np.random.seed(1729)
>>> y_noise = 0.2 * np.random.normal(size=xdata.size)
>>> ydata = y + y_noise
>>> plt.plot(xdata, ydata, 'b-', label='data')

Fit for the parameters a, b, c of the function `func`:

>>> popt, pcov = curve_fit(func, xdata, ydata)
>>> popt
array([ 2.55423706,  1.35190947,  0.47450618])
>>> plt.plot(xdata, func(xdata, *popt), 'r-',
...          label='fit: a=%5.3f, b=%5.3f, c=%5.3f' % tuple(popt))

Constrain the optimization to the region of ``0 <= a <= 3``,
``0 <= b <= 1`` and ``0 <= c <= 0.5``:

>>> popt, pcov = curve_fit(func, xdata, ydata, bounds=(0, [3., 1., 0.5]))
>>> popt
array([ 2.43708906,  1.        ,  0.35015434])
>>> plt.plot(xdata, func(xdata, *popt), 'g--',
...          label='fit: a=%5.3f, b=%5.3f, c=%5.3f' % tuple(popt))

>>> plt.xlabel('x')
>>> plt.ylabel('y')
>>> plt.legend()
>>> plt.show()
File:      ~/anaconda3/lib/python3.6/site-packages/scipy/optimize/minpack.py
Type:      function

We see that the function prototype is

scipy.optimize.curve_fit(f, xdata, ydata, p0=None)

and that f is the function we wish to use to fit our data. The function f must have the call signature f(x, *p), where the first argument is the independent variable and the remaining arguments are the parameters to be determined by the regression.
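
For example, a hypothetical toy function with this call signature (just a sketch to illustrate the convention; it is not the model we will fit) might look like:

def toy_linear_model(x, slope, intercept):
    """Toy model: independent variable first, then the parameters to fit."""
    return slope * x + intercept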

Defining the model function

So, our first step for using scipy.optimize.curve_fit() is to define the fit function f, which we will call bcd_gradient_model. We have three parameters, $I_0$, $a$, and $\lambda$.

In [7]:
def bcd_gradient_model_first_try(x, I_0, a, lam):
    """Model for Bcd gradient: exponential decay plus background"""
    return a + I_0 * np.exp(-x/lam)

Notice that we used lam as the name of the parameter $\lambda$. This is because lambda is a keyword in Python. Never name a variable lambda. This is so important, I'm going to make the point more fervently, even though we already made it in a previous lesson.

**Never** name a Python variable the same as a keyword!


To see what the keywords are, do this: `import keyword; print(keyword.kwlist)`
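
As a quick check, we can use the standard library keyword module to confirm this (a minimal sketch):

import keyword

# Print all Python keywords and check 'lambda' specifically
print(keyword.kwlist)
print(keyword.iskeyword('lambda'))    # True, so never use it as a variable name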

There is a problem with our function, though. It is really only defined if all of its arguments are positive. We should therefore check the arguments and raise an error if any are negative.

In [8]:
def bcd_gradient_model(x, I_0, a, lam):
    """Model for Bcd gradient: exponential decay plus background"""
    
    if np.any(np.array(x) < 0):
        raise RuntimeError('All values of `x` must be >= zero.')
    if np.any(np.array([I_0, a, lam]) < 0):
        raise RuntimeError('All parameters must be >= 0.')
    
    return a + I_0 * np.exp(-x / lam)

Initial guess

With our fit function in place, we now need to supply initial guesses for the parameter values, given by the kwarg p0. (We don't have to do this, but scipy.optimize.curve_fit() will guess a value of 1 for all parameters, which is generally not a good idea. You should always explicitly supply your own initial guesses.) In looking at the plot, we see that we indeed have a nonzero background signal, somewhere around $a \approx 0.2$. We also see that $I_0 \approx 0.9$ and $\lambda \approx 0.3$. We would normally use these as our approximate guesses, but to show an additional lesson, we will guess $\lambda = 1$, making the solver do a little more work.

In [9]:
# Specify initial guess
I_0_guess = 0.9
a_guess = 0.2
lam_guess = 1.0

# Construct initial guess array
p0 = np.array([I_0_guess, a_guess, lam_guess])

Performing the regression (first try)

scipy.optimize.curve_fit() returns the optimal parameter values as well as an estimate of the covariance matrix. For reasons I will not discuss here, this covariance matrix relies on some assumptions under the hood that are not always appropriate, so we will just ignore it. Now we're ready to do the regression!

In [10]:
# Do curve fit, but dump covariance into dummy variable
p, _ = scipy.optimize.curve_fit(bcd_gradient_model,
                                df['fractional distance from anterior'],
                                df['bcd'],
                                p0=p0)

# Print the results
print("""
I_0 = {0:.2f}
  a = {1:.2f}
  λ = {2:.2f}
""".format(*tuple(p)))
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-10-7e6f5b0c28c6> in <module>()
      3                                 df['fractional distance from anterior'],
      4                                 df['bcd'],
----> 5                                 p0=p0)
      6 
      7 # Print the results

~/anaconda3/lib/python3.6/site-packages/scipy/optimize/minpack.py in curve_fit(f, xdata, ydata, p0, sigma, absolute_sigma, check_finite, bounds, method, jac, **kwargs)
    749         # Remove full_output from kwargs, otherwise we're passing it in twice.
    750         return_full = kwargs.pop('full_output', False)
--> 751         res = leastsq(func, p0, Dfun=jac, full_output=1, **kwargs)
    752         popt, pcov, infodict, errmsg, ier = res
    753         cost = np.sum(infodict['fvec'] ** 2)

~/anaconda3/lib/python3.6/site-packages/scipy/optimize/minpack.py in leastsq(func, x0, args, Dfun, full_output, col_deriv, ftol, xtol, gtol, maxfev, epsfcn, factor, diag)
    392         with _MINPACK_LOCK:
    393             retval = _minpack._lmdif(func, x0, args, full_output, ftol, xtol,
--> 394                                      gtol, maxfev, epsfcn, factor, diag)
    395     else:
    396         if col_deriv:

~/anaconda3/lib/python3.6/site-packages/scipy/optimize/minpack.py in func_wrapped(params)
    461     if transform is None:
    462         def func_wrapped(params):
--> 463             return func(xdata, *params) - ydata
    464     elif transform.ndim == 1:
    465         def func_wrapped(params):

<ipython-input-8-150434906ddb> in bcd_gradient_model(x, I_0, a, lam)
      5         raise RuntimeError('All values of `x` must be >= zero.')
      6     if np.any(np.array([I_0, a, lam]) < 0):
----> 7         raise RuntimeError('All parameters must be >= 0.')
      8 
      9     return a + I_0 * np.exp(-x / lam)

RuntimeError: All parameters must be >= 0.

Oh no! We got an exception because negative parameter values were encountered! This occurred because, under the hood, scipy.optimize.curve_fit() tries many sets of parameter values as it searches for those that bring the theoretical curve closest to the observed data. (It is way more complicated than that, but we won't get into that here.)
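
If you want to see the parameter values the solver tries, one option (not part of this lesson's analysis) is a hypothetical verbose version of the model that echoes each trial instead of raising an exception:

def bcd_gradient_model_verbose(x, I_0, a, lam):
    """Like bcd_gradient_model, but prints each parameter set the solver tries."""
    print('trying: I_0 =', I_0, ', a =', a, ', lam =', lam)
    return a + I_0 * np.exp(-x / lam)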

Performing the regression (second try)

At this point, we have a few options.

  1. Take out our error checking on the positivity of the parameters. The resulting curve with negative values of I_0, a, or lam will be so far off that the curve fit routine should come back toward physical parameter values after an excursion into non-physicality.
  2. Try using something other than scipy.optimize.curve_fit().
  3. Adjust our theoretical function by using the logarithm of the parameter values instead of the parameter values themselves. This ensures that all parameters are positive.

I think option 1 is off the table. We generally want to avoid "shoulds" when programming. We do not want to just hope that the solver will work when nonphysical parameter values are encountered.

Depending on how many times I will need to do a particular type of curve fit, I typically prefer option 2. We will not do this in the bootcamp, but I often use a technique called Markov chain Monte Carlo.

We'll use option 3. This still allows the solver to consider a smooth function of the parameter values. So, let's define a function that takes the logarithms of the parameters as input.

In [11]:
def bcd_gradient_model_log_params(x, log_I_0, log_a, log_lam):
    """
    Model for Bcd gradient: exponential decay plus 
    background with log parameters.
    """
    
    # Exponentiate parameters
    I_0, a, lam = np.exp(np.array([log_I_0, log_a, log_lam]))
    
    return bcd_gradient_model(x, I_0, a, lam)

Now let's try our curve fit again. We need to make sure we convert our initial guesses into logarithms. Remember, now the solver will be working with logarithms of parameters.

In [12]:
# Construct initial guess array
log_p0 = np.log(p0)

# Do curve fit, but dump covariance into dummy variable
log_p, _ = scipy.optimize.curve_fit(bcd_gradient_model_log_params, 
                                    df['fractional distance from anterior'], 
                                    df['bcd'], 
                                    p0=log_p0)

# Get the optimal parameter values
p = np.exp(log_p)

# Print the results
print("""
I_0 = {0:.2f}
  a = {1:.2f}
  λ = {2:.2f}
""".format(*tuple(p)))
I_0 = 0.77
  a = 0.17
  λ = 0.19
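
Aside: we ignore the covariance matrix in this lesson, but for reference, the docstring says its diagonal gives the variances of the parameter estimates. A minimal sketch of how one would extract standard errors, keeping in mind that the solver worked with the log parameters:

# For reference only; we do not use these error estimates in this lesson.
log_p, log_p_cov = scipy.optimize.curve_fit(bcd_gradient_model_log_params,
                                            df['fractional distance from anterior'],
                                            df['bcd'],
                                            p0=np.log(p0))

# One-standard-deviation errors on the *log* parameters
log_p_err = np.sqrt(np.diag(log_p_cov))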

Plotting the result

We can now generate a smooth curve defined by the optimal parameters we just found and plot it along with the data. To do this, we make a densely sampled array of $x$ values and then compute the fitting function for these values.

In [13]:
# Smooth x values (400 values between zero and one)
x_smooth = np.linspace(0, 1, 400)

# Compute smooth curve
bcd_smooth = bcd_gradient_model(x_smooth, *p)

# Make into a data frame
df_smooth = pd.DataFrame({'x': x_smooth, 'bcd': bcd_smooth})

# Smooth line
line = alt.Chart(df_smooth
    ).mark_line(
        color='darkgray',
    ).encode(
        x='x',
        y='bcd')

line + dots
Out[13]:

So, we have determined the length scale of the Bcd gradient: about 20% of the total embryo length.
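
As a quick numerical check with the fitted parameters stored in p (a small sketch, not part of the original analysis):

# Unpack the fitted parameters
I_0, a, lam = p

# The signal above background decays to 1/e of its anterior value at x = λ,
# so λ ≈ 0.19 corresponds to roughly 20% of the AP axis length.
print('decay length λ =', lam)
print('signal above background at x = λ:', I_0 * np.exp(-1))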