Lesson 37: Performing regressions

(c) 2016 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import numpy as np
import pandas as pd

# We'll use scipy.optimize.curve_fit to do the nonlinear regression
import scipy.optimize

import matplotlib.pyplot as plt
import seaborn as sns
rc={'lines.linewidth': 2, 'axes.labelsize': 18, 'axes.titlesize': 18}
sns.set(rc=rc)

# The following is specific to Jupyter notebooks
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}

Bicoid and its gradient

In this tutorial, we will perform a nonlinear regression on data from a classic paper by Driever and Nüsslein-Volhard (Cell, 54, 95–104, 1988). In this paper, the authors identified Bicoid as the first morphogen, a chemical substance that controls cell fate in a concentration-dependent manner. In the context of Drosophila embryogenesis, Bicoid (Bcd) exhibits a concentration gradient, with a high concentration in the anterior region of the embryo and a low concentration in the posterior region. Below is an image of the Bicoid gradient from an immunostaining experiment in their paper.

Bicoid gradient

Driever and Nüsslein-Volhard quantified this gradient by measuring the darkness of the immunostain as a function of distance between the anterior and posterior of the embryo. Our goal in this lesson is to perform a regression to get the characteristic length of the Bcd gradient. Specifically, based on physical modeling we will not go into here, we expect the Bcd gradient to be exponential, or

\begin{align} c(x) = c_0 \mathrm{e}^{-x/\lambda}, \end{align}

where $c(x)$ is the Bcd concentration at position $x$, with $x = 0$ at the anterior and $x = 1$ at the posterior. That is, $x$ is in units of the total anterior-posterior (AP) axis length. The Bcd concentration at the anterior is $c_0$, and the decay length, or characteristic length, of the gradient is $\lambda$. We assume that the intensity of the immunostain is

\begin{align} I = a + b c, \end{align}

where $a$ is the background signal and $b$ is a constant of proportionality, i.e., the immunostain intensity varies linearly with Bcd concentration. Thus, the Bcd gradient, as visualized by immunostaining, is

\begin{align} I(x) = a + I_0\mathrm{e}^{-x/\lambda}, \end{align}

where $I_0 = b c_0$.

The data set

We will use the original data from Driever and Nüsslein-Volhard, available in ~/git/data/bcd_gradient.csv. We will first look at the file to see how we should import it.

In [3]:
!head data/bcd_gradient.csv
# Data taken from Fig. 3A of Driever and Nuesslein-Volhard, Cell,
# 54, 95-104, 1988.  This is the normalized immunostain intensity
# as a function of distance along the A-P axis in a wild type
# embryo.
fractional distance from anterior,[bcd] (a.u.)
0.0026446672455988007,0.86309692805632854
0.034222783610957611,0.86309692805632854
0.068445567221915221,0.74572710237919493
0.10262887818741613,0.65487479416273942
0.13420699455277493,0.54511384929873385

We see that this is a CSV file with two columns and comments preceded by #. The header row is given. So, we know how to use pd.read_csv to import the data.

In [5]:
# Import data set
df = pd.read_csv('data/bcd_gradient.csv', comment='#')

# Inspect DataFrame
df.head()
Out[5]:
   fractional distance from anterior  [bcd] (a.u.)
0                           0.002645      0.863097
1                           0.034223      0.863097
2                           0.068446      0.745727
3                           0.102629      0.654875
4                           0.134207      0.545114

For ease of access, we can rename the columns. We'll call them x and I_bcd.

In [9]:
# Rename columns
df = df.rename(columns={'fractional distance from anterior': 'x', 
                        '[bcd] (a.u.)': 'I_bcd'})

Let's plot the data to see what we are dealing with.

In [11]:
# Plot experimental Bcd gradient.
plt.plot(df['x'], df['I_bcd'], marker='.', linestyle='none')

# Label axes (no units on x because it's dimensionless)
plt.xlabel('x')
plt.ylabel('I (a.u.)')
Out[11]:
<matplotlib.text.Text at 0x117479cf8>

Using scipy.optimize.curve_fit() to perform nonlinear regression

We will use scipy.optimize.curve_fit to perform the regression. Since this is our first time using it, let's read the doc string to figure out how to use it.

In [12]:
scipy.optimize.curve_fit?

We see that the function prototype is

scipy.optimize.curve_fit(f, xdata, ydata, p0=None)

and that f is the function we wish to use to fit our data. The function f must have the prototype f(x, *p), where the first argument is the independent variable and the remaining arguments are the parameters to be determined by the regression.
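
To make the prototype concrete, here is a hypothetical example (a simple linear model, not the model we will fit below) of a function with this call signature.

# A hypothetical model with the prototype f(x, *p): the first argument is the
# independent variable and the rest are parameters to be determined by the fit
def linear_model(x, slope, intercept):
    """Hypothetical linear model, shown only to illustrate the prototype."""
    return slope * x + intercept

# We could then call, for example:
# scipy.optimize.curve_fit(linear_model, xdata, ydata, p0=np.array([1.0, 0.0]))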

Defining the model function

So, our first step for using scipy.optimize.curve_fit() is to define the fit function f, which we will call bcd_gradient_model. We have three parameters: $I_0$, $a$, and $\lambda$.

In [13]:
def bcd_gradient_model_first_try(x, I_0, a, lam):
    """Model for Bcd gradient: exponential decay plus background"""
    return a + I_0 * np.exp(-x/lam)

Notice that we used lam as the name of the parameter $\lambda$. This is because lambda is a keyword in Python. Never name a variable lambda. This is so important, I'm going to make the point more fervently, even though we already made it in a previous lesson.

**Never** name a Python variable the same as a keyword!


To see what the keywords are, do this: `import keyword; print(keyword.kwlist)`
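
If you just want to check a single name, the keyword module in the standard library can do that directly. For example:

import keyword

# lambda is a keyword, so it cannot be used as a variable name
print(keyword.iskeyword('lambda'))

# lam is not a keyword, so it is safe to use
print(keyword.iskeyword('lam'))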

There is a problem with our function, though. It is really only defined if all of its arguments are positive. We should therefore ensure that all arguments are positive.

In [14]:
def bcd_gradient_model(x, I_0, a, lam):
    """Model for Bcd gradient: exponential decay plus background"""
    
    assert np.all(np.array(x) >= 0), 'All values of x must be >= 0.'
    assert np.all(np.array([I_0, a, lam]) >= 0), 'All parameters must be >= 0.'
    
    return a + I_0 * np.exp(-x / lam)
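
As a quick sanity check (with made-up parameter values, not the results of any fit), we can verify that the function evaluates with nonnegative arguments and that the assertions catch a negative one.

# Evaluates fine with nonnegative arguments
print(bcd_gradient_model(0.5, 0.9, 0.2, 0.3))

# Uncommenting the line below would raise an AssertionError
# because of the negative decay length
# bcd_gradient_model(0.5, 0.9, 0.2, -0.3)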

Initial guess

With our fit function in place, we now need to supply initial guesses for the parameter values, given by the kwarg p0. (We don't have to do this, but if we do not, scipy.optimize.curve_fit() will guess a value of 1 for all parameters, which is generally not a good idea. You should always explicitly supply your own initial guesses.) Looking at the plot, we see that we indeed have a nonzero background signal, somewhere around $a \approx 0.2$. We also see that $I_0 \approx 0.9$ and $\lambda \approx 0.3$. We would normally use these as our approximate guesses, but to illustrate an additional lesson, we will guess $\lambda = 1$, making the solver do a little more work.

In [15]:
# Specify initial guess
I_0_guess = 0.9
a_guess = 0.2
lam_guess = 1.0

# Construct initial guess array
p0 = np.array([I_0_guess, a_guess, lam_guess])
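
As an aside, we could also pull rough guesses directly from the data instead of eyeballing the plot. This is just a sketch of one way to do it; the eyeballed values above work just as well.

# The background is roughly the smallest measured intensity, and I_0 is
# roughly the drop from the largest intensity down to that background
a_guess_alt = df['I_bcd'].min()
I_0_guess_alt = df['I_bcd'].max() - a_guess_alt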

Performing the regression (first try)

From the doc string, we see that scipy.optimize.curve_fit() returns the optimal parameter values as well as an estimate of the covariance matrix of the parameters. For reasons I will not discuss here, this covariance matrix relies on assumptions under the hood that are not always appropriate, so we will just ignore it. Now we're ready to do the regression!

In [16]:
# Do curve fit, but dump covariance into dummy variable
p, _ = scipy.optimize.curve_fit(bcd_gradient_model, df['x'], df['I_bcd'], p0=p0)

# Print the results
print("""
I_0 = {0:.2f}
  a = {1:.2f}
  λ = {2:.2f}
""".format(*tuple(p)))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-16-c59066ade0c2> in <module>()
      1 # Do curve fit, but dump covariance into dummy variable
----> 2 p, _ = scipy.optimize.curve_fit(bcd_gradient_model, df['x'], df['I_bcd'], p0=p0)
      3 
      4 # Print the results
      5 print("""

/Users/Justin/anaconda/lib/python3.5/site-packages/scipy/optimize/minpack.py in curve_fit(f, xdata, ydata, p0, sigma, absolute_sigma, check_finite, bounds, method, jac, **kwargs)
    674         # Remove full_output from kwargs, otherwise we're passing it in twice.
    675         return_full = kwargs.pop('full_output', False)
--> 676         res = leastsq(func, p0, Dfun=jac, full_output=1, **kwargs)
    677         popt, pcov, infodict, errmsg, ier = res
    678         cost = np.sum(infodict['fvec'] ** 2)

/Users/Justin/anaconda/lib/python3.5/site-packages/scipy/optimize/minpack.py in leastsq(func, x0, args, Dfun, full_output, col_deriv, ftol, xtol, gtol, maxfev, epsfcn, factor, diag)
    385             maxfev = 200*(n + 1)
    386         retval = _minpack._lmdif(func, x0, args, full_output, ftol, xtol,
--> 387                                  gtol, maxfev, epsfcn, factor, diag)
    388     else:
    389         if col_deriv:

/Users/Justin/anaconda/lib/python3.5/site-packages/scipy/optimize/minpack.py in func_wrapped(params)
    453     if weights is None:
    454         def func_wrapped(params):
--> 455             return func(xdata, *params) - ydata
    456     else:
    457         def func_wrapped(params):

<ipython-input-14-701a4019e1b2> in bcd_gradient_model(x, I_0, a, lam)
      3 
      4     assert np.all(np.array(x) >= 0), 'All values of x must be >= 0.'
----> 5     assert np.all(np.array([I_0, a, lam]) >= 0), 'All parameters must be >= 0.'
      6 
      7     return a + I_0 * np.exp(-x / lam)

AssertionError: All parameters must be >= 0.

Oh no! We got an exception because the model function was called with negative parameters! This occurred because, under the hood, scipy.optimize.curve_fit() tries many sets of parameter values as it searches for those that bring the theoretical curve closest to the observed data. (It is way more complicated than that, but we won't get into it here.)

Performing the regression (second try)

At this point, we have a few options.

  1. Take out our error checking on the positivity of the parameters. The resulting curve with negative values of $I_0$, $a$, or $\lambda$ will be so far off that the curve fit routine should come back toward physical parameter values after an excursion into non-physicality.
  2. Try using something other than scipy.optimize.curve_fit().
  3. Adjust our theoretical function by using the logarithms of the parameter values instead of the parameter values themselves. This ensures that all parameters are positive.

I think option 1 is off the table. We generally want to avoid "shoulds" when programming. We do not want to just hope that the solver will recover when nonphysical parameter values are encountered.

Depending on how many times I will need to do a particular type of curve fit, I typically prefer option 2. We will not do this in the bootcamp, but I often use a technique called Markov chain Monte Carlo.

We'll use option 3. This still allows the solver to consider a smooth function of the parameter values. So, let's define a function that takes the logarithms of the parameters as input.

In [17]:
def bcd_gradient_model_log_params(x, log_I_0, log_a, log_lam):
    """
    Model for Bcd gradient: exponential decay plus 
    background with log parameters.
    """
    
    # Exponentiate parameters
    I_0, a, lam = np.exp(np.array([log_I_0, log_a, log_lam]))
    
    return bcd_gradient_model(x, I_0, a, lam)
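
As a quick consistency check (using the p0 array defined above and an arbitrary value of x), the log-parameter version should agree with the original function when we hand it logged parameter values.

# These two calls should print the same number
print(bcd_gradient_model(0.5, *p0))
print(bcd_gradient_model_log_params(0.5, *np.log(p0)))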

Now let's try our curve fit again. We need to make sure we convert our initial guesses into logarithms. Remember, now the solver will be working with logarithms of parameters.

In [18]:
# Construct initial guess array
log_p0 = np.log(p0)

# Do curve fit, but dump covariance into dummy variable
log_p, _ = scipy.optimize.curve_fit(bcd_gradient_model_log_params, 
                                    df['x'], df['I_bcd'], p0=log_p0)

# Get the optimal parameter values
p = np.exp(log_p)

# Print the results
print("""
I_0 = {0:.2f}
  a = {1:.2f}
  λ = {2:.2f}
""".format(*tuple(p)))
I_0 = 0.77
  a = 0.17
  λ = 0.19

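Before plotting, we can get a rough, informal sense of the fit quality by computing the residual sum of squares, the quantity the underlying least squares routine minimizes. (This is just a quick check; the visual comparison below is more informative.)

# Residuals: measured intensities minus the model evaluated at the fit parameters
resid = df['I_bcd'] - bcd_gradient_model(df['x'], *tuple(p))

# Residual sum of squares
print(np.sum(resid**2))
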
Plotting the result

We can now generate a smooth curve defined by the optimal parameters we just found and plot it along with the data. To do this, we make a densely sampled array of $x$ values and then compute the fitting function for these values.

In [20]:
# Smooth x values (100 values between zero and one)
x_smooth = np.linspace(0, 1, 100)

# Compute smooth curve
I_smooth = bcd_gradient_model(x_smooth, *tuple(p))

# Plot everything together
plt.plot(x_smooth, I_smooth, marker='None', linestyle='-', color='gray')
plt.plot(df['x'], df['I_bcd'], marker='.', linestyle='none')

# Label axes
plt.xlabel('$x$')
plt.ylabel('$I$ (a.u.)')
Out[20]:
<matplotlib.text.Text at 0x11f70e7b8>

So, we have determined the length scale of the Bcd gradient: about 20% of the total embryo length.