Exercise 4.5: Data collapse


Rob Phillips wrote a review paper that I enjoyed entitled “Napoleon is in Equilibrium.”. In the paper, he demonstrated that when you plot data in a certain way, they demonstrate data collapse. The idea here is that if you choose the right thing to plot on the \(x\) and \(y\) axes, data from a variety of sources collapse onto a single universal curve. In this exercise, you will hone your Pandas, NumPy, and Bokeh skills in making plots exhibiting data collapse.

This analysis comes from Rob’s paper, and the data come from Daber, Sochor, and Lewis, J. Mol. Biol., 409, 76–87, 2011. The authors were studying how different mutants of the lac repressor affect gene expression. They hooked the lac promoter up to a fluorescent protein reporter. They made a mutant with no lac repressor to get a measurement of the gene expression level (quantified by the fluorescent signal) in the absence of repressor. Then then looked at how the presence of a repressor served to decrease the expression level of the lac gene. The ratio of the repressed fluorescence to the totally unrepressed fluorescence is the fold change in repression. They can block repression by adding IPTG, which binds the lac repressor, rendering it ineffective at repressing gene expression (so IPTG is called an “inducer,” since it turns on gene expression). So, for a given experiment, the authors measured fold change as a function of IPTG concentration. They measured the fold change for wild type, plus two mutants, Q18M and Q18A.

We will not derive it here (it comes from a generalization of the Monod-Wyman-Changeux model), but the theoretical expression for the fold change as a function of IPTG concentration, \(c\), is

\begin{align} \text{fold change} = \left[1 + \frac{\frac{R}{K}\left(1 + c/K_\mathrm{d}^\mathrm{A}\right)^2}{\left(1 + c/K_\mathrm{d}^\mathrm{A}\right)^2 + K_\mathrm{switch}\left(1 + c/K_\mathrm{d}^\mathrm{I}\right)^2}\right]^{-1}. \end{align}

The parameters are:

Parameter

Description

Value

Units

\(K_\mathrm{d}^\mathrm{A}\)

dissoc. const. for active repressor binding IPTG

0.017

mM\(^{-1}\)

\(K_\mathrm{d}^\mathrm{I}\)

dissoc. const. for inactive repressor binding IPTG

0.002

mM\(^{-1}\)

\(K_\mathrm{switch}\)

equil. const. for switching active/inactive

5.8

\(K\)

dissoc. const. for active repressor binding operator

?

per cell

\(R\)

number of repressors in cell

?

The values of \(K_\mathrm{d}^\mathrm{A}\), \(K_\mathrm{d}^\mathrm{I}\), and \(K_\mathrm{switch}\) were measured in the Daber, Sochor, and Lewis paper, and, as I mentioned before, are the same for all mutants. You can see in the expression for the fold change that \(R\) and \(K\) always appear as a ratio, \(R/K\), so we can only determine this ratio, \(R/K\), for each mutant. They are, for the respective mutants:

Mutant

\(R/K\)

WT

141.5

Q18A

16.56

Q18M

1332

Now let’s get started with the analysis.

a) Load in the three data sets. They are in the files ~/git/data/wt_lac.csv, ~/git/data/q18m_lac.csv, and ~/git/data/q18a_lac.csv. You should put them in a single DataFrame with an added column for genotype. This can be accomplished, for example, with pd.concat().

b) Make a plot of fold change vs. IPTG concentration for each of the three mutants. Think: should any of the axes have a logarithmic scale?

c) Write a function with the signature fold_change(c, RK, KdA=0.017, KdI=0.002, Kswitch=5.8) to compute the theoretical fold change. It should allow c, the concentration of IPTG, to be passed in as a NumPy array or scalar, and RK, the \(R/K\) ratio, must be a scalar. Remember, with NumPy arrays, you don’t have to write for loops to do operations to each element of the array.

d) You will now plot a smooth curve showing the theoretical fold change for each mutant. >1. Make an array of closely spaced points for the IPTG concentration. Hint: The function np.logspace() will be useful. 2. Compute the theoretical fold change based on the given parameters using the function you wrote in part (c). 3. Plot the smooth curves on the same plot with the data.

e) If we look at the functional form of the fold change and at the parameters we are given, we see that only \(R/K\) varies from mutant to mutant. I told you this a priori, but we didn’t really know it. Daber, Sochol, and Lewis assumed that the binding to IPTG would be unaltered and the binding to DNA would be altered based on the position of the mutation in the lac repressor protein. Now, if this is true, then \(R/K\) should be the only thing that varies. We can check this by seeing if the data collapse onto a single curve. To see how this works, we define the Bohr parameter, \(F(c)\), as

\begin{align} F(c) = -\ln\left(R/K\right) - \ln\left(\frac{\left(1 + c/K_\mathrm{d}^\mathrm{A}\right)^2}{\left(1 + c/K_\mathrm{d}^\mathrm{A}\right)^2 + K_\mathrm{switch}\left(1 + c/K_\mathrm{d}^\mathrm{I}\right)^2}\right). \end{align}

The second term in the Bohr parameter is independent of the identity of the mutant, and the first term depends entirely upon it. Then, the fold change can be written as

\begin{align} \text{fold change} = \frac{1}{1 + \mathrm{e}^{-F(c)}}. \end{align}

So, if we make our \(x\)-axis be the Bohr parameter, all data should fall on the same curve. Hence the term, data collapse. (The Bohr parameter gets its name (as given by Rob Phillips) because it is inspired by the work of Christian Bohr (Niels’s father), who discovered similar families of curves describing binding of oxygen to hemoglobin.)

Now, we will plot the theoretical curve of fold change versus Bohr parameter.

  1. Write a function with call signature bohr_parameter(c, RK, KdA=0.017, KdI=0.002, Kswitch=5.8) that computes the Bohr parameter.

  2. Write a function with call signature fold_change_bohr(bohr_parameter) that gives the fold change as a function of the Bohr parameter.

  3. Generate values of the Bohr parameter ranging from \(-6\) to \(6\) in order to make a smooth plot.

  4. Compute the theoretical fold change as a function of the Bohr parameter and plot it as a gray line.

f) Now, for each experimental curve:

  1. Convert the IPTG concentration to a Bohr parameter using the given parameters.

  2. Plot the experimental fold change versus the Bohr parameter you just calculated. Plot the data as dots on the same plot that you made the universal gray curve, making sure to appropriately annotate your plot.

Do you see data collapse? Does it make sense that only operator binding is changing from mutant to mutant? And importantly, the collapse demonstrates that all of the mutants are behaving according to the Monod-Wyman-Changeux model, and the mutations affect quantitative, not qualitative, changes in the behavior of the repressor.

Solution

[1]:
import numpy as np
import pandas as pd

import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()
Loading BokehJS ...

a) After checking out the files, we see that there are two comment lines, starting the #. We’ll load in the data frames, add the genotype column, and then concatenate them to make a single data frame.

[2]:
df_wt = pd.read_csv('data/wt_lac.csv', comment='#')
df_q18a = pd.read_csv('data/q18a_lac.csv', comment='#')
df_q18m = pd.read_csv('data/q18m_lac.csv', comment='#')

# Add genotype column to DataFrames
df_wt['genotype'] = 'WT'
df_q18a['genotype'] = 'Q18A'
df_q18m['genotype'] = 'Q18M'

# Concatenate into a single DataFrame
df = pd.concat([df_wt, df_q18a, df_q18m], ignore_index=False)

# Take a look
df.head()
[2]:
[IPTG] (mM) fold change genotype
0 0.000010 0.040697 WT
1 0.000974 0.072788 WT
2 0.002044 0.105284 WT
3 0.004049 0.199477 WT
4 0.007584 0.343024 WT

b) The IPTG concentration varies over five or six orders of magnitude, so we should have the IPTG (x) axis be on a logarithmic scale. The fold change is all within one order of magnitude, so we keep the y-axis on a linear scale.

[3]:
# Set up figure
p = bokeh.plotting.figure(
    frame_height=300,
    frame_width=450,
    x_axis_label='[IPTG] (mM)',
    y_axis_label='fold change',
    x_axis_type='log',
)

# Color palette
palette = bokeh.palettes.d3['Category10'][10]

# Populate glyphs, looping through grouped dataframe
for i, (name, group) in enumerate(df.groupby('genotype')):
    p.circle(
        source=group,
        x='[IPTG] (mM)',
        y='fold change',
        color=palette[i],
        legend_label=name
    )

# Set legend properties
p.legend.location = 'top_left'
p.legend.click_policy = 'hide'

bokeh.io.show(p)

c) We use the convenience of NumPy’s broadcasting. We can almost just type it out as if we were computing it for a single value of \(c\).

[4]:
# Fold change function
def fold_change(c, RK, KdA=0.017, KdI=0.002, Kswitch=5.8):
    """Compute theoretical fold change for MWC model."""
    # Inverse fold change
    inv_fc = 1 + (1 + c/KdA)**2 * RK / ((1 + c/KdA)**2 + Kswitch*(1 + c/KdI)**2)

    return 1 / inv_fc

d) First, we need to make our theoretical curves and plot them.

[5]:
# Theoretical concentration
c = np.logspace(-6, 2, 200)

# Dictionary or parameter values
RK = dict(
    WT=141.5,
    Q18A=16.56,
    Q18M=1332,
)

# Make smooth curves and plot them
for i, (name, _) in enumerate(df.groupby('genotype')):
    p.line(
        x=c,
        y=fold_change(c, RK[name]),
        color=palette[i],
        line_width=2
    )

bokeh.io.show(p)

e) We’ll write a function to compute the Bohr parameter.

[6]:
def bohr_parameter(c, RK, KdA=0.017, KdI=0.002, Kswitch=5.8):
    """Compute Bohr parameter based on MWC model."""
    # Big nasty argument of logarithm
    log_arg = (1 + c/KdA)**2 / ((1 + c/KdA)**2 + Kswitch*(1 + c/KdI)**2)

    return -np.log(RK) - np.log(log_arg)

Next, we’ll make a plot of the fold change as a function of the Bohr parameter.

[7]:
# Compute smooth function for fold change
bohr = np.linspace(-6, 6, 400)
fold_change_bohr = 1 / (1 + np.exp(-bohr))

p = bokeh.plotting.figure(
    frame_height=300,
    frame_width=450,
    x_axis_label='Bohr parameter',
    y_axis_label='fold change',
    x_range=[-6, 6],
)

p.line(
    x=bohr,
    y=fold_change_bohr,
    line_color='gray',
    line_width=2,
)

bokeh.io.show(p)

f) Adding the experimental fold changes to the plot is now easy. We compute the Bohr parameter from the concentration (and the rest of the parameters) to give the x values and we use the experimentally determined values for the y values.

[8]:
for i, (name, group) in enumerate(df.groupby('genotype')):
    p.circle(
        x=bohr_parameter(group['[IPTG] (mM)'], RK[name]),
        y=group['fold change'],
        color=palette[i],
        legend_label=name,
    )

p.legend.location = 'bottom_right'

bokeh.io.show(p)

When plotted this way, all data fall on the same curve, clearly demonstrating data collapse. This suggests that indeed the only thing varying between the different mutants is the parameter \(R/K\), the number of repressors divided by the repressor-operator dissociation constant.

Computing environment

[9]:
%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,jupyterlab
Python implementation: CPython
Python version       : 3.9.12
IPython version      : 8.3.0

numpy     : 1.21.5
pandas    : 1.4.2
bokeh     : 2.4.2
jupyterlab: 3.3.2