(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This tutorial was generated from a Jupyter notebook. You can download the notebook here.
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
Sometimes you need to plot your data with a logarithmic scale. As an example, let's consider the classic genetic switch engineered by Jim Collins and coworkers (Gardner, et al., Nature, 403, 339, 2000). This genetic switch was incorporated into E. coli and is inducible by adjusting the concentration of the lactose analog IPTG. The readout is the fluorescence intensity of GFP.
Let's load in some data that have the IPTG concentrations and GFP fluorescence intensity. The data are in the file ~/git/data/collins_switch.csv. Let's look at it.
!cat data/collins_switch.csv
The file begins with two rows of non-data. After that, the first column is the IPTG concentration, the second column is the normalized GFP expression level, and the last column is the standard error of the mean normalized GFP intensity (columns 0, 1, and 2 in NumPy's zero-based indexing). The last column gives the error bars, which we will look at in the next exercise. For now, we will just plot normalized GFP intensity versus IPTG concentration.
In looking at the data set, note that there are two entries for [IPTG] = 0.04 mM. At this concentration, the switch happens, and there are two populations of cells, one with high expression of GFP and one with low. The two data points represent these two populations of cells.
Now, let's make a plot of normalized GFP intensity versus IPTG concentration.
- Load in the data set using np.loadtxt(). Be sure to use the delimiter=',' and skiprows=2 kwargs.
- Slice column 0 out of the data and store it as iptg.
- Slice column 1 out of the data and store it as gfp.
- Make a plot of normalized GFP intensity ($y$-axis) versus IPTG concentration ($x$-axis) using Matplotlib's plot() function. These are data, so you should not make the plot as lines. Use the marker='.' and linestyle='none' kwargs.
- Label your axes.
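The steps above might look something like the following minimal sketch. Because the real file is not reproduced here, the sketch reads a small in-memory stand-in with the same layout (two non-data rows, then three comma-separated columns). The header text and every number in it are made-up placeholders, not the measured data. With the real file, you would pass its path to np.loadtxt() in place of the StringIO object.

```python
import io

import numpy as np
import matplotlib.pyplot as plt

# Stand-in for collins_switch.csv: two non-data rows, then three columns.
# The header text and all numbers are placeholders, NOT the real data.
fake_csv = io.StringIO(
    "placeholder description row\n"
    "iptg,gfp,sem\n"
    "0.001,0.01,0.002\n"
    "0.01,0.02,0.003\n"
    "0.04,0.10,0.02\n"
    "0.04,0.90,0.05\n"
    "0.1,0.95,0.04\n"
    "1.0,1.00,0.03\n"
)

# Skip the two non-data rows and split each line on commas
data = np.loadtxt(fake_csv, delimiter=',', skiprows=2)

# Slice out the columns we want to plot
iptg = data[:, 0]
gfp = data[:, 1]

# Plot as points, not lines, since these are measured data
fig, ax = plt.subplots()
ax.plot(iptg, gfp, marker='.', linestyle='none')
ax.set_xlabel('IPTG (mM)')
ax.set_ylabel('normalized GFP intensity')
```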
Now that you have done that, there are some problems with the plot. It is really hard to see the data points with low concentrations of IPTG. In fact, looking at the data set, the concentration of IPTG varies over four orders of magnitude. When you have data like this, it is wise to plot them on a logarithmic scale. You can set the scale to be logarithmic using the set_xscale() method.
ax.set_xscale('log')
For this data set, it is best to have the $x$ axis on a logarithmic scale. Remake the plot you just did with a logarithmically scaled $x$-axis.
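Here is a minimal sketch of what the remade plot could look like. The arrays are placeholder values chosen only to span a few orders of magnitude; they are not the measured data. Keep in mind that a logarithmic axis cannot display values that are zero or negative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder values spanning several orders of magnitude (NOT the real data)
iptg = np.array([0.001, 0.01, 0.04, 0.04, 0.1, 1.0])
gfp = np.array([0.01, 0.02, 0.10, 0.90, 0.95, 1.00])

fig, ax = plt.subplots()
ax.plot(iptg, gfp, marker='.', linestyle='none')

# Switch only the x-axis to a logarithmic scale
ax.set_xscale('log')

ax.set_xlabel('IPTG (mM)')
ax.set_ylabel('normalized GFP intensity')
```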
The data set also contains the standard error of the mean, or SEM. The SEM is often displayed on plots as error bars. To make a plot with error bars, you can use the plt.errorbar() function.
- Read the documentation of plt.errorbar().
- Slice column 2 out of the Collins data set and store it as sem.
- Make a plot of the genetic switch data using ax.errorbar(), setting the kwarg yerr=sem.
- Label your axes.
There is also a problem with the GFP signals at low IPTG. The error bars are tiny, so it is hard to see the symbol. Now, play with different kwargs to make your error bar plot look the way you like. I recommend these kwargs:
linestyle='none'
marker='.'
markersize=10
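Putting the recommended kwargs together, an error bar plot might be sketched as follows. As before, the three arrays are made-up placeholders standing in for columns 0, 1, and 2 of the data set, not the real measurements.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder values standing in for the three columns (NOT the real data)
iptg = np.array([0.001, 0.01, 0.04, 0.04, 0.1, 1.0])
gfp = np.array([0.01, 0.02, 0.10, 0.90, 0.95, 1.00])
sem = np.array([0.002, 0.003, 0.02, 0.05, 0.04, 0.03])

fig, ax = plt.subplots()

# yerr draws vertical error bars; the marker kwargs keep the points visible
ax.errorbar(iptg, gfp, yerr=sem, linestyle='none', marker='.', markersize=10)

ax.set_xscale('log')
ax.set_xlabel('IPTG (mM)')
ax.set_ylabel('normalized GFP intensity')
```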
We plotted the measured cross-sectional areas of the C. elegans eggs from the Harvey and Orbidans paper as histograms. A histogram is a way of approximately representing the probability density function, or PDF, describing the data. The cumulative distribution function, or CDF, contains the same information as the PDF; it is just the integral of the PDF. Importantly, we can plot the data to show what the CDF looks like, the so-called empirical cumulative distribution function, or ECDF, without the binning bias inherent in histograms.
To plot an ECDF, the $x$-values are the sorted values of the array of data. The values of the $y$ axis are $y_j = (j+1) / n$, where $n$ is the number of data points and $0 \le j \le n-1$.
- Write a function to compute the ECDF. The call signature is ecdf(data). It returns the x and y values needed to plot the ECDF.
  - Compute x by sorting the array data.
  - Compute y. Think carefully about how to do this. The two functions you need are np.arange() and len().
  - Return x, y.
- Load in the data sets xa_high_food.csv and xa_low_food.csv.
- Generate x and y values for the ECDFs for these data sets.
- Plot the ECDFs for the high food and low food data as dots. That is, use the kwargs marker='.' and linestyle='none' when you call ax.plot(). Note that to add additional plots to a given figure, you just call ax.plot() again. Be sure to label your axes. "ECDF" is a reasonable label for your $y$-axis.
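One possible way to write ecdf() and use it is sketched below. Loading the CSV files is left out because their layout is not shown here; instead, the sketch draws random placeholder numbers (the means, spreads, and sample sizes are arbitrary assumptions) so that it runs on its own. The axis label leaves out units, since they are not stated here.

```python
import numpy as np
import matplotlib.pyplot as plt


def ecdf(data):
    """Return x and y values for plotting the ECDF of a 1D array."""
    # x-values are the sorted data
    x = np.sort(data)

    # y-values are (j + 1) / n for j = 0, 1, ..., n - 1
    y = np.arange(1, len(data) + 1) / len(data)

    return x, y


# Random placeholders standing in for the two data sets (NOT the real data)
rng = np.random.default_rng(seed=0)
xa_high = rng.normal(2000, 150, size=40)
xa_low = rng.normal(2100, 150, size=50)

x_high, y_high = ecdf(xa_high)
x_low, y_low = ecdf(xa_low)

# Each call to ax.plot() adds another set of points to the same axes
fig, ax = plt.subplots()
ax.plot(x_high, y_high, marker='.', linestyle='none')
ax.plot(x_low, y_low, marker='.', linestyle='none')
ax.set_xlabel('cross-sectional area')
ax.set_ylabel('ECDF')
```

With this definition, the largest data point always has a $y$-value of exactly 1, and the smallest has a $y$-value of $1/n$.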
I think far too few papers use ECDFs in displaying data. They are far better than histograms. I hope that now that you have found them, you will use them in your own research.
Speaking of your own work, hopefully you will use some of the code you write here in bootcamp going forward. As such, set up a module, bootcamp_utils, that contains some nice functions you write as you go through the bootcamp. We'll start with ecdf().
- Use Atom to open the file ~/git/bootcamp/bootcamp_utils.py.
- Put an appropriate doc string at the top.
- Import Numpy at the top. Later on, as you add more functions, you might need to include more imports.
- Put your code for your ecdf() function in that file.
- Save and close the file.
- Use git add to put bootcamp_utils.py under version control.
- Use git commit to commit it. Don't forget the -m flag and your commit message.
- Use git push origin master to push your changes to the repository. Remember, you are working on your own fork, so there will be no conflicts with the upstream repository or with your classmates.
We might be interested to see if the egg cross-section data follow a Normal distribution. After all, this is commonly an underlying assumption when people report data from repeated measurements in the literature.
One way to assess this is to plot the theoretical CDF with the same mean and standard deviation as the data on top of the ECDFs. (There are better graphical ways to do this, but this is ok for our purposes here.) We know the cumulative distribution function for a Normal distribution with mean $\mu$ and standard deviation $\sigma$ is
\begin{align} \mathrm{cdf}(x) = \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{x - \mu}{\sqrt{2\sigma^2}}\right)\right), \end{align}
but instead of coding this up directly, we can use the scipy.stats module to do it for us! We just need to supply where we want it evaluated ($x$), and the mean (the location parameter) and standard deviation (the scale parameter). Something like this:
# Make smooth x-values
x = np.linspace(1600, 2500, 400)
# Compute theoretical Normal distribution
cdf_theor = scipy.stats.norm.cdf(x, loc=np.mean(xa_low), scale=np.std(xa_low))
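As a quick sanity check, the erf expression for the Normal CDF can be compared against scipy.stats.norm.cdf(). In this sketch, the values of mu and sigma are arbitrary illustrative choices rather than estimates from the egg data, and scipy.special.erf() supplies the error function.

```python
import numpy as np
import scipy.special
import scipy.stats

# Illustrative parameters (arbitrary choices, NOT estimated from data)
mu = 2000.0
sigma = 150.0

x = np.linspace(1600, 2500, 400)

# The Normal CDF written out directly with the error function
cdf_formula = 0.5 * (1 + scipy.special.erf((x - mu) / np.sqrt(2 * sigma**2)))

# The same CDF from scipy.stats, with loc as the mean and scale as the std
cdf_scipy = scipy.stats.norm.cdf(x, loc=mu, scale=sigma)

# The two should agree to within floating point error
agree = np.allclose(cdf_formula, cdf_scipy)
```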
Now, let's make the plot.
- Regenerate your ECDFs. This time, import the bootcamp_utils module and call bootcamp_utils.ecdf() to get your x and y values.
- Make smooth curves of the Normal CDF using scipy.stats.norm.cdf().
- Plot those curves using ax.plot(). You can use the color='gray' keyword argument to set the color of the smooth curves. Note that we plot the smooth curves first, and then the ECDFs (which are the raw data) so that the smooth curves do not obscure the data.
- Plot the ECDFs as dots, as before. Use the label kwarg to name these ECDFs, e.g., with label='high food'.
- Make a legend. Only graphic objects that were created with ax.plot() using a label kwarg show up in the legend.
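The whole procedure might be sketched as below. To keep the sketch self-contained, it defines a local ecdf() as a stand-in for bootcamp_utils.ecdf() and uses random placeholder numbers (arbitrary means, spreads, and sample sizes) in place of the real egg data.

```python
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt


def ecdf(data):
    """Local stand-in for bootcamp_utils.ecdf()."""
    return np.sort(data), np.arange(1, len(data) + 1) / len(data)


# Random placeholders standing in for the two data sets (NOT the real data)
rng = np.random.default_rng(seed=0)
xa_high = rng.normal(2000, 150, size=40)
xa_low = rng.normal(2100, 150, size=50)

# Smooth x-values for the theoretical curves
x_smooth = np.linspace(1600, 2500, 400)

fig, ax = plt.subplots()

# Plot the smooth Normal CDFs first so they do not obscure the data
for xa in (xa_high, xa_low):
    cdf_theor = scipy.stats.norm.cdf(
        x_smooth, loc=np.mean(xa), scale=np.std(xa)
    )
    ax.plot(x_smooth, cdf_theor, color='gray')

# Plot the ECDFs as dots; only these labeled plots appear in the legend
for xa, label in ((xa_high, 'high food'), (xa_low, 'low food')):
    x, y = ecdf(xa)
    ax.plot(x, y, marker='.', linestyle='none', label=label)

ax.set_xlabel('cross-sectional area')
ax.set_ylabel('ECDF')
legend = ax.legend()
```

Because the gray curves were plotted without a label kwarg, the legend lists only the two ECDFs.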