Lesson 24: Practice with Pandas and Bokeh solution

(c) 2019 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This lesson was generated from a Jupyter notebook. You can download the notebook here.

import pandas as pd

import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()

Practice 1: Axes with logarithmic scale

Sometimes you need to plot your data with a logarithmic scale. As an example, let's consider the classic genetic switch engineered by Jim Collins and coworkers (Gardner, et al., Nature, 403, 339, 2000). This genetic switch was incorporated into E. coli and is inducible by adjusting the concentration of the lactose analog IPTG. The readout is the fluorescence intensity of GFP.

Let's load in some data that have the IPTG concentrations and GFP fluorescence intensity. The data are in the file ~/git/data/collins_switch.csv. Let's look at the file.

!cat data/collins_switch.csv

# Data digitized from Fig. 5a of Gardner, et al., *Nature*, **403**, 339, 2000. The last column gives the standard error of the mean normalized GFP intensity.
[IPTG] (mM),normalized GFP expression (a.u.),sem
0.001000,0.004090,0.003475
0.010000,0.010225,0.002268
0.020000,0.022495,0.004781
0.030000,0.034765,0.003000
0.040000,0.067485,0.006604
0.040000,0.668712,0.087862
0.060000,0.740286,0.045853
0.100000,0.840491,0.058986
0.300000,0.936605,0.026931
0.600000,0.961145,0.093553
1.000000,0.940695,0.037624
3.000000,0.852761,0.059035
6.000000,0.910020,0.051052
10.000000,0.893661,0.042773

The file begins with two rows that are not data: a comment line starting with # and a header row giving the column names. After that, column 1 is the IPTG concentration, column 2 is the normalized GFP expression level, and the last column is the standard error of the mean normalized GFP intensity. The last column gives the error bars, which we will look at in the next exercise. For now, we will just plot normalized GFP intensity versus IPTG concentration.
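As a small illustration of how a file laid out like this can be parsed, here is a sketch that reads the first few lines shown above from an in-memory string instead of from disk. The io.StringIO stand-in and the names csv_text and df_head are assumptions made for this example only.

```python
import io

import pandas as pd

# First few lines of the file, pasted in as a string for illustration
csv_text = """# Data digitized from Fig. 5a of Gardner, et al., *Nature*, **403**, 339, 2000. The last column gives the standard error of the mean normalized GFP intensity.
[IPTG] (mM),normalized GFP expression (a.u.),sem
0.001000,0.004090,0.003475
0.010000,0.010225,0.002268
0.020000,0.022495,0.004781
"""

# Lines beginning with '#' are ignored; the next line becomes the header
df_head = pd.read_csv(io.StringIO(csv_text), comment='#')

print(df_head.columns.tolist())
print(df_head.shape)
```

The comment line is skipped entirely, so the column names come from the second line of the file and only the numerical rows end up in the DataFrame.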

In looking at the data set, note that there are two entries for [IPTG] = 0.04 mM. At this concentration, the switch happens, and there are two populations of cells, one with high expression of GFP and one with low. The two data points represent these two populations of cells.
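If you would rather find the repeated concentration programmatically than by eye, one possible approach is sketched below. It uses a handful of rows copied from the file above; the names csv_text, df_sub, and repeated are hypothetical and chosen just for this illustration.

```python
import io

import pandas as pd

# A few rows from the data set around the switching concentration
csv_text = """[IPTG] (mM),normalized GFP expression (a.u.),sem
0.030000,0.034765,0.003000
0.040000,0.067485,0.006604
0.040000,0.668712,0.087862
0.060000,0.740286,0.045853
"""
df_sub = pd.read_csv(io.StringIO(csv_text))

# keep=False marks every row whose IPTG concentration appears more than once
repeated = df_sub[df_sub.duplicated(subset='[IPTG] (mM)', keep=False)]

print(repeated)
```

Both rows at 0.04 mM are returned, one for the low-expression population and one for the high-expression population.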

Now, let's make a plot of GFP versus IPTG.

Load in the data set using Pandas. Make sure you use the comment kwarg of pd.read_csv() properly.

Make a plot of normalized GFP intensity (y-axis) versus IPTG concentration (x-axis).

Now that you have done that, you will see some problems with the plot. It is really hard to see the data points at low concentrations of IPTG. In fact, looking at the data set, the concentration of IPTG varies over four orders of magnitude. When you have data like this, it is wise to plot them on a logarithmic scale. You can specify the x-axis as logarithmic when you instantiate a figure with bokeh.plotting.figure() by using the x_axis_type='log' kwarg. (The analogous kwarg, y_axis_type, applies to the y-axis.) For this data set, it is definitely best to have the x-axis on a logarithmic scale. Remake the plot you just did with the x-axis logarithmically scaled.
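The "four orders of magnitude" figure can be checked with a quick calculation. The sketch below takes the smallest and largest IPTG concentrations listed in the file; the variable names are chosen for this example.

```python
import math

# Smallest and largest IPTG concentrations in the data set (mM)
c_min = 0.001
c_max = 10.0

# Number of decades (powers of ten) spanned by the concentrations
decades = math.log10(c_max / c_min)

print(round(decades, 6))
```

On a linear axis, every point below about 0.1 mM is squeezed into the first percent of the axis, which is why the low-concentration points are so hard to see.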

When you make the x-axis logarithmically scaled, you will notice that Bokeh's formatting for the tick labels is pretty awful. Fixing this is a surprisingly difficult problem, and many plotting packages do not make pretty superscripts.

Practice 1: solution

# Load in the data
df = pd.read_csv('data/collins_switch.csv', comment='#')

# For convenience
x = '[IPTG] (mM)'
y = 'normalized GFP expression (a.u.)'

# Make the plot
p = bokeh.plotting.figure(
    height=300,
    width=450,
    x_axis_label=x,
    y_axis_label=y,
)

p.circle(
    source=df,
    x=x,
    y=y
)

bokeh.io.show(p)

We clearly need the $x$-axis to be on a log scale, so let's remake the plot.

# Make the plot
p = bokeh.plotting.figure(
    height=300,
    width=450,
    x_axis_label=x,
    y_axis_label=y,
    x_axis_type='log',
)

p.circle(
    source=df,
    x=x,
    y=y,
)

bokeh.io.show(p)

Practice 2: Plots with error bars

The data set also contains the standard error of the mean, or SEM. The SEM is often displayed on plots as error bars. Now construct the plot with error bars.

Add columns error_low and error_high to the DataFrame containing the Collins data. These will set the bottoms and tops of the error bars. You should base the values in these columns on the standard error of the mean (sem). Assuming a Gaussian model, the 95% confidence interval extends 1.96 times the s.e.m. below and above the measured value.
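For readers wondering where the factor of 1.96 comes from, the sketch below recovers it from the standard Normal distribution. It assumes Python 3.8 or newer, because statistics.NormalDist is not available in the Python 3.7 environment listed at the end of this lesson; scipy.stats.norm.ppf would be an alternative there.

```python
from statistics import NormalDist

# A 95% interval leaves 2.5% in each tail of the Gaussian,
# so the multiplier is the 97.5th percentile of the standard Normal
z = NormalDist().inv_cdf(0.975)

print(round(z, 2))
```

The error bar for each point therefore runs from the measured value minus z times the s.e.m. to the measured value plus z times the s.e.m.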

Make a plot with the measured expression levels and the error bars. Hint: Check out the Bokeh docs and think about what kind of glyph works best for error bars.

Practice 2: solution

# Add the bounds of the 95% confidence interval to the DataFrame
df['error_low'] = df[y] - 1.96 * df['sem']
df['error_high'] = df[y] + 1.96 * df['sem']

# Add error bars
p.segment(
    source=df,
    x0=x,
    y0='error_low',
    x1=x,
    y1='error_high',
    line_width=2
)

bokeh.io.show(p)

Using the interactivity, you can zoom in on the point with the smallest IPTG concentration. Note the value of interactive graphics here. Even though the error bar is smaller than the dot on the zoomed-out plot, it is resolvable upon zooming in. That is not possible with a static image, even a vector graphic, where the dot and the error bar scale together. Note also a common problem with using an s.e.m. to compute error bars: the result can be unphysical. Here, the error bar extends below zero, even though a normalized fluorescence intensity cannot be negative. In general, using the s.e.m. for error bars is not a good idea.
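The unphysical error bar can also be located without zooming. The following sketch repeats the lower-bound calculation on the three lowest-concentration rows copied from the file; the names csv_text, df_low, and unphysical are assumptions for this illustration.

```python
import io

import pandas as pd

# The three lowest-concentration rows of the data set
csv_text = """[IPTG] (mM),normalized GFP expression (a.u.),sem
0.001000,0.004090,0.003475
0.010000,0.010225,0.002268
0.020000,0.022495,0.004781
"""
df_low = pd.read_csv(io.StringIO(csv_text))

# Lower end of the error bar, computed the same way as in the solution
df_low['error_low'] = (
    df_low['normalized GFP expression (a.u.)'] - 1.96 * df_low['sem']
)

# Rows where the error bar dips below zero
unphysical = df_low[df_low['error_low'] < 0]

print(unphysical[['[IPTG] (mM)', 'error_low']])
```

Only the point at 0.001 mM has a negative lower bound, since there 1.96 times the s.e.m. exceeds the measured value itself.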

Computing environment

%load_ext watermark
%watermark -v -p pandas,bokeh,jupyterlab

CPython 3.7.3
IPython 7.1.1

pandas 0.24.2
bokeh 1.2.0
jupyterlab 0.35.5