(c) 2019 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This lesson was generated from a Jupyter notebook. You can download the notebook here.
import os
import glob
import numpy as np
import pandas as pd
import skimage.io
import skimage.morphology
import skimage.segmentation
import skimage.measure
import bokeh_catplot
import bootcamp_utils
import bokeh.io
bokeh.io.output_notebook()
Do any of the previous exercises you have not completed and are interested in working on.
At the dawn of the molecular revolution in biology, key experiments by Jacques Monod in which he measured growth curves of bacteria under different conditions exposed some of the mechanisms of regulation of gene expression. Those growth curves were measured in a bulk solution. In this exercise, we will measure bacterial growth starting from two bacteria. The movie shows Bacillus subtilis constitutively expressing mCherry growing under slow growth conditions. These data were kindly donated by Jin Park from the Elowitz lab at Caltech.
a) Load in the series of images contained in the directory ~git/bootcamp/data/bacterial_growth/
. Be sure that however you store them (a list or tuple or other object) has the frames in the proper order.
b) Segment the images to separate bacteria from background. You do not need to segment individual bacteria; this would likely require some more advanced techniques involving edge detection that we haven't covered in bootcamp.
c) Show a representative image from the stack (with the segmentation overlayed) of images. Be sure to check out the README
file in the directory containing the images to get the interpixel distance.
d) If $m$ is the mass of bacteria, which is proportional to the area in the images, then, for exponential growth,
\begin{align} m(t) = m_0\,\mathrm{e}^{rt}, \end{align}where $r$ is the growth rate. Taking the logarithm of both sides gives
\begin{align} \ln m(t) = \ln m_0 + rt. \end{align}So, the slope of the line on a log-log scale is the growth rate.
With this in mind, plot the bacterial area as a function of time with the area on a log scale. If the plot is linear, the bacteria are performing exponential growth.
e) To get the slope of the line, you can use the np.polyfit()
function and perform the regression on a semilog scale. Alternatively, you can fit the exponential function directly using scipy.optimize.curve_fit()
. Once you find the growth rate (and intercept), plot the theoretical growth curve along with the data.
a) We will load the images into a tuple, where the index of the tuple corresponds to the frame number. Under normal circumstances, we would do data validation to make sure there are no skipped frames, but in the interest of brevity, we will proceed assuming all data are properly labeled and organized.
# Use a list comprehension to read in the images
ims = [skimage.io.imread(fname)
for fname in sorted(glob.glob('data/bacterial_growth/bacillus_*.tif'))]
# Store it as a tuple so we don't mess with it
ims = tuple(ims)
# How many?
print('There are', len(ims), 'images.')
Let's take a quick look at a few of the images. We'll look at frames 0
, 15
, 30
, and 45
. For now, we will not worry about setting the scales of the axes.
plots = [bootcamp_utils.bokeh_imshow(ims[i], plot_height=300) for i in [0, 15, 30, 45]]
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))
b) We will take a similar approach as in Lesson 35 to segment the image. We will simply threshold the image using Otsu's method.
# Threshold each image
ims_bw = [im > skimage.filters.threshold_otsu(im) for im in ims]
plots = [
bootcamp_utils.bokeh_imshow(ims_bw[i], plot_height=300) for i in [0, 15, 30, 45]
]
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))
This is pretty good. Certainly good enough for generating a growth curve.
c) We will overlay the segmentation on image 40. We'll use the cyan channel. In looking at the README file associated with the images, the interpixel distance is 64.5 nm.
def display_overlay(im, im_bw):
"""Create RGB image with cyan channel saturated via segmentation mask."""
# Make float and normalized image
im = np.copy(im).astype(float)
# Build RGB image by stacking grayscale images
im_rgb = np.dstack(3 * [im / im.max()])
# Saturate cyan channel wherever there are white pixels in thresh image
im_rgb[im_bw, 0] = 1.0
return im_rgb
# Show the result
im_rgb = display_overlay(ims[40], ims_bw[40])
p = bootcamp_utils.bokeh_imshow(im_rgb,
interpixel_distance=0.0645,
length_units='µm')
bokeh.io.show(p)
The segmentation overshoots the bacteria. We might want to try to refine this a bit. We could do adaptive thresholding. We perform automatic thresholding on subimages throughout the image. We will take subimages to be 51 $\times$ 51, since that would span the width of one bacterium, and then some, thereby avoiding the issue of thresholding within a single bacterium (and adaptive thresholding needs an odd image block size). Adaptive thresholding is achieved using skimage.filters.threshold_local()
. Let's give it a shot.
# List of thresholded images
ims_bw = [im > skimage.filters.threshold_local(im, 51) for im in ims]
# Show the result
im_rgb = display_overlay(ims[40], ims_bw[40])
p = bootcamp_utils.bokeh_imshow(im_rgb,
interpixel_distance=0.0645,
length_units='µm')
bokeh.io.show(p)
Whoa! It did a great job with the bacteria, but the background got really messed up. We could include pixels that are unity in both the adaptive and Otsu thresholding to take care of this. We then use np.logical_and()
to do an element-by-element AND
operation on the arrays.
def thresh(im, block_size=51):
"""Combined adaptive and Otsu thresholding."""
thresh_otsu = skimage.filters.threshold_otsu(im)
im_bw = im > thresh_otsu
return np.logical_and(im > skimage.filters.threshold_local(im, block_size), im_bw)
# Threshold images
ims_bw = [thresh(im) for im in ims]
# Show the result
im_rgb = display_overlay(ims[40], ims_bw[40])
p = bootcamp_utils.bokeh_imshow(im_rgb,
interpixel_distance=0.0645,
length_units='µm')
bokeh.io.show(p)
Very nice!
d) To plot the growth curve, we can plot the total bacterial area on the $y$-axis (on a log scale), versus time on the $x$-axis. To get the total area, we need to compute the total number of "bacterial" pixels in a given image, multiplied by the pixel area, which is $(64.5\text{ nm})^2 = 4160.25\text{ nm}^2$. Also, according to the metadata, we have one frame every 15 minutes.
# Compute pixel area
pixel_area = 0.0645**2
# Get total bacterial area
bac_area = pixel_area * np.array([np.sum(im_bw) for im_bw in ims_bw])
# Get time in units of hours
t = 0.25 * np.arange(len(ims_bw))
# Plot the result
p = bokeh.plotting.figure(
height=250,
width=400,
y_axis_type='log',
x_axis_label='time (hours)',
y_axis_label='bacterial area (µm²)'
)
p.circle(
x=t,
y=bac_area
)
bokeh.io.show(p)
This looks pretty linear, suggesting that the bacteria are, indeed, performing exponential growth.
e) To perform the regression on the growth curve, we use np.polyfit()
with the x
values being time, and the y
values being the natural log of the bacterial area. We fit with a first order polynomial, which is a line.
# Use polyfit to get slope and intercept
slope, intercept = np.polyfit(t, np.log(bac_area), 1)
# Pull out and print parameters
print("""
m0 = {0:.2f} sq. µm
r = {1:.2f} 1/hours
""".format(np.exp(intercept), slope))
# Generate smooth curve
t_smooth = np.linspace(0, t.max(), 200)
y_smooth = np.exp(intercept + slope * t_smooth)
# Add to the plot
p.line(
x=t_smooth,
y=y_smooth,
color='orange',
line_width=2,
)
bokeh.io.show(p)
There is some systematic error, it appears, with the data being consistently above the curve at short times and then below for intermediate times. This could be due to thresholding, or due to small changes in conditions over time.
So far we have seen that in a single (very clean) image, we can get somewhere around 20 - 30 well-separated cells in a single 100$\times$ magnification phase contrast image. However, if you wish to report a mean fluorescence intensity for a single strain, you would certainly want more cells to have a good degree of confidence. Using the principles you learned above, your job will be to report a mean fluorescence value for the HG105 E. coli strain using all of the images located in data/HG105_images/
. To do this, you should do the following:
- Get a list of all of the image files in
data/HG105_images/
.- Separate them by phase contrast (for segmentation) and FITC (for measurement).
- Iterate through each image file and perform segmentation and fluorescence intensity extraction for each cell. These values should be stored in a NumPy array or Pandas
DataFrame
.- Plot an ECDF of all extracted fluorescence intensities and report a mean and standard deviation as well as the number of cells you successfully measured.
- Obtain 95% bootstrap confidence intervals for the mean and standard deviation of the fluorescence intensities.
As a reminder, the interpixel distance of these images is 0.0626 µm per pixel.
We already wrote cell segmentation functions in our image processing practice session. The call signature is
cell_segmenter(im, thresh='otsu', radius=20.0, image_mode='phase',
area_bounds=(0, 1e7), ecc_bounds=(0, 1))
We will include that code here so it is available. In practice, you would have this code in a module that you would have installed on your machine.
def cell_segmenter(
im,
thresh="otsu",
radius=20.0,
image_mode="phase",
area_bounds=(0, 1e7),
ecc_bounds=(0, 1),
):
"""
This function segments a given image via thresholding and returns
a labeled segmentation mask.
Parameters
----------
im : 2d-array
Image to be segmented. This may be of either float or integer
data type.
thresh : int, float, or 'otsu'
Value used during thresholding operation. This can either be a value
(`int` or `float`) or 'otsu'. If 'otsu', the threshold value will be
determined automatically using Otsu's thresholding method.
radius : float
Radius for gaussian blur for background subtraction. Default value
is 20.
image_mode : 'phase' or 'fluorescence'
Mode of microscopy used to capture the image. If 'phase', objects with
intensity values *lower* than the provided threshold will be selected.
If `fluorescence`, values *greater* than the provided threshold will be
selected. Default value is 'phase'.
area_bounds : tuple of ints.
Range of areas of acceptable objects. This should be provided in units
of square pixels.
ecc_bounds : tuple of floats
Range of eccentricity values of acceptable objects. These values should
range between 0.0 and 1.0.
Returns
-------
im_labeled : 2d-array, int
Labeled segmentation mask.
"""
# Apply a median filter to remove hot pixels.
med_selem = skimage.morphology.square(3)
im_filt = skimage.filters.median(im, selem=med_selem)
# Perform gaussian subtraction
im_sub = bg_subtract(im_filt, radius)
# Determine the thresholding method.
if thresh is "otsu":
thresh = skimage.filters.threshold_otsu(im_sub)
# Determine the image mode and apply threshold.
if image_mode is "phase":
im_thresh = im_sub < thresh
elif image_mode is "fluorescence":
im_thresh = im_sub > thresh
else:
raise ValueError(
"image mode not recognized. Must be 'phase'" + " or 'fluorescence'"
)
# Label the objects.
im_label = skimage.measure.label(im_thresh)
# Apply the area and eccentricity bounds.
im_filt = area_ecc_filter(im_label, area_bounds, ecc_bounds)
# Remove objects touching the border.
im_border = skimage.segmentation.clear_border(im_filt, buffer_size=5)
# Relabel the image.
im_border = im_border > 0
return skimage.measure.label(im_border)
def bg_subtract(im, radius):
"""
Subtracts a gaussian blurred image from itself smoothing uneven
illumination.
Parameters
----------
im : 2d-array
Image to be subtracted
radius : int or float
Radius of gaussian blur
Returns
-------
im_sub : 2d-array, float
Background subtracted image.
"""
# Apply the gaussian filter.
im_filt = skimage.filters.gaussian(im, radius)
# Ensure the original image is a float
if np.max(im) > 1.0:
im = skimage.img_as_float(im)
return im - im_filt
def area_ecc_filter(im, area_bounds, ecc_bounds):
"""
Filters objects in an image based on their areas.
Parameters
----------
im : 2d-array, int
Labeled segmentation mask to be filtered.
area_bounds : tuple of ints
Range of areas in which acceptable objects exist. This should be
provided in units of square pixels.
ecc_bounds : tuple of floats
Range of eccentricities in which acceptable objects exist. This should be
provided on the range of 0 to 1.0.
Returns
-------
im_relab : 2d-array, int
The relabeled, filtered image.
"""
# Extract the region props of the objects.
props = skimage.measure.regionprops(im, coordinates="rc")
# Extract the areas and labels.
areas = np.array([prop.area for prop in props])
eccs = np.array([prop.eccentricity for prop in props])
labels = np.array([prop.label for prop in props])
# Make an empty image to add the approved cells.
im_approved = np.zeros_like(im)
# Threshold the objects based on area and eccentricity
for i, _ in enumerate(areas):
if (
areas[i] > area_bounds[0]
and areas[i] < area_bounds[1]
and eccs[i] > ecc_bounds[0]
and eccs[i] < ecc_bounds[1]
):
im_approved += im == labels[i]
# Relabel the image.
return skimage.measure.label(im_approved > 0)
Now we need to figure out some way in which we can iterate over all of our images in data/HG105_images
. We have a few options here. First, and this is the worst solution, we could type out the name of each individual file and load them individually. The other option (a better option) is to make our computer get the list of all files in a single (or a few) lines of code. To do this, we will use the glob
module. glob
has a bunch of methods for obtaining filenames from directories. The most useful method of glob
is glob
which will return a list of paths that match a given pattern. Let's see how well it works for our needs.
# Glob the phase and fluo globs.
phase_glob = list(sorted(glob.glob('data/HG105_images/*phase*.tif')))
fluo_glob = list(sorted(glob.glob('data/HG105_images/*FITC*.tif')))
# Take a look.
print(phase_glob)
print(fluo_glob)
Well, that was easy. We used the wildcard character (*
) to find all files that had the pattern anything
then the strings phase
or FITC
followed by anything so long that it had a .tif
extension. The output is a little bit different as well in that it not only returned the name of the file in the specified data_dir
, but it gave us the entire relative path. This is an important distinction and is actually one of the reasons I frequently use glob.glob
.
Now that we have a list of files, we should iterate over each phase image, perform the segmentation, then extract the mean pixel intensities of each object and store them in a list. (We begin the cell with %%capture
to suppress excessive warnings.)
%%capture
# Instantiate an empty list for the mean pixel intensity of each cell.
mean_ints = []
# Do the same for the areas.
areas = []
# Image index
image_numbers = []
# Define our area and eccentricity bounds for the segmentation function.
ip_dist = 0.0626 # in units of µm per pixel
area_bounds = (0.5/ip_dist**2, 4/ip_dist**2)
ecc_bounds = (0.8, 1.0)
# Loop through all images.
for i, pf in enumerate(zip(phase_glob, fluo_glob)):
p, f = pf
# Load the phase image.
phase_im = skimage.io.imread(p)
# Perform the segmentation.
phase_seg = cell_segmenter(phase_im,
image_mode='phase',
area_bounds=area_bounds,
ecc_bounds=ecc_bounds)
# Load the fluorescence image.
fluo_im = skimage.io.imread(f)
# Compute the region properties.
props = skimage.measure.regionprops(phase_seg, intensity_image=fluo_im)
# Add them to our storage lists.
for prop in props:
mean_ints.append(prop.mean_intensity)
areas.append(prop.area * ip_dist**2)
image_numbers.append(i)
# Units don't matter, so normalize integrated intensity
mean_ints = np.array(mean_ints)
mean_ints /= mean_ints.max()
# Store results in data frame
df = pd.DataFrame({'mean intensity': mean_ints/max(mean_ints),
'area': areas,
'image number': image_numbers})
So, in just a few seconds, were were able to do what would take hours (if not a full day) to do in ImageJ by clicking. To make sure things are working as expected, let's take a look at the last segmentation mask generated. Remember that the array phase_seg
is the labeled image, so to get the binary image, we use phase_seg > 0
.
im_rgb = display_overlay(phase_im, phase_seg > 0)
p = bootcamp_utils.bokeh_imshow(im_rgb, interpixel_distance=ip_dist, length_units='µm')
bokeh.io.show(p)
That looks pretty great to me, let's see how many cells we were actually able to measure.
# Print the total number of cells.
print("Segmented and analyzed {num} cells!".format(num=len(df)))
Analyzing 92 cells in only 8 images in about 8 seconds is actually pretty good! That's certainly enough to do some serious boot strapping. Let's look at the ECDFs of the mean intensities and the areas.
p_mean_int = bokeh_catplot.ecdf(
data=df,
cats=None,
val='mean intensity',
conf_int=True,
height=200,
width=300,
x_axis_label='mean intensity (a.u.)'
)
p_area = bokeh_catplot.ecdf(
data=df,
cats=None,
val='area',
conf_int=True,
height=200,
width=300,
x_axis_label='area (µm²)'
)
bokeh.io.show(
bokeh.layouts.gridplot([p_mean_int, p_area], ncols=2)
)
The mean and standard deviation of the mean intensities (in arbitrary units) are directly computed.
df['mean intensity'].agg(['mean', 'std'])
Let's go ahead and perform 100,000 in silico replicates of this experiment to get a confidence interval on our mean and standard deviation of the mean fluorescence intensities.
def draw_bs_reps(data, func, size=1):
"""Draw bootstrap replicates"""
return np.array([func(np.random.choice(data, size=len(data))) for _ in range(size)])
# Draw bootstrap replicates of mean
bs_reps_mean = draw_bs_reps(mean_ints, func=np.mean, size=100000)
# Draw bootstrap replicates of standard deviation
bs_reps_std = draw_bs_reps(mean_ints, func=np.std, size=100000)
# Compute the 97.5% and 2.5% percentiles.
percs_mean = np.percentile(bs_reps_mean, [2.5, 97.5])
percs_std = np.percentile(bs_reps_std, [2.5, 97.5])
print('95% of our bootstrapped means lie between {0:.2f} and {1:.2f}.'.format(
percs_mean[0], percs_mean[1]))
print('95% of our bootstrapped st. devs. lie between {0:.2f} and {1:.2f}.'.format(
percs_std[0], percs_std[1]))
%load_ext watermark
%watermark -v -p numpy,pandas,skimage,bokeh,bokeh_catplot,bootcamp_utils,jupyterlab