(c) 2019 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This lesson was generated from a Jupyter notebook. You can download the notebook here.
import numpy as np
import scipy.stats
import pandas as pd
import bokeh_catplot
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
Numpy arrays can take a while to get the hang of. Therefore, it's important to practice practice practice!
In the last lesson, we looked at a data set from Harvey and Orbidans on the cross-sectional area of C. elegans eggs. Recall, we loaded the data and converted everything to Numpy arrays like this:
df = pd.read_csv('data/c_elegans_egg_xa.csv', comment='#')
xa_high = df.loc[df['food']=='high', 'area (sq. um)'].values
xa_low = df.loc[df['food']=='low', 'area (sq. um)'].values
Now we would like to compute the diameter of the egg from the cross-sectional area. Write a function that takes in an array of cross-sectional areas and returns an array of diameters. Recall that the diameter $d$ and cross-sectional area $A$ are related by $A = \pi d^2/4$. There should be no for
loops in your function! The call signature is
xa_to_diameter(xa)
Use your function to compute the diameters of the eggs.
def xa_to_diameter(xa):
"""
Convert an array of cross-sectional areas
to diameters with commensurate units.
"""
# Compute diameter from area
diameter = 2 * np.sqrt(xa / np.pi)
return diameter
print('Diameters of eggs from well fed mothers:\n', xa_to_diameter(xa_high))
print('\nDiameters of eggs from poorly fed mothers:\n', xa_to_diameter(xa_low))
Numpy enables you do to matrix calculations on two-dimensional arrays. In exercise, you will practice doing matrix calculations on arrays. We'll start by making a matrix and a vector to practice with. You can copy and paste the code below.
A = np.array([[6.7, 1.3, 0.6, 0.7],
[0.1, 5.5, 0.4, 2.4],
[1.1, 0.8, 4.5, 1.7],
[0.0, 1.5, 3.4, 7.5]])
b = np.array([1.1, 2.3, 3.3, 3.9])
a) First, let's practice slicing.
A
.A
.A
that is greater than 2.A
. using the np.diag()
function.# 1.
print(A[1,:])
# 2.
print(A[:,[1,3]])
# 3.
print(A[A>2])
# 4.
print(np.diag(A))
b) The np.linalg
module has some powerful linear algebra tools.
np.linalg.solve()
. Store your answer in the Numpy array x
.np.dot(A, x)
to verify that $\mathsf{A}\cdot \mathbf{x} = \mathbf{b}$.np.transpose()
to compute the transpose of A
.np.linalg.inv()
to compute the inverse of A
.# 1.
x = np.linalg.solve(A, b)
# 2.
print('A . x:\n', np.dot(A, x))
print('\nb:\n', b)
# 3.
print('\ntranspose of A:\n', np.transpose(A))
# 4.
print('\ninverse of A:\n', np.linalg.inv(A))
c) Sometimes you want to convert a two-dimensional array to a one-dimensional array. This can be done with np.ravel()
.
B = np.ravel(A)
.np.reshape()
. Then, reshape B
to make it look like A
again.# 1.
B = np.ravel(A)
print(B)
# 2.
np.reshape(B, A.shape)
Write a function with call signature
ecdf_vals(data)
which takes a one-dimensional NumPy array (or Pandas Series; the same construction of your function will work for both) of data and returns the x
and y
values for plotting a "dot-style" ECDF. That is, each dot has a y
value given by the ECDF evaluated at x
. As a reminder,
ECDF(x) = fraction of data points ≤ x.
When writing a function, you should have detailed doc strings and checking of input. However, here the focus is on your understanding of ECDFs and developing skills with NumPy, so you do not need to have a very descriptive doc string nor lots of input checking.
Hint: The functions np.sort()
and np.arange()
may be useful.
def ecdf_vals(data):
"""Return x and y values for an ECDF."""
return np.sort(data), np.arange(1, len(data)+1) / len(data)
We might be interested to see if the egg cross-section data follow a Normal distribution. After all, this is commonly an underlying assumption when people report data from repeated measurements in the literature.
One way to assess this is to plot the theoretical CDF with the same mean and standard deviation as the data on the same plot as the ECDFs. (There are better graphical ways to do this, but this is ok for our purposes here.) We know the cumulative distribution function for a Normal distribution with mean $\mu$ and standard deviation $\sigma$ is
\begin{align} \mathrm{cdf}(x) = \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{x - \mu}{\sqrt{2\sigma^2}}\right)\right), \end{align}but instead of coding this up directly, we can use the scipy.stats
module to do it for us! We just need to supply where we want the CDF evaluated ($x$), and the mean (the location parameter) and standard deviation (the scale parameter).
Now, let's make the plot.
scipy.stats
module to make the curves. I am intentionally not telling you how to do this nor giving you a link to the docs. This will help you practice using Google to figure out how to use these tools.# Make smooth x-values
x = np.linspace(1500, 2500, 400)
# Compute theoretical Normal distributions
cdf_low = scipy.stats.norm.cdf(x, loc=np.mean(xa_low), scale=np.std(xa_low))
cdf_high = scipy.stats.norm.cdf(x, loc=np.mean(xa_high), scale=np.std(xa_high))
# Color palette
palette = bokeh.palettes.d3['Category10'][10]
# Make a plot of the theoretical distributions
p = bokeh.plotting.figure(
height=250,
width=350,
x_axis_label='cross sectional area (µm²)',
y_axis_label='CDF',
)
p.line(
x=x,
y=cdf_low,
color=palette[0],
line_width=2,
)
p.line(
x=x,
y=cdf_high,
color=palette[1],
line_width=2,
)
# Overlay ECDFs
p = bokeh_catplot.ecdf(
data=df,
cats='food',
val='area (sq. um)',
order=['low', 'high'],
palette=palette,
p=p,
)
p.legend.title = 'food level'
bokeh.io.show(p)
It is important to note that I didn't just memorize how all of these functions work when I wrote these practice exercises. I looked at the online documentation. For example, I looked at the scipy.stats.norm
documentation. To find those links, I just Googled "scipy.stats normal".
These packages are all very well documented, and those docs will be your guide. You don't need to memorize (though you will eventually just by accident).
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,bokeh,bokeh_catplot,jupyterlab