(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This lesson was generated from a Jupyter notebook. You can download the notebook here.
import numpy as np
import scipy.stats
import pandas as pd
import altair as alt
import bootcamp_utils
Numpy arrays can take a while to get the hang of. Therefore, it's important to practice practice practice!
In the last lesson, we looked at a data set from Harvey and Orbidans on the cross-sectional area of C. elegans eggs. Recall, we loaded the data and converted everything to Numpy arrays like this:
df = pd.read_csv('data/c_elegans_egg_xa.csv', comment='#')
df = df.rename(columns={'area (sq. um)': 'area (sq um)'})
xa_high = df.loc[df['food']=='high', 'area (sq um)'].values
xa_low = df.loc[df['food']=='low', 'area (sq um)'].values
Now we would like to compute the diameter of the egg from the cross-sectional area. Write a function that takes in an array of cross-sectional areas and returns an array of diameters. Recall that the diameter $d$ and cross-sectional area $A$ are related by $A = \pi d^2/4$. There should be no for
loops in your function!
Below, is a skeleton for your function for you to fill in.
def xa_to_diameter(xa):
"""
Convert an array of cross-sectional areas
to diameters with commensurate units.
"""
# Compute diameter from area
diameter = ____
return diameter
Use your function to compute the diameters of the eggs.
def xa_to_diameter(xa):
"""
Convert an array of cross-sectional areas
to diameters with commensurate units.
"""
# Compute diameter from area
diameter = 2 * np.sqrt(xa / np.pi)
return diameter
print('Diameters of eggs from well fed mothers:\n', xa_to_diameter(xa_high))
print('\nDiameters of eggs from poorly fed mothers:\n', xa_to_diameter(xa_low))
Numpy enables you do to matrix calculations on two-dimensional arrays. In exercise, you will practice doing matrix calculations on arrays. We'll start by making a matrix and a vector to practice with. You can copy and paste the code below.
A = np.array([[6.7, 1.3, 0.6, 0.7],
[0.1, 5.5, 0.4, 2.4],
[1.1, 0.8, 4.5, 1.7],
[0.0, 1.5, 3.4, 7.5]])
b = np.array([1.1, 2.3, 3.3, 3.9])
a) First, let's practice slicing.
A
.A
.A
that is greater than 2.A
. using the np.diag()
function.# 1.
print(A[1,:])
# 2.
print(A[:,[1,3]])
# 3.
print(A[A>2])
# 4.
print(np.diag(A))
b) The np.linalg
module has some powerful linear algebra tools.
np.linalg.solve()
. Store your answer in the Numpy array x
.np.dot(A, x)
to verify that $\mathsf{A}\cdot \mathbf{x} = \mathbf{b}$.np.transpose()
to compute the transpose of A
.np.linalg.inv()
to compute the inverse of A
.# 1.
x = np.linalg.solve(A, b)
# 2.
print('A . x:\n', np.dot(A, x))
print('\nb:\n', b)
# 3.
print('\ntranspose of A:\n', np.transpose(A))
# 4.
print('\ninverse of A:\n', np.linalg.inv(A))
c) Sometimes you want to convert a two-dimensional array to a one-dimensional array. This can be done with np.ravel()
.
B = np.ravel(A)
.np.reshape()
. Then, reshape B
to make it look like A
again.# 1.
B = np.ravel(A)
print(B)
# 2.
np.reshape(B, A.shape)
We might be interested to see if the egg cross-section data follow a Normal distribution. After all, this is commonly an underlying assumption when people report data from repeated measurements in the literature.
One way to assess this is to plot the theoretical CDF with the same mean and standard deviation as the data on top of the ECDFs. (There are better graphical ways to do this, but this is ok for our purposes here.) We know the cumulative distribution function for a Normal distribution with mean $\mu$ and standard deviation $\sigma$ is
\begin{align} \mathrm{cdf}(x) = \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{x - \mu}{\sqrt{2\sigma^2}}\right)\right), \end{align}
but instead of coding this up directly, we can use the scipy.stats
to do it for us! We just need to supply where we want the CDF evaluated ($x$), and the mean (the location parameter) and standard deviation (the scale parameter). Something like this:
# Make smooth x-values
x = np.linspace(1600, 2500, 400)
# Compute theoretical Normal distribution
cdf_theor = scipy.stats.norm.cdf(x, loc=np.mean(xa_low), scale=np.std(xa_low))
Now, let's make the plot.
- Generate a data frame that you can use to plot the ECDFs of egg cross-sectional area for worms with high food and low food. I would prefer to plot it with the "dot" style, but it is up to you.
- Make smooth curves of the Normal CDF using
scipy.stats.norm.cdf()
and place them in a data frame suitable for plotting.- Make plots of the smooth curves and the data.
# 1. Generate ECDFs
grouped = df.groupby('food')
df['ECDF'] = grouped.transform(bootcamp_utils.ecdf_y)
# 2. Generate a data fram with smooth curves
df_summary = grouped.agg([np.mean, np.std])
x = np.linspace(1600, 2500, 400)
df_list = []
for key, g in grouped['area (sq um)']:
cdf = scipy.stats.norm.cdf(x, loc=g.mean(), scale=g.std())
df_list.append(pd.DataFrame(data={'x': x,
'CDF': cdf,
'food': key}))
df_smooth = pd.concat(df_list, ignore_index=True)
# 3. Make plots
smooth = alt.Chart(df_smooth
).mark_line(
color='gray'
).encode(
x=alt.X('x:Q', title='egg cross sectional area (µm²)'),
y='CDF:Q',
color=alt.Color('food:N', legend=None),
order='x:Q')
dots = alt.Chart(df
).mark_point(
).encode(
x=alt.X('area (sq um):Q', title='egg cross sectional area (µm²)'),
y=alt.Y('ECDF:Q', title='CDF'),
color=alt.Color('food:N'))
smooth + dots
It is important to note that I didn't just memorize how all of these functions work when I wrote these practice exercises. I looked at the online documentation. For example, I looked at the alt.Color
documentation, and the scipy.stats.norm
documentation. To find those links, I just Googled "Altair color legend" and "scipy.stats
".
These packages are all very well documented, and those docs will be your guide. You don't need to memorize (though you will eventually just by accident).