Lesson 29: Practice with Numpy

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This lesson was generated from a Jupyter notebook. You can download the notebook here.



In [1]:
import numpy as np
import scipy.stats
import pandas as pd

import altair as alt

import bootcamp_utils

Numpy arrays can take a while to get the hang of. Therefore, it's important to practice practice practice!

Practice 1: Computing things!

In a previous lesson, we looked at a data set from Harvey and Orbidans on the cross-sectional area of C. elegans eggs. Recall, we loaded the data and converted everything to Numpy arrays like this:

In [2]:
df = pd.read_csv('data/c_elegans_egg_xa.csv', comment='#')
df = df.rename(columns={'area (sq. um)': 'area (sq um)'})

xa_high = df.loc[df['food']=='high', 'area (sq um)'].values
xa_low = df.loc[df['food']=='low', 'area (sq um)'].values

Now we would like to compute the diameter of the egg from the cross-sectional area. Write a function that takes in an array of cross-sectional areas and returns an array of diameters. Recall that the diameter $d$ and cross-sectional area $A$ are related by $A = \pi d^2/4$. There should be no for loops in your function!

Below, is a skeleton for your function for you to fill in.

In [3]:
def xa_to_diameter(xa):
    """
    Convert an array of cross-sectional areas
    to diameters with commensurate units.
    """
    
    # Compute diameter from area
    diameter = ____
    
    return diameter

Use your function to compute the diameters of the eggs.


Practice 2: Working with two-dimensional arrays

Numpy enables you do to matrix calculations on two-dimensional arrays. In exercise, you will practice doing matrix calculations on arrays. We'll start by making a matrix and a vector to practice with. You can copy and paste the code below.

In [4]:
A = np.array([[6.7, 1.3, 0.6, 0.7],
              [0.1, 5.5, 0.4, 2.4],
              [1.1, 0.8, 4.5, 1.7],
              [0.0, 1.5, 3.4, 7.5]])

b = np.array([1.1, 2.3, 3.3, 3.9])

a) First, let's practice slicing.

  1. Print row 1 (remember, indexing starts at zero) of A.
  2. Print columns 1 and 3 of A.
  3. Print the values of every entry in A that is greater than 2.
  4. Print the diagonal of A. using the np.diag() function.

b) The np.linalg module has some powerful linear algebra tools.

  1. First, we'll solve the linear system $\mathsf{A}\cdot \mathbf{x} = \mathbf{b}$. Try it out: use np.linalg.solve(). Store your answer in the Numpy array x.
  2. Now do np.dot(A, x) to verify that $\mathsf{A}\cdot \mathbf{x} = \mathbf{b}$.
  3. Use np.transpose() to compute the transpose of A.
  4. Use np.linalg.inv() to compute the inverse of A.

c) Sometimes you want to convert a two-dimensional array to a one-dimensional array. This can be done with np.ravel().

  1. See what happens when you do B = np.ravel(A).
  2. Look of the documentation for np.reshape(). Then, reshape B to make it look like A again.


Practice 3: Are they Normally distributed?

We might be interested to see if the egg cross-section data follow a Normal distribution. After all, this is commonly an underlying assumption when people report data from repeated measurements in the literature.

One way to assess this is to plot the theoretical CDF with the same mean and standard deviation as the data on top of the ECDFs. (There are better graphical ways to do this, but this is ok for our purposes here.) We know the cumulative distribution function for a Normal distribution with mean $\mu$ and standard deviation $\sigma$ is

\begin{align} \mathrm{cdf}(x) = \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{x - \mu}{\sqrt{2\sigma^2}}\right)\right), \end{align}

but instead of coding this up directly, we can use the scipy.stats to do it for us! We just need to supply where we want the CDF evaluated ($x$), and the mean (the location parameter) and standard deviation (the scale parameter). Something like this:

In [5]:
# Make smooth x-values
x = np.linspace(1600, 2500, 400)

# Compute theoretical Normal distribution
cdf_theor = scipy.stats.norm.cdf(x, loc=np.mean(xa_low), scale=np.std(xa_low))

Now, let's make the plot.

  1. Generate a data frame that you can use to plot the ECDFs of egg cross-sectional area for worms with high food and low food. I would prefer to plot it with the "dot" style, but it is up to you.
  2. Make smooth curves of the Normal CDF using scipy.stats.norm.cdf() and place them in a data frame suitable for plotting.
  3. Make plots of the smooth curves and the data.


A reminder about documentation

It is important to note that I didn't just memorize how all of these functions work when I wrote these practice exercises. I looked at the online documentation. For example, I looked at the alt.Color documentation, and the scipy.stats.norm documentation. To find those links, I just Googled "Altair color legend" and "scipy.stats".

These packages are all very well documented, and those docs will be your guide. You don't need to memorize (though you will eventually just by accident).