Lesson 29: Practice with Numpy

(c) 2019 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This lesson was generated from a Jupyter notebook. You can download the notebook here.

import numpy as np
import pandas as pd

Numpy arrays can take a while to get the hang of, so it's important to practice, practice, practice!

Practice 1: Computing things!

In the last lesson, we looked at a data set from Harvey and Orbidans on the cross-sectional area of C. elegans eggs. Recall, we loaded the data and converted everything to Numpy arrays like this:

df = pd.read_csv('data/c_elegans_egg_xa.csv', comment='#')

xa_high = df.loc[df['food']=='high', 'area (sq. um)'].values
xa_low = df.loc[df['food']=='low', 'area (sq. um)'].values

Now we would like to compute the diameter of the egg from the cross-sectional area. Write a function that takes in an array of cross-sectional areas and returns an array of diameters. Recall that the diameter $d$ and cross-sectional area $A$ are related by $A = \pi d^2/4$. There should be no for loops in your function! The call signature is

xa_to_diameter(xa)

Use your function to compute the diameters of the eggs.
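
One possible approach, shown only as a minimal sketch: solving $A = \pi d^2/4$ for the diameter gives $d = \sqrt{4A/\pi}$, and `np.sqrt()` works elementwise, so no loop is needed. The array `xa_example` holds made-up values for illustration so the snippet runs without the data file; in the lesson you would pass `xa_high` and `xa_low` instead.

```python
import numpy as np


def xa_to_diameter(xa):
    """Convert cross-sectional areas to diameters, assuming circular cross sections."""
    # Invert A = pi * d**2 / 4 for d. np.sqrt acts elementwise on the array.
    return np.sqrt(4 * np.asarray(xa) / np.pi)


# Made-up areas (sq. um) for illustration only; these are not the real measurements
xa_example = np.array([1500.0, 1800.0, 2100.0])
d_example = xa_to_diameter(xa_example)

print(d_example)
```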

Practice 2: Working with two-dimensional arrays

Numpy enables you to do matrix calculations on two-dimensional arrays. In this exercise, you will practice doing matrix calculations on arrays. We'll start by making a matrix and a vector to practice with. You can copy and paste the code below.

A = np.array([[6.7, 1.3, 0.6, 0.7],
              [0.1, 5.5, 0.4, 2.4],
              [1.1, 0.8, 4.5, 1.7],
              [0.0, 1.5, 3.4, 7.5]])

b = np.array([1.1, 2.3, 3.3, 3.9])

a) First, let's practice slicing.

Print row 1 (remember, indexing starts at zero) of A.
Print columns 1 and 3 of A.
Print the values of every entry in A that is greater than 2.
Print the diagonal of A using the np.diag() function.
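
The four slicing tasks above could be sketched as follows. This is one way to do it, not the only one; the variable names (`row_1`, `cols_1_and_3`, and so on) are chosen here for illustration.

```python
import numpy as np

A = np.array([[6.7, 1.3, 0.6, 0.7],
              [0.1, 5.5, 0.4, 2.4],
              [1.1, 0.8, 4.5, 1.7],
              [0.0, 1.5, 3.4, 7.5]])

# Row 1 (the second row, since indexing starts at zero)
row_1 = A[1, :]

# Columns 1 and 3, selected with a list of column indices
cols_1_and_3 = A[:, [1, 3]]

# Boolean indexing: a flat array of the entries greater than 2
greater_than_2 = A[A > 2]

# Diagonal entries
diagonal = np.diag(A)

print(row_1)
print(cols_1_and_3)
print(greater_than_2)
print(diagonal)
```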

b) The np.linalg module has some powerful linear algebra tools.

First, we'll solve the linear system $\mathsf{A}\cdot \mathbf{x} = \mathbf{b}$. Try it out: use np.linalg.solve(). Store your answer in the Numpy array x.
Now do np.dot(A, x) to verify that $\mathsf{A}\cdot \mathbf{x} = \mathbf{b}$.
Use np.transpose() to compute the transpose of A.
Use np.linalg.inv() to compute the inverse of A.
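
A minimal sketch of these linear algebra steps, using only the functions named above plus `np.allclose()` for the comparison (an addition here, because floating-point results rarely match exactly). The names `Ax`, `A_T`, and `A_inv` are illustrative.

```python
import numpy as np

A = np.array([[6.7, 1.3, 0.6, 0.7],
              [0.1, 5.5, 0.4, 2.4],
              [1.1, 0.8, 4.5, 1.7],
              [0.0, 1.5, 3.4, 7.5]])

b = np.array([1.1, 2.3, 3.3, 3.9])

# Solve A . x = b for x
x = np.linalg.solve(A, b)

# Verify the solution; compare with a tolerance rather than with ==
Ax = np.dot(A, x)
print(np.allclose(Ax, b))

# Transpose and inverse of A
A_T = np.transpose(A)
A_inv = np.linalg.inv(A)

# The product of A and its inverse should be close to the identity matrix
print(np.allclose(np.dot(A, A_inv), np.eye(4)))
```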

c) Sometimes you want to convert a two-dimensional array to a one-dimensional array. This can be done with np.ravel().

See what happens when you do B = np.ravel(A).
Look up the documentation for np.reshape(). Then, reshape B to make it look like A again.
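
As a sketch of what to expect: `np.ravel()` flattens the array row by row (its default ordering), and `np.reshape()` with the original shape restores it. The name `A_again` is illustrative.

```python
import numpy as np

A = np.array([[6.7, 1.3, 0.6, 0.7],
              [0.1, 5.5, 0.4, 2.4],
              [1.1, 0.8, 4.5, 1.7],
              [0.0, 1.5, 3.4, 7.5]])

# Flatten to one dimension; rows are laid out one after another
B = np.ravel(A)
print(B.shape)

# Reshape back to the shape of A
A_again = np.reshape(B, A.shape)
print(np.array_equal(A_again, A))
```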

Practice 3: Understanding and building ECDFs

Write a function with call signature

ecdf_vals(data)

which takes a one-dimensional NumPy array (or Pandas Series; the same construction of your function will work for both) of data and returns the x and y values for plotting a "dot-style" ECDF. That is, each dot has a y value given by the ECDF evaluated at x. As a reminder,

ECDF(x) = fraction of data points ≤ x.

When writing a function, you should generally include a detailed docstring and check the inputs. Here, however, the focus is on your understanding of ECDFs and on developing skills with NumPy, so you do not need a very descriptive docstring or extensive input checking.

Hint: The functions np.sort() and np.arange() may be useful.
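
Following the hint, one possible sketch sorts the data to get the x values and uses `np.arange()` to assign the y values 1/n, 2/n, ..., 1. The small array in the usage lines is made up for illustration.

```python
import numpy as np


def ecdf_vals(data):
    """Return x and y values for plotting a dot-style ECDF."""
    # Sorted data are the x values; np.sort returns a sorted copy
    x = np.sort(data)

    # The i-th sorted point (counting from 1) has i/n of the data at or below it
    y = np.arange(1, len(x) + 1) / len(x)

    return x, y


# Made-up data for illustration only
x, y = ecdf_vals(np.array([3.0, 1.0, 2.0, 5.0]))
print(x)
print(y)
```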

Practice 4: Are they Normally distributed?

We might be interested to see if the egg cross-section data follow a Normal distribution. After all, this is commonly an underlying assumption when people report data from repeated measurements in the literature.

One way to assess this is to plot the theoretical CDF with the same mean and standard deviation as the data on the same plot as the ECDFs. (There are better graphical ways to do this, but this is ok for our purposes here.) We know the cumulative distribution function for a Normal distribution with mean $\mu$ and standard deviation $\sigma$ is

\begin{align} \mathrm{cdf}(x) = \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{x - \mu}{\sqrt{2\sigma^2}}\right)\right), \end{align}

but instead of coding this up directly, we can use the scipy.stats module to do it for us! We just need to supply where we want the CDF evaluated ($x$), and the mean (the location parameter) and standard deviation (the scale parameter).
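
For comparison only, the formula above can be coded directly. The sketch below uses `math.erf` from the standard library, wrapped with `np.vectorize()` so it accepts arrays; the function name `normal_cdf` is hypothetical. It deliberately does not use scipy.stats, so finding the corresponding function there remains part of the exercise, and this version can serve as a check on your result.

```python
import math

import numpy as np


def normal_cdf(x, mu, sigma):
    """Normal CDF from the erf formula; a reference sketch, not the scipy.stats way."""
    # Argument of the error function: (x - mu) / sqrt(2 sigma^2)
    z = (np.asarray(x, dtype=float) - mu) / np.sqrt(2 * sigma**2)

    # math.erf takes scalars, so vectorize it to apply it elementwise
    return 0.5 * (1 + np.vectorize(math.erf)(z))


# Evaluate on a few points for a standard Normal (mu = 0, sigma = 1)
cdf_example = normal_cdf(np.array([-1.0, 0.0, 1.0]), 0.0, 1.0)
print(cdf_example)
```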

Now, let's make the plot.

Compute smooth curves for Normal CDFs. Use the scipy.stats module to make the curves. I am intentionally not telling you how to do this nor giving you a link to the docs. This will help you practice using Google to figure out how to use these tools.
Overlay the ECDFs of the cross-sectional areas to give a graphical evaluation of the Normality of the data.

Computing environment

%load_ext watermark
%watermark -v -p numpy,pandas,jupyterlab

CPython 3.7.3
IPython 7.1.1

numpy 1.16.4
pandas 0.24.2
jupyterlab 0.35.5