(c) 2016 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This tutorial was generated from a Jupyter notebook. You can download the notebook here.
# NumPy, the engine of scientific computing
import numpy as np
We just got an introduction to NumPy and SciPy. These packages are extensive, and at their center is the NumPy array data type, which we will explore in this tutorial. As it is always more fun to work with a real biological application, we will populate our NumPy arrays with real data.
In their 2011 paper in PLoS ONE, Harvey and Orbidans measured the cross-sectional area of C. elegans eggs that came from mothers fed a high concentration of food and from mothers fed a low concentration of food. I digitized the data from their plots, and they are available in the files data/xa_high_food.csv and data/xa_low_food.csv in the bootcamp repository.
NumPy has a primitive function for loading in data from text files, np.loadtxt(). We will use it here, but we will quickly jettison it tomorrow in favor of the much more powerful functionality in Pandas. For now, we will load our two arrays using np.loadtxt(). Let's first look at the files.
!head data/xa_high_food.csv
The actual numbers are preceded by comments starting with a pound sign, which we can tell np.loadtxt() to ignore using its comments kwarg. Furthermore, we note that the cross-sectional areas are given in units of µm$^2$. Let's load in the data.
# Load in data
xa_high = np.loadtxt('data/xa_high_food.csv', comments='#')
xa_low = np.loadtxt('data/xa_low_food.csv', comments='#')
Let's take a look at these arrays.
xa_high
xa_low
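As a quick sanity check (an addition here, not in the original files), we can ask the arrays for their shape and data type.

# Check the shape and dtype of each array
print(xa_high.shape, xa_high.dtype)
print(xa_low.shape, xa_low.dtype)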
We saw in the previous tutorial that NumPy arrays are a special data type. They have well-defined ways in which our familiar operators work with them. Let's learn about this by example.
We'll start with multiplying an array by a constant. Say we wanted to convert the units of the cross-sectional area from µm$^2$ to mm$^2$. This means we have to divide every entry by 10$^6$ (or multiply by 10$^{-6}$). Multiplication by a scalar works elementwise on NumPy arrays. Check it out.
xa_high / 1e6
Notice that 1e6 is how we represent numbers in scientific notation in Python, and that dividing the NumPy array by this number resulted in every entry in the array being divided. The +, -, and * operators all work in this way. For example:
xa_high + 1000
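The other operators behave the same elementwise way. For example, multiplying by 1e-6 performs the same unit conversion as the division above (a quick check added for illustration).

# Multiplication by a scalar is also elementwise
xa_high * 1e-6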
Let's see what happens when we compare a NumPy array to a scalar.
xa_high < 2000
We get an array of Booleans! The comparison is elementwise. This is important to know because we cannot use these comparisons in an if clause.
if xa_high > 2000:
    print('Nothing to print, really. This will just be an error.')
Let's take the advice from the exception and use the .any() or .all() methods.
# Check if any values are bigger than 2000
(xa_high > 2000).any()
Remember, the expression (xa_high > 2000) is itself a NumPy array of Booleans. The any() method returns True if any of the entries in that array are True. Similarly, the all() method returns True if all entries in the array are True.
(xa_high > 2000).all()
Yup! At least one cross-sectional area is greater than 2000 µm$^2$, but not all of them.
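As an aside, NumPy also exposes these as functions, np.any() and np.all(); the function forms below are equivalent to the methods we just used.

# Function forms, equivalent to the methods above
np.any(xa_high > 2000), np.all(xa_high > 2000)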
Remember, you should never use the equality operator (==) with floats. Fortunately, NumPy offers a couple of nice functions to check if two numbers are almost equal. This helps deal with the numerical precision issues when comparing floats. The np.isclose() function checks to see if two numbers are close in value. It works elementwise for NumPy arrays.
# Compare two numbers
np.isclose(1.3, 1.29999999999)
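By default, np.isclose() uses a relative tolerance of 1e-5 and an absolute tolerance of 1e-8; both can be adjusted with the rtol and atol kwargs. For example, with tighter tolerances the two numbers above are no longer considered close.

# Tighter tolerances: the same two numbers are no longer "close"
np.isclose(1.3, 1.29999999999, rtol=1e-12, atol=0)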
# Compare an array to a scalar
np.isclose(xa_high, 1800)
A couple of the cross-sectional areas are close to 1800 µm$^2$. The np.allclose() function checks to see if all values in a NumPy array are close.
np.allclose(xa_high, 1800)
We can apply operators with two NumPy arrays. Let's give it a whirl. (This is meaningless in the context of the actual data contained in these arrays, but it's an operation we need to understand.)
xa_high + xa_low
Yikes! The exception tells us that the two arrays we are using the operator on need to have the same shape. This makes sense: if we are going to do element-by-element addition, the arrays better have the same number of elements. To continue with our operators on two arrays, we'll slice the longer NumPy array. The basic slicing syntax is the same as for strings, lists, and tuples.
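Indeed, the two arrays have different numbers of entries, which we can verify with a quick check (added here).

# The arrays have different lengths
print(len(xa_high), len(xa_low))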
# Just take the first len(xa_high) elements of xa_low
xa_low_slice = xa_low[:len(xa_high)]
Ok, let's try adding arrays again.
xa_high + xa_low_slice
Great! We get element-by-element addition. The same happens for the other operators we've discussed. np.isclose() also operates element-by-element.
np.isclose(xa_high, xa_low_slice)
We already saw that we can slice NumPy arrays like lists and tuples. Here are a few examples.
# Reversed array
xa_high[::-1]
# Every 5th element, starting at index 3
xa_high[3::5]
# Entries 10 to 20
xa_high[10:21]
NumPy arrays also allow fancy indexing, where we can slice out specific values. For example, say we wanted indices 1, 19, and 6 (in that order) from xa_high. We just index with a list of the indices we want.
xa_high[[1, 19, 6]]
Instead of a list, we could also use a NumPy array.
xa_high[np.array([1, 19, 6])]
As a very nice feature, we can slice with a NumPy array of Booleans, and we'll just get back the True values. So, say we only want the egg cross-sectional areas that are greater than 2000 µm$^2$.
# Just slice out the big ones
xa_high[xa_high > 2000]
If we want to know the indices where the values are high, we can use the np.where() function.
np.where(xa_high > 2000)
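The indices returned by np.where() can themselves be used to index the array, recovering the same values as the Boolean slicing above.

# Indexing with the result of np.where() gives the same values
xa_high[np.where(xa_high > 2000)]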
**NumPy arrays are mutable.**

The big bold words above say it all: we can change the values in a NumPy array in place. Let's look at some consequences.
# Make an array
my_ar = np.array([1, 2, 3, 4])
# Change an element
my_ar[2] = 6
# See the result
my_ar
Now, let's try attaching another variable to the NumPy array.
# Attach a new variable
my_ar2 = my_ar
# Set an entry using the new variable
my_ar2[3] = 9
# Does the original change? (yes.)
my_ar
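The assignment my_ar2 = my_ar did not make a new array; it just gave the same array a second name. We can verify this with Python's is operator.

# my_ar2 and my_ar are two names for the same object
my_ar2 is my_ar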
Let's see how messing with NumPy arrays inside functions affects things.
# Re-instantiate my_ar
my_ar = np.array([1, 2, 3, 4]).astype(float)
# Function to normalize x (note that /= works with mutable objects)
def normalize(x):
    x /= np.sum(x)
# Pass it through a function
normalize(my_ar)
# Is it normalized? (Yes.)
my_ar
So, be careful when writing functions. What you do to your NumPy array inside the function will happen outside of the function as well.
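If you want a function that leaves its argument untouched, one option (a sketch, not from the original tutorial; the function name is my own) is to return a new, normalized array instead of modifying the input in place.

# A version of normalize() that does not modify its input
def normalize_out_of_place(x):
    """Return a normalized copy of x, leaving x unchanged."""
    return x / np.sum(x)

my_ar = np.array([1, 2, 3, 4]).astype(float)
normalized = normalize_out_of_place(my_ar)

# my_ar is unchanged
my_ar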
A very important distinction between NumPy arrays and lists is that slices of NumPy arrays are views into the original NumPy array, NOT copies.
# Make list and array
my_list = [1, 2, 3, 4]
my_ar = np.array(my_list)
# Slice out of each
my_list_slice = my_list[1:-1]
my_ar_slice = my_ar[1:-1]
# Mess with the slices
my_list_slice[0] = 9
my_ar_slice[0] = 9
# Look at originals
print(my_list)
print(my_ar)
Messing with an element of a slice of a NumPy array messes with that element in the original! This is not the case with lists. So, let's issue a warning: if you change a slice of a NumPy array, you change the original array as well.
Fortunately, you can make a copy of an array using the np.copy() function.
# Make a copy
xa_high_copy = np.copy(xa_high)
# Mess with an entry
xa_high_copy[10] = 2000
# Check equality
np.allclose(xa_high, xa_high_copy)
So, messing with an entry in the copy did not affect the original.
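Relatedly (a note added here), NumPy arrays also have a .copy() method, so you can copy a slice directly if you want an independent chunk of an array.

# Copy a slice; modifying the copy leaves the original alone
xa_slice_copy = xa_high[1:-1].copy()
xa_slice_copy[0] = 0.0

# The corresponding entry of the original is unchanged
xa_high[1]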
NumPy arrays need not be one-dimensional. We'll create a two-dimensional NumPy array by reshaping our xa_high array from having shape (44,) to having shape (11, 4). That is, it will become an array with 11 rows and 4 columns.
# New 2D array using the reshape() method
my_ar = xa_high.reshape((11, 4))
# Look at it
my_ar
Notice that it is represented as an array made out of a list of lists. If we had a list of lists, we would index it like this: list_of_lists[i][j].
# Make list of lists
list_of_lists = [[1, 2], [3, 4]]
# Pull out value in first row, second column
list_of_lists[0][1]
This is not how NumPy arrays are indexed. They are indexed much more conveniently.
my_ar[0, 1]
We essentially have a tuple in the indexing brackets. Now, say we wanted the row with index 2 (remembering that indexing starts at zero).
my_ar[2, :]
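Slicing out a column works the same way. For example, to get every row of column 1 (an example added for completeness):

# All rows, column 1
my_ar[:, 1]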
We can use Boolean indexing as before.
my_ar[my_ar > 2000]
Note that this gives a one-dimensional array of the entries greater than 2000. If we wanted the indices where this is the case, we can again use np.where().
np.where(my_ar > 2000)
This tuple of NumPy arrays is exactly what we would use to pull those values out with fancy indexing.
my_ar[(np.array([0, 1, 8, 10, 10]), np.array([1, 0, 1, 0, 2]))]
NumPy arrays can have an arbitrary number of dimensions, and these principles extend to 3D, 4D, etc., arrays.
Let's say we want to study all cross-sectional areas and don't care if the mother was well-fed or not. We would want to concatenate our arrays. The np.concatenate() function accomplishes this. We simply have to pass it a tuple containing the NumPy arrays we want to concatenate.
combined = np.concatenate((xa_high, xa_low))
# Look at it
combined
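As a quick check (added here), the combined array should have as many entries as the two inputs together.

# The lengths add up
len(combined) == len(xa_high) + len(xa_low)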
Finally, let's use our data to compare the cross-sectional areas of the eggs of mothers eating high- versus low-density food. We can use some handy NumPy functions to do it.
# Compute 25, 50, and 75 percentiles
high_perc = np.percentile(xa_high, (25, 50, 75))
low_perc = np.percentile(xa_low, (25, 50, 75))
# Print result
print("""
25 median 75
high {0:d} {1:d} {2:d}
low {3:d} {4:d} {5:d}
""".format(*(tuple(high_perc.astype(int)) + tuple(low_perc.astype(int)))))
So, we see that the eggs from mothers eating lower-density food have larger cross-sectional areas.
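If you want a single summary number per condition, you could also compare means (a simple extension of the original analysis).

# Mean cross-sectional area for each condition
np.mean(xa_high), np.mean(xa_low)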