Lesson 20: Introduction to NumPy and the SciPy stack

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

In [2]:
# NumPy, the engine of scientific computing
import numpy as np

# We'll demo a couple SciPy modules
import scipy.special

We have worked with packages from the standard library (such as re and os) and from outside the standard library (pytest). In this lesson, we will learn about NumPy, arguably the most important package for scientific computing. NumPy is part of the SciPy stack, a collection of packages for doing various scientific computations. Most of the SciPy stack is built on NumPy and its powerful ndarray data type. We will devote the entire next lesson to ndarrays, but for now, we will take a brief tour through what NumPy and SciPy offer.

The SciPy stack

The SciPy stack consists of the following core packages.

Package      Description
-----------  ----------------------------------------------------------------
NumPy        Base routines and classes for handling data and numerical calculation
SciPy        Expansive set of tools for scientific calculation
matplotlib   Plotting library
IPython      The interactive Python console you already know and love
Pandas       A powerful data analysis library
SymPy        Symbolic mathematics library

With the exception of SymPy, we will use all of these extensively in the next few days. In this lesson, we will show some of the basic features of NumPy and SciPy.

Importantly, everything in the SciPy stack is BSD licensed. This is a very permissive license, which basically means you can do whatever you want when using the code. You just need to include the license if you redistribute it.

Outside of the SciPy stack are important projects that either are too specific to be included in the monolithic SciPy stack, or are still under active development. Importantly, this includes scikit-image, a package for image processing, which we will use later in the course. The new scikit-bio is still in the nascent stages of development and promises to be excellent. We will not use it in the course, since it currently does not run on Windows. We will instead use Biopython, which is a mature package for parsing data in a bioinformatics context.

A very brief introduction to NumPy arrays

The central object for NumPy and SciPy is the ndarray, commonly referred to as a "NumPy array." This is an array object that is convenient for scientific computing. We will go over it in depth in the next lesson, but for now, let's just create some NumPy arrays and see how operators work on them.

Just like with type conversions with lists, tuples, and other data types we've looked at, we can convert a list to a NumPy array using

np.array()

Note that above we imported the NumPy package "as np". This is for convenience; it allows us to use np as a prefix instead of numpy. NumPy is in very widespread use, and the convention is to use the np abbreviation.

In [4]:
# Create a NumPy array from a list
my_ar = np.array([1, 2, 3, 4])

# Look at it
my_ar
Out[4]:
array([1, 2, 3, 4])

We see that the list has been converted and it is explicitly shown as an array. It has several attributes and lots of methods. The most important attributes are probably the data type of its elements and the shape of the array.

In [9]:
# The data type of stored entries
my_ar.dtype
Out[9]:
dtype('int64')
In [11]:
# The shape of the array
my_ar.shape
Out[11]:
(4,)
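As a quick sketch of how these attributes interact with array creation: np.array() accepts a dtype keyword argument, so you can set the data type of the elements explicitly rather than letting NumPy infer it from the list.

```python
import numpy as np

# Create an array of floats from a list of ints by specifying dtype
float_ar = np.array([1, 2, 3, 4], dtype=float)

# The elements are now double precision floats
print(float_ar.dtype)   # float64
print(float_ar.shape)   # (4,)
```
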

There are also lots of methods. Here are just a few.

In [13]:
print(my_ar.max())
print(my_ar.min())
print(my_ar.sum())
print(my_ar.mean())
print(my_ar.std())
4
1
10
2.5
1.11803398875

Importantly, NumPy arrays can be arguments to NumPy functions. The functions are intuitively applied, often element by element, to the arrays.
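The arithmetic operators work the same way: applied to a NumPy array, they act element by element. A small sketch:

```python
import numpy as np

my_ar = np.array([1, 2, 3, 4])

# Operators are applied element by element
print(my_ar + 2)        # [3 4 5 6]
print(my_ar * my_ar)    # [ 1  4  9 16]
```
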

Other ways to make NumPy arrays

There are many other ways to make NumPy arrays besides just converting lists or tuples. Below are some examples.

In [3]:
# How long our arrays will be
n = 10

# Make a NumPy array of length n filled with zeros
np.zeros(n)
Out[3]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
In [4]:
# Make a NumPy array of length n filled with ones
np.ones(n)
Out[4]:
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
In [5]:
# Make an empty NumPy array of length n (entries are uninitialized, so the
# values displayed are whatever happened to be in memory; here they are zeros)
np.empty(n)
Out[5]:
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
In [6]:
# Make a NumPy array filled with zeros the same shape as another NumPy array
my_ar = np.array([[1, 2], [3, 4]])
np.zeros_like(my_ar)
Out[6]:
array([[0, 0],
       [0, 0]])
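Two other creation functions worth knowing are np.arange and np.linspace, which build arrays of evenly spaced values. A brief sketch:

```python
import numpy as np

# Integers from 0 up to, but not including, 10 (like Python's range)
print(np.arange(10))          # [0 1 2 3 4 5 6 7 8 9]

# Five evenly spaced points from 0 to 1, endpoints included
print(np.linspace(0, 1, 5))   # [0.   0.25 0.5  0.75 1.  ]
```
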

NumPy has useful mathematical functions

So far, we have not done much mathematics with Python. We have done some adding and division, but nothing like computing a logarithm, cosine, or even square roots. Check it out!

In [15]:
# Exponential
np.exp(my_ar)
Out[15]:
array([  2.71828183,   7.3890561 ,  20.08553692,  54.59815003])
In [16]:
# Cosine
np.cos(my_ar)
Out[16]:
array([ 0.54030231, -0.41614684, -0.9899925 , -0.65364362])
In [17]:
# Square root
np.sqrt(my_ar)
Out[17]:
array([ 1.        ,  1.41421356,  1.73205081,  2.        ])

We can even do some matrix operations, like dot products.

In [19]:
np.dot(my_ar, my_ar)
Out[19]:
30
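For one-dimensional arrays, np.dot computes the inner product, as above. Applied to two-dimensional arrays, the same function gives the matrix-matrix product. A small sketch:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

# Matrix-matrix product of a with itself
print(np.dot(a, a))
# [[ 7 10]
#  [15 22]]
```
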

NumPy also has useful attributes, like np.pi.

In [18]:
np.pi
Out[18]:
3.141592653589793

SciPy has even more useful functions (in modules)

SciPy actually began life as a library of special functions that operate on NumPy arrays. For example, to compute the error function, we can use the scipy.special module, which contains lots of special functions. Note that you often have to individually import the SciPy module you want to use, for example with

import scipy.special
In [21]:
scipy.special.erf(my_ar)
Out[21]:
array([ 0.84270079,  0.99532227,  0.99997791,  0.99999998])
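The module has many more functions than erf. As one more example, scipy.special.gamma computes the gamma function, which generalizes the factorial, and it too is applied element by element to arrays:

```python
import numpy as np
import scipy.special

# gamma(n) = (n - 1)! for positive integers n
print(scipy.special.gamma(5))                             # 24.0

# Like other special functions, it works element by element
print(scipy.special.gamma(np.array([1.0, 2.0, 3.0, 4.0])))  # [1. 1. 2. 6.]
```
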

NumPy and SciPy are highly optimized

Importantly, NumPy and SciPy routines are often fast. To understand why, we need to think a bit about how your computer actually runs code you write.

Interpreted and compiled languages

We have touched on the fact that Python is an interpreted language. This means that the Python interpreter reads through your code, line by line, translates the instructions into machine code, and then executes them. In an interpreted language, code is often much easier to write, development time is much shorter, and debugging is often easier. By contrast, in compiled languages (the dominant ones being Fortran, C, and C++), your entire code is translated into machine language before you ever run it. When you execute your program, it is already in machine language. As a result, compiled code is often much faster than interpreted code. The speed difference depends largely on the task at hand, but can be a 100 to 1000-fold difference.

First, we'll demonstrate the difference between compiled and interpreted languages by looking at a function to sum the elements of an array. Note that Python is dynamically typed, so the function below works for multiple data types, but the C function works only for double precision floating point numbers.

In [24]:
# Python code to sum an array and print the result to the screen
print(sum(my_ar))
10
/* C code to sum an array and print the result to the screen */
#include <stdio.h>

void sum_array(double a[], int n);

void sum_array(double a[], int n)
{
    int i;
    double sum = 0;

    for (i = 0; i < n; i++) {
        sum += a[i];
    }

    printf("%f\n", sum);
}

The C code won't even execute without another function called main to call it. You should notice the difference in complexity of the code. Interpreted code is very often much easier to write!

NumPy and SciPy use compiled code!

Under the hood, when you call a NumPy or SciPy function, or use one of the methods, the Python interpreter passes the arrays into pre-compiled functions. (They are usually C or Fortran functions.) That means that you get to use an interpreted language with near-compiled speed! We can demonstrate the speed by comparing an explicit sum of elements of an array using a Python for loop versus NumPy. We will use the np.random module to generate a large array of random numbers (we will visit random number generation in a coming tutorial). We then use the %timeit magic function of IPython to time the execution of the sum of the elements of the array.

In [43]:
# Make array of 10,000 random numbers
x = np.random.random(10000)

# Sum with Python for loop
def python_sum(x):
    x_sum = 0.0
    for y in x:
        x_sum += y
    return x_sum

# Test speed
%timeit python_sum(x)
1000 loops, best of 3: 1.2 ms per loop

Now we'll do the same test with the NumPy implementation.

In [44]:
%timeit x.sum()
The slowest run took 6.48 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 6.39 µs per loop

Wow! We went from about 1 millisecond to about 6 microseconds! That's a speedup of nearly 200-fold!

Word of advice: use NumPy and SciPy

If you are writing code, and you think to yourself, "This seems like a pretty common thing to do," there is a good chance that someone really smart has written code to do it. If it's something numerical, there is a good chance it is in NumPy or SciPy. Use these packages. Do not reinvent the wheel. It is very rare you can beat them for performance, error checking, etc.

Furthermore, NumPy and SciPy are very well unit tested. In general, you do not need to write unit tests for well-established packages. Obviously, if you use NumPy or SciPy within your own functions, you still need to unit test what you wrote.
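As a sketch of what that can look like, here is a test for a hypothetical function center() that subtracts the mean from an array. The function and test names are made up for illustration; the key point is to compare floating point results with np.allclose or np.isclose rather than ==.

```python
import numpy as np

def center(x):
    """Subtract the mean from each element of an array (hypothetical example)."""
    return x - x.mean()

def test_center():
    x = np.array([1.0, 2.0, 3.0, 4.0])

    # Compare floating point arrays with np.allclose, not ==
    assert np.allclose(center(x), np.array([-1.5, -0.5, 0.5, 1.5]))

    # The centered array should have a mean of (nearly) zero
    assert np.isclose(center(x).mean(), 0.0)

test_center()
```
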