Lesson 10: Packages and modules

(c) 2016 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook.

The Python Standard Library has lots of built-in modules that contain useful functions and data types for doing specific tasks. You can also use modules that other people write. And you will undoubtedly write your own modules!

A module is contained in a file that ends with .py. This file can have classes, functions, and other objects. We will not discuss defining your own classes in the bootcamp, so your modules will essentially just contain functions.

A package contains several related modules that are all grouped together under one name. We will extensively use the NumPy, SciPy, Pandas, and Matplotlib packages, among others, in the bootcamp, and I'm sure you will also use them beyond it. As such, the first package we will consider is NumPy. We will talk a lot more about NumPy later in the bootcamp.

Example: I want to compute the mean and median of a list of numbers

Say I have a list of numbers and I want to compute the mean. This happens all the time; you repeat a measurement multiple times and you want to compute the mean. We could write a function to do this. In fact, you did this in the exercises.

In [14]:
def mean(values):
    """Compute the mean of a sequence of numbers."""
    return sum(values) / len(values)

And it works as expected.

In [19]:
print(mean([1, 2, 3, 4, 5]))
print(mean((4.5, 1.2, -1.6, 9.0)))
3.0
3.275

In addition to the mean, we might also want to compute the median, the standard deviation, etc. These seem like really common tasks. Remember my advice: if you want to do something that seems really common, a good programmer (or a team of them) probably already wrote something to do that. Functions for computing means, medians, standard deviations, and lots and lots and lots of other numerical things are included in the NumPy package. To get access to it, we have to import it.

In [20]:
import numpy

That's it! We now have the numpy module available for use. Remember, in Python everything is an object, so if we want to access the functions and attributes available in the numpy module, we use dot syntax. If we're using IPython, we can type

numpy.

(note the dot) and hit tab, and we will see what is available. For Numpy, there is a huge number of options!
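Outside of IPython, a rough analog of tab completion is the built-in dir() function, which lists the names a module exposes. The sketch below uses the standard library's math module instead of NumPy only so that it runs on any Python installation; the variable name public_names is just for illustration.

```python
import math

# dir() returns the names defined in the module; skip the "private"
# ones that start with an underscore.
public_names = [name for name in dir(math) if not name.startswith('_')]

print(len(public_names), 'public names in math')
print(public_names[:5])

# Dot syntax then reaches any one of them.
print(math.sqrt(16.0))
```

The same call, dir(numpy), works after importing NumPy, though the list is much longer.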

So, let's try to use Numpy's numpy.mean() function to compute a mean.

In [21]:
print(numpy.mean([1, 2, 3, 4, 5]))
print(numpy.mean((4.5, 1.2, -1.6, 9.0)))
3.0
3.275

Great! We get the same values! Now, we can use the numpy.median() function to compute the median.

In [22]:
print(numpy.median([1, 2, 3, 4, 5]))
print(numpy.median((4.5, 1.2, -1.6, 9.0)))
3.0
2.85

This is nice. It gives the median, including when we have an even number of elements in the sequence of numbers, in which case it automatically averages the two middle values.
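To see what is going on for an even number of elements, here is a plain-Python sketch of a median function. It is only meant to illustrate the idea (sort, then take the middle value or the average of the two middle values); it is not how NumPy implements numpy.median() internally.

```python
def median(values):
    """Compute the median of a nonempty sequence of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2

    if n % 2 == 1:
        # Odd number of elements: the middle one is the median
        return ordered[middle]

    # Even number of elements: average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([1, 2, 3, 4, 5]))
print(median((4.5, 1.2, -1.6, 9.0)))
```

For the second sequence, the sorted values are -1.6, 1.2, 4.5, and 9.0, so the median is the average of 1.2 and 4.5.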

The as keyword

We use Numpy all the time. Typing numpy over and over again can get annoying. So, it is common practice to use the as keyword to import a module with an alias. Numpy's alias is traditionally np.

In [24]:
import numpy as np

np.median((4.5, 1.2, -1.6, 9.0))
Out[24]:
2.8500000000000001

I prefer to do things this way, though some purists differ. We will use traditional aliases for major packages like SciPy, scikit-image, Matplotlib, Seaborn, and Pandas throughout the bootcamp.
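An alias is just another name for the same module object. Here is a small sketch using the standard library's statistics module so that it runs without NumPy installed; the alias stats is an arbitrary choice for this example, not an established convention.

```python
import statistics
import statistics as stats

# Both names refer to the very same module object
print(stats is statistics)

# So the alias gives access to exactly the same functions
print(stats.median((4.5, 1.2, -1.6, 9.0)))
```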

Third party packages

Standard Python installations come with the standard library. NumPy and other useful packages are not in the standard library. Outside of the standard library, there are several packages available. Several. Ha! As of this writing in 2016, there are about 88,000 packages available through the Python Package Index, PyPI. Usually, you can ask Google about what you are trying to do, and there is often a third-party module to help you do it. The most useful (for scientific computing) and thoroughly tested packages and modules are available using conda.
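If you are not sure whether a third-party package is installed in your environment, you can ask Python without actually importing it. The sketch below uses importlib.util.find_spec() from the standard library; the helper name is_installed and the made-up package name not_a_real_package_xyz are assumptions for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module or package can be found."""
    return importlib.util.find_spec(module_name) is not None


# math ships with Python; numpy may or may not be installed;
# the last name is made up and should not be found.
for name in ['math', 'numpy', 'not_a_real_package_xyz']:
    print(name, is_installed(name))
```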

Writing your own module

To write your own module, you need to create a .py file and save it. Let's call our module dnatorna. So, we create a file called dnatorna.py. We'll build this module to have two functions, based on things we've already written. We'll have a function rna(), which converts a DNA sequence to an RNA sequence (just changes T to U), and another function reverse_rna_complement(), which returns the reverse RNA complement of a DNA template. The contents of dnatorna.py should look as follows (ignoring the first line, which was used to load the contents of the module into this Jupyter notebook).

In [27]:
# %load dnatorna.py
"""
Convert DNA sequences to RNA.
"""

def rna(seq):
    """
    Convert a DNA sequence to RNA.
    """

    # Determine if original sequence was uppercase
    seq_upper = seq.isupper()

    # Convert to lowercase
    seq = seq.lower()

    # Swap out 't' for 'u'
    seq = seq.replace('t', 'u')

    # Return upper or lower case RNA sequence
    if seq_upper:
        return seq.upper()
    else:
        return seq


def reverse_rna_complement(seq):
    """
    Convert a DNA sequence into its reverse complement as RNA.
    """

    # Determine if original was uppercase
    seq_upper = seq.isupper()

    # Reverse sequence
    seq = seq[::-1]

    # Convert to upper
    seq = seq.upper()

    # Compute complement (the replacements are lowercase so that a
    # base that was already swapped is not swapped again)
    seq = seq.replace('A', 'u')
    seq = seq.replace('T', 'a')
    seq = seq.replace('G', 'c')
    seq = seq.replace('C', 'g')

    # Return result
    if seq_upper:
        return seq.upper()
    else:
        return seq

Note that the file starts with a doc string. Here's a rule.

All modules should start with doc strings.

I then have my two functions, each with doc strings. We will now import the module and then use these functions.

In [28]:
import dnatorna

# Sequence
seq = 'GACGATCTAGGCGACCGACTGGCATCG'

# Convert to RNA
dnatorna.rna(seq)
Out[28]:
'GACGAUCUAGGCGACCGACUGGCAUCG'

We can also compute the reverse RNA complement.

In [29]:
dnatorna.reverse_rna_complement(seq)
Out[29]:
'CGAUGCCAGUCGGUCGCCUAGAUCGUC'

Wonderful! You now have your own functioning module!
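One payoff of writing doc strings is that Python stores them where you (and tools like help() and IPython's ? operator) can find them, in the __doc__ attribute. The function below is a simplified stand-in for the rna() function in the module, written inline only so the sketch is self-contained.

```python
import math


def rna(seq):
    """Convert a DNA sequence to RNA."""
    return seq.replace('T', 'U').replace('t', 'u')


# The doc string is stored on the function itself
print(rna.__doc__)

# Modules carry their doc strings in the same attribute
print(math.__doc__)
```

After importing your own module, dnatorna.__doc__ and dnatorna.rna.__doc__ give the doc strings you wrote in the same way.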

Importing modules in your .py files

As our first foray into the glory of PEP 8, the Python style guide, we quote:

Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.

Imports should be grouped in the following order:

  1. standard library imports
  2. related third party imports
  3. local application/library specific imports

You should put a blank line between each group of imports.

You should follow this guide. Therefore, going forward all of our lessons will have all necessary imports at the top of the document. The only exception is when we are explicitly demonstrating a concept that requires an import.

PYTHONPATH

When we wrote the dnatorna module, we stored it in the directory that we were working in, or the pwd. But what if you have a directory on your machine where you like to keep your coding projects? (Actually, you will definitely have such a thing after we teach you about version control with Git in the next lesson.) To allow for this, you should set your $PYTHONPATH environment variable. This tells Python/IPython where to look for your modules. conda takes care of this for you for anything managed with conda, but obviously not for your own packages. In the lesson on command line skills, you set up a directory for your bootcamp files with the path ~/git/bootcamp/. To access the modules you write and keep in that directory, you need to include it in your PYTHONPATH. To do that, do the following from the command line:

export PYTHONPATH=${PYTHONPATH}:$HOME/git/bootcamp

Then, anything you put in that directory will be available to the Python interpreter. You can also put this in your .bashrc file so it is always available.
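Inside Python, the directories from PYTHONPATH end up in the list sys.path, which is what the interpreter searches when you import something. The sketch below mimics this for a single session: it writes a tiny module to a temporary directory, adds that directory to sys.path, and imports it. The module name tiny_dnatorna and its contents are hypothetical, made up for this example.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical module contents, written to a throwaway directory
module_source = '''"""Tiny demo module."""

def rna(seq):
    """Convert an uppercase DNA sequence to RNA."""
    return seq.replace('T', 'U')
'''

module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, 'tiny_dnatorna.py'), 'w') as f:
    f.write(module_source)

# Directories listed in PYTHONPATH show up in sys.path at startup;
# appending here has the same effect for the current session only.
sys.path.append(module_dir)
importlib.invalidate_caches()

import tiny_dnatorna

print(tiny_dnatorna.rna('GATTACA'))
```

Printing sys.path in a fresh interpreter is also a quick way to check that your export line took effect.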