Lesson 10: Packages and modules
The Python Standard Library has lots of built-in modules that contain useful functions and data types for doing specific tasks. You can also use modules from outside the standard library. And you will undoubtedly write your own modules!
A module is contained in a file that ends with .py. This file can have classes, functions, and other objects. We will not discuss defining your own classes until much later in the bootcamp, so your modules will essentially just contain functions for now.
A package contains several related modules that are all grouped together under one name. We will extensively use the NumPy, SciPy, Pandas, and Bokeh packages, among others, in the bootcamp, and I’m sure you will also use them beyond. As such, the first module we will consider is NumPy. We will talk a lot more about NumPy later in the bootcamp.
Example: I want to compute the mean and median of a list of numbers
Say I have a list of numbers and I want to compute the mean. This happens all the time; you repeat a measurement multiple times and you want to compute the mean. We could write a function to do this.
[1]:
def mean(values):
    """Compute the mean of a sequence of numbers."""
    return sum(values) / len(values)
And it works as expected.
[2]:
print(mean([1, 2, 3, 4, 5]))
print(mean((4.5, 1.2, -1.6, 9.0)))
3.0
3.275
In addition to the mean, we might also want to compute the median, the standard deviation, etc. These seem like really common tasks. Remember my advice: if you want to do something that seems really common, a good programmer (or a team of them) probably already wrote something to do that. Means, medians, standard deviations, and lots and lots and lots of other numerical things are included in the NumPy package. To get access to it, we have to import it.
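To see why it pays to reach for a library, here is what a hand-rolled median might look like. This is just a sketch to show the bookkeeping involved (sorting, and handling odd versus even lengths separately):

```python
def median(values):
    """Compute the median of a sequence of numbers."""
    values_sorted = sorted(values)
    n = len(values_sorted)

    # Odd number of entries: take the middle value
    if n % 2 == 1:
        return values_sorted[n // 2]

    # Even number of entries: average the two middle values
    return (values_sorted[n // 2 - 1] + values_sorted[n // 2]) / 2
```

Every one of these details is a chance for a bug, which is exactly why we prefer a well-tested library function.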
[3]:
import numpy
That’s it! We now have the numpy module available for use. Remember, in Python everything is an object, so if we want to access the methods and attributes available in the numpy module, we use dot syntax. In a Jupyter notebook or in the JupyterLab console, you can type numpy. (note the dot) and hit tab, and you will see what is available. For NumPy, there is a huge number of options!
So, let’s try to use NumPy’s numpy.mean() function to compute a mean.
[4]:
print(numpy.mean([1, 2, 3, 4, 5]))
print(numpy.mean((4.5, 1.2, -1.6, 9.0)))
3.0
3.275
Great! We get the same values! Now, we can use the numpy.median() function to compute the median.
[5]:
print(numpy.median([1, 2, 3, 4, 5]))
print(numpy.median((4.5, 1.2, -1.6, 9.0)))
3.0
2.85
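The second sequence above has an even number of elements, so the 2.85 is the average of the two middle values. A quick side-by-side comparison (with illustrative values) makes the odd/even behavior explicit:

```python
import numpy

# Odd number of elements: the median is the middle value of the sorted data
print(numpy.median([1, 2, 3, 4, 5]))   # 3.0

# Even number of elements: the median is the average of the two middle values
print(numpy.median([1, 2, 3, 4]))      # (2 + 3) / 2 = 2.5
```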
This is nice. It gives the median, including when we have an even number of elements in the sequence of numbers, in which case it automatically interpolates. It is really important to know that it does this interpolation, since if you are not expecting it, it can give unexpected results. So, here is an important piece of advice:
Always check the doc strings of functions.
We can access the doc string of the numpy.median() function in JupyterLab by typing numpy.median? and looking at the output. An important part of that output:
Notes
-----
Given a vector ``V`` of length ``N``, the median of ``V`` is the
middle value of a sorted copy of ``V``, ``V_sorted`` - i.e.,
``V_sorted[(N-1)/2]``, when ``N`` is odd, and the average of the
two middle values of ``V_sorted`` when ``N`` is even.
This is where the documentation tells you that the median will be reported as the average of the two middle values when the number of elements is even. Note that you could also read the online documentation, which is a bit easier to read.
The as keyword
We use NumPy all the time. Typing numpy over and over again can get annoying. So, it is common practice to use the as keyword to import a module with an alias. NumPy’s alias is traditionally np, and this is the only alias you should ever use for NumPy.
[6]:
import numpy as np
np.median((4.5, 1.2, -1.6, 9.0))
[6]:
2.85
I prefer to do things this way, though some purists differ. We will use traditional aliases for major packages like Numpy and Pandas throughout the bootcamp.
Third party packages
Standard Python installations come with the standard library. NumPy and other useful packages are not in the standard library. Outside of the standard library, there are several packages available. Several. Ha! There are currently (June 12, 2023) about 470,000 packages available through the Python Package Index, PyPI. Usually, you can ask Google about what you are trying to do, and there is often a third party package to help you do it. The most useful (for scientific computing) and thoroughly tested packages and modules are available using conda. Others can be installed using pip, which we will do later in the bootcamp.
Writing your own module
To write your own module, you need to create a .py file and save it. You can do this using the text editor in JupyterLab. Let’s call our module na_utils, for “nucleic acid utilities.” So, we create a file called na_utils.py. We’ll build this module to have two functions, based on things we’ve already written. We’ll have a function dna_to_rna(), which converts a DNA sequence to an RNA sequence (just changes T to U), and another function reverse_rna_complement(), which returns the reverse RNA complement of a DNA template. The contents of na_utils.py should look as follows.
"""
Utilities for parsing nucleic acid sequences.
"""


def dna_to_rna(seq):
    """
    Convert a DNA sequence to RNA.
    """
    # Determine if original sequence was uppercase
    seq_upper = seq.isupper()

    # Convert to lowercase
    seq = seq.lower()

    # Swap out 't' for 'u'
    seq = seq.replace('t', 'u')

    # Return upper or lower case RNA sequence
    if seq_upper:
        return seq.upper()
    else:
        return seq


def reverse_rna_complement(seq):
    """
    Convert a DNA sequence into its reverse complement as RNA.
    """
    # Determine if original was uppercase
    seq_upper = seq.isupper()

    # Reverse sequence
    seq = seq[::-1]

    # Convert to upper
    seq = seq.upper()

    # Compute complement
    seq = seq.replace('A', 'u')
    seq = seq.replace('T', 'a')
    seq = seq.replace('G', 'c')
    seq = seq.replace('C', 'g')

    # Return result
    if seq_upper:
        return seq.upper()
    else:
        return seq
Note that the file starts with a doc string saying what the module contains.
I then have my two functions, each with doc strings. We will now import the module and then use these functions. In order for the import to work, the file na_utils.py must be in your present working directory, since this is where the Python interpreter will look for your module. In general, if you execute the code import my_module, the Python interpreter will look first in the pwd to find my_module.py.
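You can inspect the interpreter’s search path yourself using the standard library’s sys module. This is a quick sketch; the exact entries depend on your installation, but the pwd (or the running script’s directory) is typically first:

```python
import sys

# Directories Python searches, in order, when you import a module.
# Entries earlier in the list take precedence.
for directory in sys.path:
    print(directory)
```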
[7]:
import na_utils
# Sequence
seq = 'GACGATCTAGGCGACCGACTGGCATCG'
# Convert to RNA
na_utils.dna_to_rna(seq)
[7]:
'GACGAUCUAGGCGACCGACUGGCAUCG'
We can also compute the reverse RNA complement.
[8]:
na_utils.reverse_rna_complement(seq)
[8]:
'CGAUGCCAGUCGGUCGCCUAGAUCGUC'
Wonderful! You now have your own functioning module!
A quick note on error checking
These functions have minimal error checking of the input. For example, the dna_to_rna() function will take gibberish in and give gibberish out.
[9]:
na_utils.dna_to_rna('You can observe a lot by just watching.')
[9]:
'you can observe a lou by jusu wauching.'
In general, checking input and handling errors is an essential part of writing functions, and we will cover that in a later lesson.
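As a small taste of what such checking might look like, here is a hedged sketch of a validating variant of dna_to_rna(). The function name and the choice to raise a ValueError are illustrative, not the only reasonable design:

```python
def dna_to_rna_checked(seq):
    """Convert a DNA sequence to RNA, raising an error on invalid input."""
    # Verify the sequence contains only DNA bases (case-insensitive)
    if not set(seq.lower()) <= set("acgt"):
        raise ValueError("Input sequence contains non-DNA characters.")

    # Determine if original sequence was uppercase
    seq_upper = seq.isupper()

    # Convert to lowercase and swap 't' for 'u'
    seq = seq.lower().replace('t', 'u')

    # Return upper or lower case RNA sequence
    return seq.upper() if seq_upper else seq
```

With this version, the gibberish input above would raise a ValueError instead of silently returning nonsense.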
Importing modules in your .py files and notebooks
As our first foray into the glory of PEP 8, the Python style guide, we quote:
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.
Imports should be grouped in the following order:
1. standard library imports
2. related third party imports
3. local application/library specific imports
You should put a blank line between each group of imports.
You should follow this guide. I generally do it for Jupyter notebooks as well, with my first code cell having all of the imports I need. Therefore, going forward all of our lessons will have all necessary imports at the top of the document. The only exception is when we are explicitly demonstrating a concept that requires an import.
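For example, an import cell following this ordering might look like the sketch below. The local import is shown commented out so the cell runs even if na_utils.py is not in your working directory:

```python
# Standard library imports
import os
import sys

# Related third party imports
import numpy as np

# Local application/library specific imports
# (uncomment when na_utils.py is in your pwd)
# import na_utils
```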
Imports and updates
Once you have imported a module or package, the interpreter stores its contents in memory. You cannot update the contents of the package and expect the interpreter to know about the changes. You will need to restart the kernel and then import the package again in a fresh instance.
This can seem annoying, but it is good design. It ensures that the code you are running does not change as you execute a notebook. However, when developing modules, it is sometimes convenient to have an imported module reloaded as you edit it. To enable this, you can use the autoreload extension. To activate it, run the following in a code cell.
%load_ext autoreload
%autoreload 2
Whenever you run a cell, imported packages and modules will be automatically reloaded.
Package management
Your workflows may require many packages. For example, building this bootcamp requires 336 packages! These packages depend on each other in various ways. For example, Pandas, a package we will use extensively, requires NumPy, python-dateutil, and pytz, plus loads of optional dependencies. To make matters complicated, different versions of various packages depend on specific versions of their dependencies, which can form a tangled web of requirements. How can we handle this mess?
Package management systems solve this problem. The package management system we are using is Conda. When you want to install a new package, you can do so with a command like
conda install the-package-i-want
and Conda will make sure that all of the version numbers line up, updating or downgrading already installed packages to accommodate the new one. Conda also plays nicely with pip
, which can also be used to install packages.
The smaller the set of packages you need to manage, the better. Therefore, Conda allows you to set up environments. Each environment contains a set of packages with versions in them. In lesson 0, you set up an environment for this bootcamp that we have been using. Using environments for your projects, as opposed to a single monolithic base environment that has tons and tons of packages, is advantageous for several reasons.
1. By keeping the number of packages limited to those you need, you can avoid version clashes.
2. Projects may require specific versions of packages, which can be explicitly installed. In other environments, you can use different versions.
3. You can encode your environment in a YAML file that you can share (as we did to set up the bootcamp environment). This allows your collaborators to readily set up environments mirroring yours and more easily share packages.
Point 3 is very useful for pedagogical applications. It is very convenient when all students and TAs have the same packages installed!
I will not go through the details of how to use Conda (or Mamba, a related package manager) here, but rather refer you to Conda’s extensive documentation.
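As an illustration of point 3, a minimal environment YAML file might look like the following. The environment name, channel, and pinned versions here are hypothetical, not the actual bootcamp environment file:

```yaml
name: my-project
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy=1.24
  - pandas
```

Anyone with this file can recreate the environment with a command like `conda env create -f environment.yml`.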
Computing environment
[10]:
%load_ext watermark
%watermark -v -p numpy,jupyterlab
Python implementation: CPython
Python version : 3.11.3
IPython version : 8.12.0
numpy : 1.24.3
jupyterlab: 3.6.3