(c) 2018 Justin Bois and Davi Ortega. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This lesson was generated from a Jupyter notebook. You can download the notebook here.
To explore another feature of pytest, we'll consider another aspect of our number_negatives() function. Specifically, what should we do if an invalid sequence is entered? A sensible thing to do in this case is to make our software throw a RuntimeError.
Again, in designing our test, we need to think about what constitutes an invalid sequence. We'll only allow the 20 symbols for the residues that we used in previous lessons, which are the keys of the bootcamp_utils.aa dictionary. So, we adjust our test function accordingly. A plain assert statement compares values and cannot by itself check that an error was raised, so we use the pytest.raises() function. It takes the type of exception expected as its first argument, and we use it as a context manager: the code that should raise the exception goes inside the with block.
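It is important to draw the distinction between assertions and raising exceptions in our code. An assertion states a condition that must hold if the code is correct, and is mainly a tool for tests and debugging. Raising an exception is how the code itself reports a problem, such as invalid input, to its caller.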
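To make the distinction concrete, here is a minimal sketch. The function count_negatives() is a hypothetical stand-in written only for this illustration; it is not the number_negatives() function from seq_features.py.

```python
def count_negatives(seq):
    """Hypothetical stand-in for number_negatives(), for illustration only."""
    seq = seq.upper()
    # Raising an exception: the function itself reports bad input to its caller.
    if 'Z' in seq:
        raise RuntimeError('Z is not a valid amino acid.')
    return seq.count('E') + seq.count('D')


# Assertion: the test states something that must be true of the result.
assert count_negatives('EDK') == 2

# An exception can be caught and inspected by whoever called the function.
try:
    count_negatives('AZK')
except RuntimeError as err:
    message = str(err)
else:
    message = None
```

The assert lives in the test and checks a result, while the raise lives in the function and reports a problem with its input.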
We should then add the following to test_seq_features.py to encode our expectation that the program should throw a RuntimeError if an invalid sequence is entered:
def test_number_negatives_for_invalid_amino_acid():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.number_negatives('Z')
    excinfo.match("Z is not a valid amino acid")
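For intuition, the logic of this test can be written out roughly in plain Python. This is only a sketch: raises_with_message() and toy_number_negatives() are hypothetical names invented here, and pytest reports a failure instead of returning False.

```python
import re


def raises_with_message(exc_type, pattern, func, *args):
    """Hypothetical helper: True if func(*args) raises exc_type and the
    error message contains a match for the regular expression pattern."""
    try:
        func(*args)
    except exc_type as err:
        # Like excinfo.match(), search the message with a regular expression.
        return re.search(pattern, str(err)) is not None
    # No exception was raised, so the expectation is not met.
    return False


def toy_number_negatives(seq):
    """Hypothetical stand-in that rejects the letter Z."""
    seq = seq.upper()
    if 'Z' in seq:
        raise RuntimeError('Z is not a valid amino acid.')
    return seq.count('E') + seq.count('D')


caught = raises_with_message(
    RuntimeError, "Z is not a valid amino acid", toy_number_negatives, 'Z'
)
```

As with pytest.raises(), an exception of a different type is not caught by the helper and propagates to the caller.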
We also have to include import pytest at the beginning of the test_seq_features.py file. It is clear that if Z is passed as the input sequence, the program should throw a RuntimeError saying "Z is not a valid amino acid". Let's test:
!pytest -v
Although the other four tests still pass, the last one fails because our program does not yet know how to throw a RuntimeError when it receives an invalid sequence as input. Let's fix that. Adjust the function in the seq_features.py file to be as follows.
def number_negatives(seq):
    """Number of negative residues in a protein sequence"""
    # Convert sequence to upper case
    seq = seq.upper()

    if seq == 'Z':
        raise RuntimeError('Z is not a valid amino acid.')

    # Count E's and D's, since these are the negative residues
    return seq.count('E') + seq.count('D')
Now, re-running the test...
!pytest -v
Obviously, this is not a very robust fix; it only works if the entire sequence is the single invalid residue Z. We need a smarter way to fix this. What about using the bootcamp_utils.aa dictionary from before? Adjust the contents of your seq_features.py file as follows.
import bootcamp_utils


def number_negatives(seq):
    """Number of negative residues in a protein sequence"""
    # Convert sequence to upper case
    seq = seq.upper()

    # Check for a valid sequence
    for aa in seq:
        if aa not in bootcamp_utils.aa.keys():
            raise RuntimeError(aa + ' is not a valid amino acid.')

    # Count E's and D's, since these are the negative residues
    return seq.count('E') + seq.count('D')
Now let's run pytest one more time.
!pytest -v
Hurray! Everything passed beautifully.
Now that you have some experience with TDD and have an idea about what it is and how it works, let's formalize things by writing out the basic principles of test-driven development.
Let's now write a function that calculates the total number of positive charges. In other words, let's count the number of lysine (K), arginine (R), and histidine (H) residues in the sequence.
To do that, let's make the prototype function and add it to seq_features.py:
def number_positives(seq):
    """Number of positive residues in a protein sequence"""
    pass
and now, let's build a simple test and add it to test_seq_features.py:
def test_number_positives_single_R_K_or_H():
    """Perform unit tests on number_positives for single AA"""
    assert seq_features.number_positives('R') == 1
    assert seq_features.number_positives('K') == 1
    assert seq_features.number_positives('H') == 1
and let's test.
!pytest -v
The new test fails, as expected, because the prototype does nothing but return None. Let's fix our function.
def number_positives(seq):
    """Number of positive residues in a protein sequence"""
    # Count R's, K's and H's, since these are the positive residues
    return seq.count('R') + seq.count('K') + seq.count('H')
And test again...
!pytest -v
Now, we want the number_positives() function to behave like number_negatives() in the unusual cases, so let's add the tests below to test_seq_features.py.
def test_number_positives_for_empty():
    """Perform unit tests on number_positives for empty entry"""
    assert seq_features.number_positives('') == 0


def test_number_positives_for_short_sequences():
    """Perform unit tests on number_positives for short sequence"""
    assert seq_features.number_positives('RCKLWTTRE') == 3
    assert seq_features.number_positives('DDDDEEEE') == 0


def test_number_positives_for_lowercase():
    """Perform unit tests on number_positives for lowercase"""
    assert seq_features.number_positives('rcklwttre') == 3


def test_number_positives_for_invalid_amino_acid():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.number_positives('Z')
    excinfo.match("Z is not a valid amino acid")
Let's test it.
!pytest -v
Although the current version of number_positives() passes most of the tests, it does not yet handle the edge cases (lowercase input and invalid amino acids).
We can fix that easily; let's update number_positives()...
def number_positives(seq):
    """Number of positive residues in a protein sequence"""
    # Convert sequence to upper case
    seq = seq.upper()

    # Check for a valid sequence
    for aa in seq:
        if aa not in bootcamp_utils.aa.keys():
            raise RuntimeError(aa + ' is not a valid amino acid.')

    # Count R's, K's and H's, since these are the positive residues
    return seq.count('R') + seq.count('K') + seq.count('H')
...and run the test one more time:
!pytest -v
As we are building modules and functions, we are not able to anticipate all the functionalities they must have. And by adding new functionalities, we might need to change our code substantially and even dramatically change the initial logic that worked so well up to this point. This is so common in programming that developers have a name for it: code refactoring.
For example, when we started writing seq_features, we did not anticipate that we would also want to count the positive charges. And now that we have two functions, we have broken one of the most important rules in programming: a function should do one thing, and do it well. It is clear that number_negatives() was doing three things:
1. Converting the sequence to upper case.
2. Checking that the sequence is valid.
3. Counting the negative residues.
It turns out that number_positives() also needs to do items 1 and 2, and because of that we have repeated the following lines of code in two different functions within the same module:
# Convert sequence to upper case
seq = seq.upper()

# Check for a valid sequence
for aa in seq:
    if aa not in bootcamp_utils.aa.keys():
        raise RuntimeError(aa + ' is not a valid amino acid.')
and if we are trying to make this module more robust, every time we catch a bug, we will need to change identical code in two places. So let's perform a code refactoring in order to keep the principle of functions doing only one thing as close to the truth as possible.
The first task, converting the input sequence to uppercase, uses a built-in string method, and wrapping it in another function is unnecessary. So, we can keep the seq = seq.upper() line in the functions.
Now, let's write a function that checks whether the sequence is valid. That way, all the logic related to checking for invalid sequences lives in one part of the code, and we can call it anywhere we need it. Your module seq_features.py should then look like this:
import bootcamp_utils


def is_valid_sequence(seq):
    """Raise a RuntimeError if seq contains an invalid amino acid"""
    for aa in seq:
        if aa not in bootcamp_utils.aa.keys():
            raise RuntimeError(aa + ' is not a valid amino acid.')


def number_negatives(seq):
    """Number of negative residues in a protein sequence"""
    # Convert sequence to upper case
    seq = seq.upper()

    # Check for a valid sequence
    is_valid_sequence(seq)

    # Count E's and D's, since these are the negative residues
    return seq.count('E') + seq.count('D')


def number_positives(seq):
    """Number of positive residues in a protein sequence"""
    # Convert sequence to upper case
    seq = seq.upper()

    # Check for a valid sequence
    is_valid_sequence(seq)

    # Count R's, K's and H's, since these are the positive residues
    return seq.count('R') + seq.count('K') + seq.count('H')
Now let's include two new tests in test_seq_features.py.
def test_number_negatives_for_invalid_amino_acid_anywhere():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.number_negatives('AZK')
    excinfo.match("Z is not a valid amino acid")


def test_number_positives_for_invalid_amino_acid_anywhere():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.number_positives('AZK')
    excinfo.match("Z is not a valid amino acid")
!pytest -v
There we have it. All the tests pass, and even though we changed our code to accommodate new demands, we can be confident that it still works the way it was first intended, in addition to providing the new functionality.
In addition, we don't need to write further validation tests for number_negatives() and number_positives(), because these functions are no longer supposed to be responsible for this task.
Important note: Developers take changes to existing tests VERY seriously. Refactoring tests is frowned upon; it is a very big responsibility and should be done carefully, if ever. Keep adding tests related to is_valid_sequence(), but do not remove the previous tests already in the suite.
So, let's add the exception tests for is_valid_sequence() to test_seq_features.py:
def test_is_valid_sequence_for_invalid_amino_acid():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.is_valid_sequence('Z')
    excinfo.match("Z is not a valid amino acid")


def test_is_valid_sequence_for_invalid_amino_acid_anywhere():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.is_valid_sequence('AZK')
    excinfo.match("Z is not a valid amino acid")
and test it:
!pytest -v
We should write more careful tests for is_valid_sequence() to cover more possible errors than just having a Z in a sequence. This is nice: now we only need to write a single test function for it, rather than two, one for number_negatives() and another for number_positives().
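One way to think about such coverage is as a table of inputs paired with the character we expect to be rejected. The sketch below runs on its own under two assumptions: VALID_AA is a stand-in for the keys of bootcamp_utils.aa (assumed here to be the 20 standard one-letter codes), and first_invalid() is a hypothetical helper that is not part of seq_features.py.

```python
# Assumption: the 20 standard one-letter codes, standing in for the keys of
# bootcamp_utils.aa so that this sketch does not need that module.
VALID_AA = set('ACDEFGHIKLMNPQRSTVWY')


def is_valid_sequence(seq):
    """Raise a RuntimeError at the first invalid amino acid in seq."""
    for aa in seq:
        if aa not in VALID_AA:
            raise RuntimeError(aa + ' is not a valid amino acid.')


def first_invalid(seq):
    """Hypothetical helper: return the rejected character, or None if valid."""
    try:
        is_valid_sequence(seq)
    except RuntimeError as err:
        # The message starts with the offending character.
        return str(err)[0]
    return None


# Each case pairs an input with the character expected to be rejected.
cases = [('ALKSAYGS', None), ('AZLL', 'Z'), ('ALLBJ', 'B'), ('AL%J', '%')]
results = [first_invalid(seq) for seq, _ in cases]
```

Note that for 'ALLBJ' only B is reported, because the check stops at the first invalid character.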
First we add some tests.
def test_is_valid_sequence_for_other_invalid_amino_acid_anywhere():
    assert seq_features.is_valid_sequence('ALKSAYGS') is None

    with pytest.raises(RuntimeError) as excinfo:
        seq_features.is_valid_sequence('AZLL')
    excinfo.match("Z is not a valid amino acid")

    with pytest.raises(RuntimeError) as excinfo:
        seq_features.is_valid_sequence('ALLBJ')
    excinfo.match("B is not a valid amino acid")

    with pytest.raises(RuntimeError) as excinfo:
        seq_features.is_valid_sequence('AL%J')
    excinfo.match("% is not a valid amino acid")
And let's run them!
!pytest -v
There are many more features of pytest that will address most issues you will encounter while working on your program. It is very well documented, so you can use its documentation to develop tests for your code.
This lesson was initially heavily based on Katy Huff's Software Carpentry Tutorial and our own experience. Check it out for more details on TDD.
Finally, the next real step is for you to learn continuous integration (CI) and how to package your program and publish it (possibly on the Python Package Index, or just hosted on GitHub). An interesting shortcut for that is to use the Cookiecutter package.
Please feel free to contact us if you want more information about CI. Have fun, and always test and share your code.