(c) 2016 Justin Bois and Shyam Saladi. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This exercise was generated from a Jupyter notebook. You can download the notebook here.
import re
import pytest
import numpy as np
import pandas as pd
In Lesson 6, we discussed iteration; now that you're masters of the for
loop, it probably seems like years ago. As part of the lesson, we developed code to search for a codon (a three base nucleotide sequence) with a provided sequence using a while loop. It's wrapped into a function below:
def find_codon_lesson6(codon, seq):
"""Find a specified codon with a given sequence."""
i = 0
# Scan sequence until we hit the start codon or the end of the sequence
while seq[i:i+3] != codon and i < len(seq):
i += 1
if i == len(seq):
return -1
return i
Let's use the the coding and protein sequence for an isoform of the fruit fly (D. melanogaster)'s synaptobrevin, a super important part of the SNARE complex in cells! See here for more detail. These sequences are directly from the NIH's National Center for Biotechnology Information, the largest repository of biological information around.
http://www.ncbi.nlm.nih.gov/nuccore/665400015?from=147&to=476&report=fasta
>gi|665400015:147-476 Drosophila melanogaster synaptobrevin, transcript variant D (Syb), mRNA
ATGGAGAACAACGAAGCCCCCTCCCCCTCGGGATCCAACAACAACGAGAACAACAATGCAGCCCAGAAGA
AGCTGCAGCAGACCCAAGCCAAGGTGGACGAGGTGGTCGGGATTATGCGTGTGAACGTGGAGAAGGTCCT
GGAGCGGGACCAGAAGCTATCGGAACTGGGCGAGCGTGCGGATCAGCTGGAGCAGGGAGCATCCCAGTTC
GAGCAGCAGGCCGGCAAGCTGAAGCGCAAGCAATGGTGGGCCAACATGAAGATGATGATCATTCTGGGCG
TGATAGCCGTTGTGCTGCTCATCATCGTTCTGGTGTCGCTTTTCAATTGA
http://www.ncbi.nlm.nih.gov/protein/221330128?report=fasta
>gi|221330128|ref|NP_001137634.1| synaptobrevin, isoform D [Drosophila melanogaster]
MENNEAPSPSGSNNNENNNAAQKKLQQTQAKVDEVVGIMRVNVEKVLERDQKLSELGERADQLEQGASQF
EQQAGKLKRKQWWANMKMMIILGVIAVVLLIIVLVSLFN
a) Does find_codon_lesson6()
work as you would like it to? To answer this question think about the following questions:
- Where does it find
ATG
in the nucleotide sequence provided above? Is the result correct?- How about
AAT
?- How about the codons
TGT
andTGT
? What residue do these last two codons encode (for your convenience, standard genetic code)?- Does this residue exist in the protein sequence above?
If you'd like, go ahead and use another tool to translate the nucleotide sequence to confirm yourself.
b) Unfortunately, at this point, you will have found a bug! Formalize this test by writing a new function test_find_codon(find_codon)
. It takes a single argument the function that is being tested. Write assert
statements to check if find_codon
returns the expected value for the 4 codons discussed above (ATG
, AAT
, TGT
, TGC
).
Remember, assert
statements will raise an exception if they evaluate to False
.
c) Code up a new function to find a specified codon within a given sequence by revising find_codon_lesson6
. Of course, you should test it with test_find_codon()
after it's written.
a)
synapto_nuc = (
"ATGGAGAACAACGAAGCCCCCTCCCCCTCGGGATCCAACAACAACGAGAACAACAATGCAGCCCAGAAGA"
"AGCTGCAGCAGACCCAAGCCAAGGTGGACGAGGTGGTCGGGATTATGCGTGTGAACGTGGAGAAGGTCCT"
"GGAGCGGGACCAGAAGCTATCGGAACTGGGCGAGCGTGCGGATCAGCTGGAGCAGGGAGCATCCCAGTTC"
"GAGCAGCAGGCCGGCAAGCTGAAGCGCAAGCAATGGTGGGCCAACATGAAGATGATGATCATTCTGGGCG"
"TGATAGCCGTTGTGCTGCTCATCATCGTTCTGGTGTCGCTTTTCAATTGA")
synapto_prot = (
"MENNEAPSPSGSNNNENNNAAQKKLQQTQAKVDEVVGIMRVNVEKVLERDQKLSELGERADQLEQGASQF"
"EQQAGKLKRKQWWANMKMMIILGVIAVVLLIIVLVSLFN")
print('ATG :', find_codon_lesson6('ATG', synapto_nuc))
print('AAT :', find_codon_lesson6('AAT', synapto_nuc))
print('TGT :', find_codon_lesson6('TGT', synapto_nuc))
print('TGC :', find_codon_lesson6('TGC', synapto_nuc))
To a first approximation, all of these results look right. The following is an excerpt of the standard genetic code:
Codon | Amino acid |
---|---|
ATG | M |
AAT | N |
TGT | C |
TGC | C |
Let's check if these actually exist in the protein sequence:
print("M : ", synapto_prot.find("M"))
print("N : ", synapto_prot.find("N"))
print("C : ", synapto_prot.find("C"))
No C
in the protein sequence! What happened? Let's separate the nucleotide sequence into codons to check. We can check this with a for
loop through the sequence, or we can use the wonderful magic of regular expressions (covered last year).
all_codons = re.findall('...', synapto_nuc)
print('ATG :', 'ATG' in all_codons)
print('AAT :', 'AAT' in all_codons)
print('TGT :', 'TGT' in all_codons)
print('TGC :', 'TGC' in all_codons)
b) We'll write the function with the synaptobrevin sequence as the test and run the test.
def test_find_codon(find_codon):
"""
A function to test another function that looks for a codon
within a coding sequence
"""
synapto_nuc = (
"ATGGAGAACAACGAAGCCCCCTCCCCCTCGGGATCCAACAACAACGAGAACAACAATGCAGCCCAGAAGA"
"AGCTGCAGCAGACCCAAGCCAAGGTGGACGAGGTGGTCGGGATTATGCGTGTGAACGTGGAGAAGGTCCT"
"GGAGCGGGACCAGAAGCTATCGGAACTGGGCGAGCGTGCGGATCAGCTGGAGCAGGGAGCATCCCAGTTC"
"GAGCAGCAGGCCGGCAAGCTGAAGCGCAAGCAATGGTGGGCCAACATGAAGATGATGATCATTCTGGGCG"
"TGATAGCCGTTGTGCTGCTCATCATCGTTCTGGTGTCGCTTTTCAATTGA")
assert find_codon('ATG', synapto_nuc) == 0
assert find_codon('AAT', synapto_nuc) == 54
assert find_codon('TGT', synapto_nuc) == -1
assert find_codon('TGC', synapto_nuc) == -1
return None
test_find_codon(find_codon_lesson6)
We get failure, as we'd expect. We now have a good test for a new function we would write.
c) We can now code up a new function to make sure that we are only looking at in register codons after the start codon.
def find_codon_new(codon, seq):
"""Find a specified codon with a given sequence
"""
i = 0
# Scan sequence until we hit the start codon or the end of the sequence
while seq[i:i+3] != codon and i < len(seq):
i += 3
if i == len(seq):
return -1
return i
test_find_codon(find_codon_new)
Hurray! It passed! We would of course want to write many more tests to make sure that the function is working properly, including edge cases. Edge cases are cases where inputs take extreme values (such an empty string for a sequence or something with no start codon).
In biochemistry, we often measure dissociation constants ($K_\mathrm{d}$) using various methods, usually involving titration. A dissociation constant is the equilibrium constant for the chemical reaction
\begin{align} \mathrm{ab} \rightleftharpoons \mathrm{a} + \mathrm{b}. \end{align}It is related to the equilibrium concentration of the species a, b, and ab by
\begin{align} K_\mathrm{d} = \frac{c_\mathrm{a}\,c_\mathrm{b}}{c_{\mathrm{ab}}}. \end{align}To analyze a titration curve, we often have to compute the equilibrium concentrations of the respective species given the parameters $K_\mathrm{d}$, $c_\mathrm{a}^0$, and $c_\mathrm{b}^0$, where the latter two parameters are the total concentrations of species a and b, respectively. This is typically known, as this is what the experimenter pipetted into the reaction solution.
As you probably worked out in your undergraduate general chemistry course, the respective equilibrium concentrations can be solved to be
\begin{align} c_\mathrm{ab} &= \frac{1}{2}\left(c_\mathrm{a}^0 + c_\mathrm{b}^0 + K_\mathrm{d} - \sqrt{\left(c_\mathrm{a}^0 + c_\mathrm{b}^0 + K_\mathrm{d}\right)^2 - 4c_\mathrm{a}^0 c_\mathrm{b}^0}\right), \\[1em] c_\mathrm{a} &= c_\mathrm{a}^0 - c_\mathrm{ab} \\[1em] c_\mathrm{b} &= c_\mathrm{b}^0 - c_\mathrm{ab} \end{align}Your job is to write a function that computes the three concentrations, $c_\mathrm{a}$, $c_\mathrm{b}$, and $c_\mathrm{ab}$ given inputs $K_\mathrm{d}$, $c_\mathrm{a}^0$, and $c_\mathrm{b}^0$. Use the principles of TDD to develop it from the start. That is, define the call signature, write your first test, and make it fail. Then incrementally build your tests with your function.
There is one thing interesting about the way you define your tests for this function. You can check that the solution satisfies the equations describing chemical equilibrium and conservation of mass.
\begin{align} K_\mathrm{d} &= \frac{c_\mathrm{a}\,c_\mathrm{b}}{c_{\mathrm{ab}}}, \\[1em] c_\mathrm{a}^0 &= c_\mathrm{a} + c_\mathrm{ab}, \\[1em] c_\mathrm{b}^0 &= c_\mathrm{b} + c_\mathrm{ab}. \end{align}Sometimes you have this luxury, a mathematical statement that must hold for the output of your function. This makes writing tests a bit easier and more intuitive. As a hint, for checking closeness NumPy arrays, use the np.allclose()
function.
We first specify the function call signature and write our first test. The trivial example is when all entries are zero. We make it fail for now.
def test_dissoc_equil():
"""Tests for dissoc_equil()."""
assert np.isclose(dissoc_equil(1, 0, 0), np.array([0, 0, 0])).all()
return None
def dissoc_equil(Kd, ca0, cb0):
"""Compute equilibrium for dissociation reaction."""
return np.array([1, 1, 1])
# Make it fail
test_dissoc_equil()
Now, we'll put together the function to return something useful.
def dissoc_equil(Kd, ca0, cb0):
"""Compute equilibrium for dissociation reaction."""
# Compute cab from quadratic formula
b = ca0 + cb0 + Kd
cab = (b - np.sqrt(b**2 - 4*ca0*cb0)) / 2
# Compute ca and cb from conservation of mass
ca = ca0 - cab
cb = cb0 - cab
return ca, cb, cab
Now, let's try our test again.
test_dissoc_equil()
It passed! Let's try writing some more test. Let's try another trivial one where every input is zero.
def test_dissoc_equil():
"""Tests for dissoc_equil()."""
assert np.allclose(dissoc_equil(1, 0, 0), np.array([0, 0, 0]))
assert np.allclose(dissoc_equil(0, 0, 0), np.array([0, 0, 0]))
return None
test_dissoc_equil()
Now, how about with $K_\mathrm{d} = 0$ and nonzero concentrations?
def test_dissoc_equil():
"""Tests for dissoc_equil()."""
assert np.allclose(dissoc_equil(1, 0, 0), np.array([0, 0, 0]))
assert np.allclose(dissoc_equil(0, 0, 0), np.array([0, 0, 0]))
assert np.allclose(dissoc_equil(0, 1, 1), np.array([0, 0, 1]))
assert np.allclose(dissoc_equil(0, 1, 2), np.array([0, 1, 1]))
assert np.allclose(dissoc_equil(0, 2, 1), np.array([1, 0, 1]))
return None
test_dissoc_equil()
Very good! Now, let's try the other extreme, where $K_\mathrm{d}$ is infinity. Infinity is represented by np.inf
.
def test_dissoc_equil():
"""Tests for dissoc_equil()."""
assert np.allclose(dissoc_equil(1, 0, 0), np.array([0, 0, 0]))
assert np.allclose(dissoc_equil(0, 0, 0), np.array([0, 0, 0]))
assert np.allclose(dissoc_equil(0, 1, 1), np.array([0, 0, 1]))
assert np.allclose(dissoc_equil(0, 1, 2), np.array([0, 1, 1]))
assert np.allclose(dissoc_equil(0, 2, 1), np.array([1, 0, 1]))
assert np.allclose(dissoc_equil(np.inf, 1, 1), np.array([1, 1, 0]))
return None
test_dissoc_equil()
Yikes! Failure. What happened?
dissoc_equil(np.inf, 1, 1)
The result is all NaN
s. This is because we are asking the function to subtract infinity from infinity. However, we know the result when $K_\mathrm{d}$ is infinite, and we can put in logic to check for that.
def dissoc_equil(Kd, ca0, cb0):
"""Compute equilibrium for dissociation reaction."""
# If we have infinite Kd
if Kd == np.inf:
return ca0, cb0, 0
# Compute cab from quadratic formula
b = ca0 + cb0 + Kd
cab = (b - np.sqrt(b**2 - 4*ca0*cb0)) / 2
# Compute ca and cb from conservation of mass
ca = ca0 - cab
cb = cb0 - cab
return ca, cb, cab
Now, let's re-run the tests.
test_dissoc_equil()
Better! Now that we have come across this issue, we should have error checks to make sure the input is legit. E.g., all nonnegative, and $c_\mathrm{a}^0$ and $c_\mathrm{b}^0$ finite. We'll make tests and fail.
def test_dissoc_equil():
"""Tests for dissoc_equil()."""
assert np.allclose(dissoc_equil(1, 0, 0), np.array([0, 0, 0]))
assert np.allclose(dissoc_equil(0, 0, 0), np.array([0, 0, 0]))
assert np.allclose(dissoc_equil(0, 1, 1), np.array([0, 0, 1]))
assert np.allclose(dissoc_equil(0, 1, 2), np.array([0, 1, 1]))
assert np.allclose(dissoc_equil(0, 2, 1), np.array([1, 0, 1]))
assert np.allclose(dissoc_equil(np.inf, 1, 1), np.array([1, 1, 0]))
pytest.raises(RuntimeError, "dissoc_equil(-1, 1, 1)")
pytest.raises(RuntimeError, "dissoc_equil(1, -1, 1)")
pytest.raises(RuntimeError, "dissoc_equil(1, 1, -1)")
pytest.raises(RuntimeError, "dissoc_equil(1, np.inf, 1)")
pytest.raises(RuntimeError, "dissoc_equil(1, 1, np.inf)")
return None
test_dissoc_equil()
And now we can adjust the function.
def dissoc_equil(Kd, ca0, cb0):
"""Compute equilibrium for dissociation reaction."""
# Check input
if Kd < 0 or ca0 < 0 or cb0 < 0:
raise RuntimeError('All input must be nonnegative.')
if not (ca0 < np.inf and cb0 < np.inf):
raise RuntimeError('Input concentrations must be finite.')
# If we have infinite Kd
if Kd == np.inf:
return ca0, cb0, 0
# Compute cab from quadratic formula
b = ca0 + cb0 + Kd
cab = (b - np.sqrt(b**2 - 4*ca0*cb0)) / 2
# Compute ca and cb from conservation of mass
ca = ca0 - cab
cb = cb0 - cab
return ca, cb, cab
test_dissoc_equil()
Passage of all! We have caught many of the edge and corner cases. Now, to check every day calculations, we can just run the calculation and then verify that the equilibrium and conservation of mass expressions hold. We'll first write a function to check that.
def check_eq(Kd, ca0, cb0, ca, cb, cab):
"""Verify equilibrium expressions hold."""
eq = np.isclose(Kd, ca * cb / cab)
cons_mass_A = np.isclose(ca0, ca + cab)
cons_mass_B = np.isclose(cb0, cb + cab)
return eq and cons_mass_A and cons_mass_B
Now, we can incorporate this into our tests.
def test_dissoc_equil():
"""Tests for dissoc_equil()."""
# Edge cases
assert np.allclose(dissoc_equil(1, 0, 0), np.array([0, 0, 0]))
assert np.allclose(dissoc_equil(0, 0, 0), np.array([0, 0, 0]))
assert np.allclose(dissoc_equil(0, 1, 1), np.array([0, 0, 1]))
assert np.allclose(dissoc_equil(0, 1, 2), np.array([0, 1, 1]))
assert np.allclose(dissoc_equil(0, 2, 1), np.array([1, 0, 1]))
assert np.allclose(dissoc_equil(np.inf, 1, 1), np.array([1, 1, 0]))
# Standard cases
assert check_eq(4.5, 0.1, 3.2, *dissoc_equil(4.5, 0.1, 3.2))
assert check_eq(0.14, 0.00011, 0.003, *dissoc_equil(0.14, 0.00011, 0.003))
# Errors
pytest.raises(RuntimeError, "dissoc_equil(-1, 1, 1)")
pytest.raises(RuntimeError, "dissoc_equil(1, -1, 1)")
pytest.raises(RuntimeError, "dissoc_equil(1, 1, -1)")
pytest.raises(RuntimeError, "dissoc_equil(1, np.inf, 1)")
pytest.raises(RuntimeError, "dissoc_equil(1, 1, np.inf)")
return None
test_dissoc_equil()
Looks good! To be more thorough with out tests, we should test over a logarithmic range of dissociation constants in initial concentrations.
def test_dissoc_equil():
"""Tests for dissoc_equil()."""
# Edge cases
assert np.allclose(dissoc_equil(1, 0, 0), np.array([0, 0, 0]))
assert np.allclose(dissoc_equil(0, 0, 0), np.array([0, 0, 0]))
assert np.allclose(dissoc_equil(0, 1, 1), np.array([0, 0, 1]))
assert np.allclose(dissoc_equil(0, 1, 2), np.array([0, 1, 1]))
assert np.allclose(dissoc_equil(0, 2, 1), np.array([1, 0, 1]))
assert np.allclose(dissoc_equil(np.inf, 1, 1), np.array([1, 1, 0]))
# Standard cases
Kd_vals = np.logspace(-10, 1, 50)
ca0_vals = np.logspace(-5, 2, 50)
cb0_vals = np.logspace(-5, 2, 50)
for Kd in Kd_vals:
for ca0 in ca0_vals:
for cb0 in cb0_vals:
assert check_eq(Kd, ca0, cb0, *dissoc_equil(Kd, ca0, cb0)), \
'Kd = %g, ca0 = %g, cb0 = %g' % (Kd, ca0, cb0)
# Errors
pytest.raises(RuntimeError, "dissoc_equil(-1, 1, 1)")
pytest.raises(RuntimeError, "dissoc_equil(1, -1, 1)")
pytest.raises(RuntimeError, "dissoc_equil(1, 1, -1)")
pytest.raises(RuntimeError, "dissoc_equil(1, np.inf, 1)")
pytest.raises(RuntimeError, "dissoc_equil(1, 1, np.inf)")
return None
test_dissoc_equil()
We get an error when the difference in the total concentrations is too large. This highlights a numerical stability issue. When this is the case, we should find a more numerically stable way to compute the result. This is not a trivial matter, and something that requires careful thought. We will not develop a more numerically stable method here, but rather stop here having effectively shown how careful tests can reveal (possibly unexpected) weaknesses in otherwise simple code.