Lesson 36: Practice TDD solutions

(c) 2016 Justin Bois and Shyam Saladi. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This exercise was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import re
import pytest

import numpy as np
import pandas as pd

Practice 1: Identifying codons within nucleotide sequences

In Lesson 6, we discussed iteration; now that you're masters of the for loop, it probably seems like years ago. As part of the lesson, we developed code to search for a codon (a three base nucleotide sequence) with a provided sequence using a while loop. It's wrapped into a function below:

In [2]:
def find_codon_lesson6(codon, seq):
    """Find a specified codon with a given sequence.""" 
    
    i = 0
    # Scan sequence until we hit the start codon or the end of the sequence
    while seq[i:i+3] != codon and i < len(seq):
        i += 1
    
    if i == len(seq):
        return -1
    
    return i

Let's use the the coding and protein sequence for an isoform of the fruit fly (D. melanogaster)'s synaptobrevin, a super important part of the SNARE complex in cells! See here for more detail. These sequences are directly from the NIH's National Center for Biotechnology Information, the largest repository of biological information around.

http://www.ncbi.nlm.nih.gov/nuccore/665400015?from=147&to=476&report=fasta

>gi|665400015:147-476 Drosophila melanogaster synaptobrevin, transcript variant D (Syb), mRNA
ATGGAGAACAACGAAGCCCCCTCCCCCTCGGGATCCAACAACAACGAGAACAACAATGCAGCCCAGAAGA
AGCTGCAGCAGACCCAAGCCAAGGTGGACGAGGTGGTCGGGATTATGCGTGTGAACGTGGAGAAGGTCCT
GGAGCGGGACCAGAAGCTATCGGAACTGGGCGAGCGTGCGGATCAGCTGGAGCAGGGAGCATCCCAGTTC
GAGCAGCAGGCCGGCAAGCTGAAGCGCAAGCAATGGTGGGCCAACATGAAGATGATGATCATTCTGGGCG
TGATAGCCGTTGTGCTGCTCATCATCGTTCTGGTGTCGCTTTTCAATTGA

http://www.ncbi.nlm.nih.gov/protein/221330128?report=fasta

>gi|221330128|ref|NP_001137634.1| synaptobrevin, isoform D [Drosophila melanogaster]
MENNEAPSPSGSNNNENNNAAQKKLQQTQAKVDEVVGIMRVNVEKVLERDQKLSELGERADQLEQGASQF
EQQAGKLKRKQWWANMKMMIILGVIAVVLLIIVLVSLFN

a) Does find_codon_lesson6() work as you would like it to? To answer this question think about the following questions:

  1. Where does it find ATG in the nucleotide sequence provided above? Is the result correct?
  2. How about AAT?
  3. How about the codons TGT and TGT? What residue do these last two codons encode (for your convenience, standard genetic code)?
  4. Does this residue exist in the protein sequence above?

If you'd like, go ahead and use another tool to translate the nucleotide sequence to confirm yourself.

b) Unfortunately, at this point, you will have found a bug! Formalize this test by writing a new function test_find_codon(find_codon). It takes a single argument the function that is being tested. Write assert statements to check if find_codon returns the expected value for the 4 codons discussed above (ATG, AAT, TGT, TGC).

Remember, assert statements will raise an exception if they evaluate to False.

c) Code up a new function to find a specified codon within a given sequence by revising find_codon_lesson6. Of course, you should test it with test_find_codon() after it's written.

Practice 1: solution

a)

In [3]:
synapto_nuc = (
    "ATGGAGAACAACGAAGCCCCCTCCCCCTCGGGATCCAACAACAACGAGAACAACAATGCAGCCCAGAAGA"
    "AGCTGCAGCAGACCCAAGCCAAGGTGGACGAGGTGGTCGGGATTATGCGTGTGAACGTGGAGAAGGTCCT"
    "GGAGCGGGACCAGAAGCTATCGGAACTGGGCGAGCGTGCGGATCAGCTGGAGCAGGGAGCATCCCAGTTC"
    "GAGCAGCAGGCCGGCAAGCTGAAGCGCAAGCAATGGTGGGCCAACATGAAGATGATGATCATTCTGGGCG"
    "TGATAGCCGTTGTGCTGCTCATCATCGTTCTGGTGTCGCTTTTCAATTGA")

synapto_prot = (
    "MENNEAPSPSGSNNNENNNAAQKKLQQTQAKVDEVVGIMRVNVEKVLERDQKLSELGERADQLEQGASQF"
    "EQQAGKLKRKQWWANMKMMIILGVIAVVLLIIVLVSLFN")


print('ATG :', find_codon_lesson6('ATG', synapto_nuc))
print('AAT :', find_codon_lesson6('AAT', synapto_nuc))
print('TGT :', find_codon_lesson6('TGT', synapto_nuc))
print('TGC :', find_codon_lesson6('TGC', synapto_nuc))
ATG : 0
AAT : 54
TGT : 119
TGC : 56

To a first approximation, all of these results look right. The following is an excerpt of the standard genetic code:

Codon Amino acid
ATG M
AAT N
TGT C
TGC C

Let's check if these actually exist in the protein sequence:

In [4]:
print("M : ", synapto_prot.find("M"))
print("N : ", synapto_prot.find("N"))
print("C : ", synapto_prot.find("C"))
M :  0
N :  2
C :  -1

No C in the protein sequence! What happened? Let's separate the nucleotide sequence into codons to check. We can check this with a for loop through the sequence, or we can use the wonderful magic of regular expressions (covered last year).

In [5]:
all_codons = re.findall('...', synapto_nuc)
print('ATG :', 'ATG' in all_codons)
print('AAT :', 'AAT' in all_codons)
print('TGT :', 'TGT' in all_codons)
print('TGC :', 'TGC' in all_codons)
ATG : True
AAT : True
TGT : False
TGC : False

b) We'll write the function with the synaptobrevin sequence as the test and run the test.

In [6]:
def test_find_codon(find_codon):
    """
    A function to test another function that looks for a codon 
    within a coding sequence
    """
    synapto_nuc = (
        "ATGGAGAACAACGAAGCCCCCTCCCCCTCGGGATCCAACAACAACGAGAACAACAATGCAGCCCAGAAGA"
        "AGCTGCAGCAGACCCAAGCCAAGGTGGACGAGGTGGTCGGGATTATGCGTGTGAACGTGGAGAAGGTCCT"
        "GGAGCGGGACCAGAAGCTATCGGAACTGGGCGAGCGTGCGGATCAGCTGGAGCAGGGAGCATCCCAGTTC"
        "GAGCAGCAGGCCGGCAAGCTGAAGCGCAAGCAATGGTGGGCCAACATGAAGATGATGATCATTCTGGGCG"
        "TGATAGCCGTTGTGCTGCTCATCATCGTTCTGGTGTCGCTTTTCAATTGA")
    
    assert find_codon('ATG', synapto_nuc) == 0
    assert find_codon('AAT', synapto_nuc) == 54
    assert find_codon('TGT', synapto_nuc) == -1
    assert find_codon('TGC', synapto_nuc) == -1
    
    return None

test_find_codon(find_codon_lesson6)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-6-3c5b52ad198c> in <module>()
     18     return None
     19 
---> 20 test_find_codon(find_codon_lesson6)

<ipython-input-6-3c5b52ad198c> in test_find_codon(find_codon)
     13     assert find_codon('ATG', synapto_nuc) == 0
     14     assert find_codon('AAT', synapto_nuc) == 54
---> 15     assert find_codon('TGT', synapto_nuc) == -1
     16     assert find_codon('TGC', synapto_nuc) == -1
     17 

AssertionError: 

We get failure, as we'd expect. We now have a good test for a new function we would write.

c) We can now code up a new function to make sure that we are only looking at in register codons after the start codon.

In [7]:
def find_codon_new(codon, seq):
    """Find a specified codon with a given sequence
    """ 
    i = 0
    # Scan sequence until we hit the start codon or the end of the sequence
    while seq[i:i+3] != codon and i < len(seq):
        i += 3
    
    if i == len(seq):
        return -1
    
    return i

test_find_codon(find_codon_new)

Hurray! It passed! We would of course want to write many more tests to make sure that the function is working properly, including edge cases. Edge cases are cases where inputs take extreme values (such an empty string for a sequence or something with no start codon).

Practice 2: A chemical equilibrium calculator

In biochemistry, we often measure dissociation constants ($K_\mathrm{d}$) using various methods, usually involving titration. A dissociation constant is the equilibrium constant for the chemical reaction

\begin{align} \mathrm{ab} \rightleftharpoons \mathrm{a} + \mathrm{b}. \end{align}

It is related to the equilibrium concentration of the species a, b, and ab by

\begin{align} K_\mathrm{d} = \frac{c_\mathrm{a}\,c_\mathrm{b}}{c_{\mathrm{ab}}}. \end{align}

To analyze a titration curve, we often have to compute the equilibrium concentrations of the respective species given the parameters $K_\mathrm{d}$, $c_\mathrm{a}^0$, and $c_\mathrm{b}^0$, where the latter two parameters are the total concentrations of species a and b, respectively. This is typically known, as this is what the experimenter pipetted into the reaction solution.

As you probably worked out in your undergraduate general chemistry course, the respective equilibrium concentrations can be solved to be

\begin{align} c_\mathrm{ab} &= \frac{1}{2}\left(c_\mathrm{a}^0 + c_\mathrm{b}^0 + K_\mathrm{d} - \sqrt{\left(c_\mathrm{a}^0 + c_\mathrm{b}^0 + K_\mathrm{d}\right)^2 - 4c_\mathrm{a}^0 c_\mathrm{b}^0}\right), \\[1em] c_\mathrm{a} &= c_\mathrm{a}^0 - c_\mathrm{ab} \\[1em] c_\mathrm{b} &= c_\mathrm{b}^0 - c_\mathrm{ab} \end{align}

Your job is to write a function that computes the three concentrations, $c_\mathrm{a}$, $c_\mathrm{b}$, and $c_\mathrm{ab}$ given inputs $K_\mathrm{d}$, $c_\mathrm{a}^0$, and $c_\mathrm{b}^0$. Use the principles of TDD to develop it from the start. That is, define the call signature, write your first test, and make it fail. Then incrementally build your tests with your function.

There is one thing interesting about the way you define your tests for this function. You can check that the solution satisfies the equations describing chemical equilibrium and conservation of mass.

\begin{align} K_\mathrm{d} &= \frac{c_\mathrm{a}\,c_\mathrm{b}}{c_{\mathrm{ab}}}, \\[1em] c_\mathrm{a}^0 &= c_\mathrm{a} + c_\mathrm{ab}, \\[1em] c_\mathrm{b}^0 &= c_\mathrm{b} + c_\mathrm{ab}. \end{align}

Sometimes you have this luxury, a mathematical statement that must hold for the output of your function. This makes writing tests a bit easier and more intuitive. As a hint, for checking closeness NumPy arrays, use the np.allclose() function.

Practice 2: solution

We first specify the function call signature and write our first test. The trivial example is when all entries are zero. We make it fail for now.

In [8]:
def test_dissoc_equil():
    """Tests for dissoc_equil()."""
    
    assert np.isclose(dissoc_equil(1, 0, 0), np.array([0, 0, 0])).all()
    
    return None
    
def dissoc_equil(Kd, ca0, cb0):
    """Compute equilibrium for dissociation reaction."""
    return np.array([1, 1, 1])

# Make it fail
test_dissoc_equil()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-8-62fa2af876c9> in <module>()
     11 
     12 # Make it fail
---> 13 test_dissoc_equil()

<ipython-input-8-62fa2af876c9> in test_dissoc_equil()
      2     """Tests for dissoc_equil()."""
      3 
----> 4     assert np.isclose(dissoc_equil(1, 0, 0), np.array([0, 0, 0])).all()
      5 
      6     return None

AssertionError: 

Now, we'll put together the function to return something useful.

In [9]:
def dissoc_equil(Kd, ca0, cb0):
    """Compute equilibrium for dissociation reaction."""
    # Compute cab from quadratic formula
    b = ca0 + cb0 + Kd
    cab = (b - np.sqrt(b**2 - 4*ca0*cb0)) / 2

    # Compute ca and cb from conservation of mass
    ca = ca0 - cab
    cb = cb0 - cab
    
    return ca, cb, cab

Now, let's try our test again.

In [10]:
test_dissoc_equil()

It passed! Let's try writing some more test. Let's try another trivial one where every input is zero.

In [11]:
def test_dissoc_equil():
    """Tests for dissoc_equil()."""
    
    assert np.allclose(dissoc_equil(1, 0, 0), np.array([0, 0, 0]))
    assert np.allclose(dissoc_equil(0, 0, 0), np.array([0, 0, 0]))
    
    return None
    
test_dissoc_equil()

Now, how about with $K_\mathrm{d} = 0$ and nonzero concentrations?

In [12]:
def test_dissoc_equil():
    """Tests for dissoc_equil()."""
    
    assert np.allclose(dissoc_equil(1, 0, 0), np.array([0, 0, 0]))
    assert np.allclose(dissoc_equil(0, 0, 0), np.array([0, 0, 0]))
    assert np.allclose(dissoc_equil(0, 1, 1), np.array([0, 0, 1]))
    assert np.allclose(dissoc_equil(0, 1, 2), np.array([0, 1, 1]))
    assert np.allclose(dissoc_equil(0, 2, 1), np.array([1, 0, 1]))

    return None
    
test_dissoc_equil()

Very good! Now, let's try the other extreme, where $K_\mathrm{d}$ is infinity. Infinity is represented by np.inf.

In [13]:
def test_dissoc_equil():
    """Tests for dissoc_equil()."""
    
    assert np.allclose(dissoc_equil(1, 0, 0), np.array([0, 0, 0]))
    assert np.allclose(dissoc_equil(0, 0, 0), np.array([0, 0, 0]))
    assert np.allclose(dissoc_equil(0, 1, 1), np.array([0, 0, 1]))
    assert np.allclose(dissoc_equil(0, 1, 2), np.array([0, 1, 1]))
    assert np.allclose(dissoc_equil(0, 2, 1), np.array([1, 0, 1]))
    assert np.allclose(dissoc_equil(np.inf, 1, 1), np.array([1, 1, 0]))

    return None

test_dissoc_equil()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-13-31f1a7019231> in <module>()
     11     return None
     12 
---> 13 test_dissoc_equil()

<ipython-input-13-31f1a7019231> in test_dissoc_equil()
      7     assert np.allclose(dissoc_equil(0, 1, 2), np.array([0, 1, 1]))
      8     assert np.allclose(dissoc_equil(0, 2, 1), np.array([1, 0, 1]))
----> 9     assert np.allclose(dissoc_equil(np.inf, 1, 1), np.array([1, 1, 0]))
     10 
     11     return None

AssertionError: 

Yikes! Failure. What happened?

In [14]:
dissoc_equil(np.inf, 1, 1)
Out[14]:
(nan, nan, nan)

The result is all NaNs. This is because we are asking the function to subtract infinity from infinity. However, we know the result when $K_\mathrm{d}$ is infinite, and we can put in logic to check for that.

In [15]:
def dissoc_equil(Kd, ca0, cb0):
    """Compute equilibrium for dissociation reaction."""
    # If we have infinite Kd
    if Kd == np.inf:
        return ca0, cb0, 0
    
    # Compute cab from quadratic formula
    b = ca0 + cb0 + Kd
    cab = (b - np.sqrt(b**2 - 4*ca0*cb0)) / 2

    # Compute ca and cb from conservation of mass
    ca = ca0 - cab
    cb = cb0 - cab
    
    return ca, cb, cab

Now, let's re-run the tests.

In [16]:
test_dissoc_equil()

Better! Now that we have come across this issue, we should have error checks to make sure the input is legit. E.g., all nonnegative, and $c_\mathrm{a}^0$ and $c_\mathrm{b}^0$ finite. We'll make tests and fail.

In [17]:
def test_dissoc_equil():
    """Tests for dissoc_equil()."""
    
    assert np.allclose(dissoc_equil(1, 0, 0), np.array([0, 0, 0]))
    assert np.allclose(dissoc_equil(0, 0, 0), np.array([0, 0, 0]))
    assert np.allclose(dissoc_equil(0, 1, 1), np.array([0, 0, 1]))
    assert np.allclose(dissoc_equil(0, 1, 2), np.array([0, 1, 1]))
    assert np.allclose(dissoc_equil(0, 2, 1), np.array([1, 0, 1]))
    assert np.allclose(dissoc_equil(np.inf, 1, 1), np.array([1, 1, 0]))

    pytest.raises(RuntimeError, "dissoc_equil(-1, 1, 1)")
    pytest.raises(RuntimeError, "dissoc_equil(1, -1, 1)")
    pytest.raises(RuntimeError, "dissoc_equil(1, 1, -1)")
    pytest.raises(RuntimeError, "dissoc_equil(1, np.inf, 1)")
    pytest.raises(RuntimeError, "dissoc_equil(1, 1, np.inf)")
    
    return None

test_dissoc_equil()
---------------------------------------------------------------------------
Failed                                    Traceback (most recent call last)
<ipython-input-17-9cf14cf10265> in <module>()
     17     return None
     18 
---> 19 test_dissoc_equil()

<ipython-input-17-9cf14cf10265> in test_dissoc_equil()
      9     assert np.allclose(dissoc_equil(np.inf, 1, 1), np.array([1, 1, 0]))
     10 
---> 11     pytest.raises(RuntimeError, "dissoc_equil(-1, 1, 1)")
     12     pytest.raises(RuntimeError, "dissoc_equil(1, -1, 1)")
     13     pytest.raises(RuntimeError, "dissoc_equil(1, 1, -1)")

/Users/Justin/anaconda/lib/python3.5/site-packages/_pytest/python.py in raises(expected_exception, *args, **kwargs)
   1319         except expected_exception:
   1320             return _pytest._code.ExceptionInfo()
-> 1321     pytest.fail("DID NOT RAISE {0}".format(expected_exception))
   1322 
   1323 class RaisesContext(object):

/Users/Justin/anaconda/lib/python3.5/site-packages/_pytest/runner.py in fail(msg, pytrace)
    484     """
    485     __tracebackhide__ = True
--> 486     raise Failed(msg=msg, pytrace=pytrace)
    487 fail.Exception = Failed
    488 

Failed: DID NOT RAISE <class 'RuntimeError'>

And now we can adjust the function.

In [18]:
def dissoc_equil(Kd, ca0, cb0):
    """Compute equilibrium for dissociation reaction."""
    
    # Check input
    if Kd < 0 or ca0 < 0 or cb0 < 0:
        raise RuntimeError('All input must be nonnegative.')
    if not (ca0 < np.inf and cb0 < np.inf):
        raise RuntimeError('Input concentrations must be finite.')
    
    # If we have infinite Kd
    if Kd == np.inf:
        return ca0, cb0, 0
    
    # Compute cab from quadratic formula
    b = ca0 + cb0 + Kd
    cab = (b - np.sqrt(b**2 - 4*ca0*cb0)) / 2

    # Compute ca and cb from conservation of mass
    ca = ca0 - cab
    cb = cb0 - cab
    
    return ca, cb, cab


test_dissoc_equil()

Passage of all! We have caught many of the edge and corner cases. Now, to check every day calculations, we can just run the calculation and then verify that the equilibrium and conservation of mass expressions hold. We'll first write a function to check that.

In [19]:
def check_eq(Kd, ca0, cb0, ca, cb, cab):
    """Verify equilibrium expressions hold."""
    eq = np.isclose(Kd, ca * cb / cab)
    cons_mass_A = np.isclose(ca0, ca + cab)
    cons_mass_B = np.isclose(cb0, cb + cab)
    
    return eq and cons_mass_A and cons_mass_B

Now, we can incorporate this into our tests.

In [20]:
def test_dissoc_equil():
    """Tests for dissoc_equil()."""
    
    # Edge cases
    assert np.allclose(dissoc_equil(1, 0, 0), np.array([0, 0, 0]))
    assert np.allclose(dissoc_equil(0, 0, 0), np.array([0, 0, 0]))
    assert np.allclose(dissoc_equil(0, 1, 1), np.array([0, 0, 1]))
    assert np.allclose(dissoc_equil(0, 1, 2), np.array([0, 1, 1]))
    assert np.allclose(dissoc_equil(0, 2, 1), np.array([1, 0, 1]))
    assert np.allclose(dissoc_equil(np.inf, 1, 1), np.array([1, 1, 0]))
    
    # Standard cases
    assert check_eq(4.5, 0.1, 3.2, *dissoc_equil(4.5, 0.1, 3.2))
    assert check_eq(0.14, 0.00011, 0.003, *dissoc_equil(0.14, 0.00011, 0.003))
    
    # Errors
    pytest.raises(RuntimeError, "dissoc_equil(-1, 1, 1)")
    pytest.raises(RuntimeError, "dissoc_equil(1, -1, 1)")
    pytest.raises(RuntimeError, "dissoc_equil(1, 1, -1)")
    pytest.raises(RuntimeError, "dissoc_equil(1, np.inf, 1)")
    pytest.raises(RuntimeError, "dissoc_equil(1, 1, np.inf)")
    
    return None

test_dissoc_equil()

Looks good! To be more thorough with out tests, we should test over a logarithmic range of dissociation constants in initial concentrations.

In [22]:
def test_dissoc_equil():
    """Tests for dissoc_equil()."""
    
    # Edge cases
    assert np.allclose(dissoc_equil(1, 0, 0), np.array([0, 0, 0]))
    assert np.allclose(dissoc_equil(0, 0, 0), np.array([0, 0, 0]))
    assert np.allclose(dissoc_equil(0, 1, 1), np.array([0, 0, 1]))
    assert np.allclose(dissoc_equil(0, 1, 2), np.array([0, 1, 1]))
    assert np.allclose(dissoc_equil(0, 2, 1), np.array([1, 0, 1]))
    assert np.allclose(dissoc_equil(np.inf, 1, 1), np.array([1, 1, 0]))
    
    # Standard cases
    Kd_vals = np.logspace(-10, 1, 50)
    ca0_vals = np.logspace(-5, 2, 50)
    cb0_vals = np.logspace(-5, 2, 50)
    for Kd in Kd_vals:
        for ca0 in ca0_vals:
            for cb0 in cb0_vals:
                assert check_eq(Kd, ca0, cb0, *dissoc_equil(Kd, ca0, cb0)), \
                    'Kd = %g, ca0 = %g, cb0 = %g' % (Kd, ca0, cb0)
    
    # Errors
    pytest.raises(RuntimeError, "dissoc_equil(-1, 1, 1)")
    pytest.raises(RuntimeError, "dissoc_equil(1, -1, 1)")
    pytest.raises(RuntimeError, "dissoc_equil(1, 1, -1)")
    pytest.raises(RuntimeError, "dissoc_equil(1, np.inf, 1)")
    pytest.raises(RuntimeError, "dissoc_equil(1, 1, np.inf)")
    
    return None
    
test_dissoc_equil()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-22-e325b23511c7> in <module>()
     28     return None
     29 
---> 30 test_dissoc_equil()

<ipython-input-22-e325b23511c7> in test_dissoc_equil()
     17         for ca0 in ca0_vals:
     18             for cb0 in cb0_vals:
---> 19                 assert check_eq(Kd, ca0, cb0, *dissoc_equil(Kd, ca0, cb0)),                     'Kd = %g, ca0 = %g, cb0 = %g' % (Kd, ca0, cb0)
     20 
     21     # Errors

AssertionError: Kd = 1e-10, ca0 = 1e-05, cb0 = 71.9686

We get an error when the difference in the total concentrations is too large. This highlights a numerical stability issue. When this is the case, we should find a more numerically stable way to compute the result. This is not a trivial matter, and something that requires careful thought. We will not develop a more numerically stable method here, but rather stop here having effectively shown how careful tests can reveal (possibly unexpected) weaknesses in otherwise simple code.