Lesson 36: Practice with TDD

(c) 2017 Justin Bois and Shyam Saladi. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This lesson was generated from a Jupyter notebook. You can download the notebook here.

Practice 1: Identifying codons within nucleotide sequences

In Lesson 6, we discussed iteration; now that you're masters of the for loop, it probably seems like years ago. As part of the lesson, we developed code to search for a codon (a three base nucleotide sequence) with a provided sequence using a while loop. It's wrapped into a function below:

In [3]:
def find_codon_lesson6(codon, seq):
    """Find a specified codon with a given sequence.""" 
    
    i = 0
    # Scan sequence until we hit the start codon or the end of the sequence
    while seq[i:i+3] != codon and i < len(seq):
        i += 1
    
    if i == len(seq):
        return -1
    
    return i

Let's use the DNA and protein sequence for an isoform of D. melanogaster's synaptobrevin, a very important part of the SNARE complex in cells. See here for more detail. These sequences are directly from the NIH's National Center for Biotechnology Information, the largest repository of biological information around.

http://www.ncbi.nlm.nih.gov/nuccore/665400015?from=147&to=476&report=fasta

>gi|665400015:147-476 Drosophila melanogaster synaptobrevin, transcript variant D (Syb), mRNA
ATGGAGAACAACGAAGCCCCCTCCCCCTCGGGATCCAACAACAACGAGAACAACAATGCAGCCCAGAAGA
AGCTGCAGCAGACCCAAGCCAAGGTGGACGAGGTGGTCGGGATTATGCGTGTGAACGTGGAGAAGGTCCT
GGAGCGGGACCAGAAGCTATCGGAACTGGGCGAGCGTGCGGATCAGCTGGAGCAGGGAGCATCCCAGTTC
GAGCAGCAGGCCGGCAAGCTGAAGCGCAAGCAATGGTGGGCCAACATGAAGATGATGATCATTCTGGGCG
TGATAGCCGTTGTGCTGCTCATCATCGTTCTGGTGTCGCTTTTCAATTGA

http://www.ncbi.nlm.nih.gov/protein/221330128?report=fasta

>gi|221330128|ref|NP_001137634.1| synaptobrevin, isoform D [Drosophila melanogaster]
MENNEAPSPSGSNNNENNNAAQKKLQQTQAKVDEVVGIMRVNVEKVLERDQKLSELGERADQLEQGASQF
EQQAGKLKRKQWWANMKMMIILGVIAVVLLIIVLVSLFN

a) Does find_codon_lesson6() work as you would like it to? To answer this question think about the following questions:

  1. Where does it find ATG in the nucleotide sequence provided above? Is the result correct?
  2. How about AAT?
  3. How about the codons TGT and TGC? What residue do these last two codons encode (for your convenience here is a link to the standard genetic code)?
  4. Does this residue exist in the protein sequence above?

If you'd like, go ahead and use another tool to translate the nucleotide sequence to confirm yourself.

b) Unfortunately, at this point, you will have found a bug! Formalize this test by writing a new function test_find_codon(find_codon). It takes a single argument the function that is being tested. Write assert statements to check if find_codon returns the expected value for the 4 codons discussed above (ATG, AAT, TGT, TGC).

Remember, assert statements will raise an exception if they evaluate to False.

c) Code up a new function to find a specified codon within a given sequence by revising find_codon_lesson6. Of course, you should test it with test_find_codon() after it's written.


Practice 2: A chemical equilibrium calculator

In biochemistry, we often measure dissociation constants ($K_\mathrm{d}$) using various methods, usually involving titration. A dissociation constant is the equilibrium constant for the chemical reaction

\begin{align} \mathrm{ab} \rightleftharpoons \mathrm{a} + \mathrm{b}. \end{align}

It is related to the equilibrium concentration of the species a, b, and ab by

\begin{align} K_\mathrm{d} = \frac{c_\mathrm{a}\,c_\mathrm{b}}{c_{\mathrm{ab}}}. \end{align}

To analyze a titration curve, we often have to compute the equilibrium concentrations of the respective species given the parameters $K_\mathrm{d}$, $c_\mathrm{a}^0$, and $c_\mathrm{b}^0$, where the latter two parameters are the total concentrations of species a and b, respectively. $c_\mathrm{a}^0$ and $c_\mathrm{b}^0$ are typically known, as this is what the experimenter pipetted into the reaction solution.

As you probably worked out in your undergraduate general chemistry course, the respective equilibrium concentrations can be solved to be

\begin{align} c_\mathrm{ab} &= \frac{1}{2}\left(c_\mathrm{a}^0 + c_\mathrm{b}^0 + K_\mathrm{d} - \sqrt{\left(c_\mathrm{a}^0 + c_\mathrm{b}^0 + K_\mathrm{d}\right)^2 - 4c_\mathrm{a}^0 c_\mathrm{b}^0}\right), \\[1em] c_\mathrm{a} &= c_\mathrm{a}^0 - c_\mathrm{ab} \\[1em] c_\mathrm{b} &= c_\mathrm{b}^0 - c_\mathrm{ab} \end{align}

Your job is to write a function that computes the three concentrations, $c_\mathrm{a}$, $c_\mathrm{b}$, and $c_\mathrm{ab}$ given inputs $K_\mathrm{d}$, $c_\mathrm{a}^0$, and $c_\mathrm{b}^0$. Use the principles of TDD to develop it from the start. That is, define the call signature, write your first test, and make it fail. Then incrementally build your tests for your function.

There is one thing interesting about the way you define your tests for this function. You can check that the solution satisfies the equations describing chemical equilibrium and conservation of mass.

\begin{align} K_\mathrm{d} &= \frac{c_\mathrm{a}\,c_\mathrm{b}}{c_{\mathrm{ab}}}, \\[1em] c_\mathrm{a}^0 &= c_\mathrm{a} + c_\mathrm{ab}, \\[1em] c_\mathrm{b}^0 &= c_\mathrm{b} + c_\mathrm{ab}. \end{align}

Sometimes you have this luxury, a mathematical statement that must hold for the output of your function. This makes writing tests a bit easier and more intuitive. As a hint, for checking closeness NumPy arrays, use the np.allclose() function.