Lesson 36: Practice TDD

(c) 2016 Justin Bois and Shyam Saladi. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This exercise was generated from a Jupyter notebook. You can download the notebook here.

Practice 1: Identifying codons within nucleotide sequences

In Lesson 6, we discussed iteration; now that you're masters of the for loop, it probably seems like years ago. As part of the lesson, we developed code to search for a codon (a three base nucleotide sequence) with a provided sequence using a while loop. It's wrapped into a function below:

In [2]:
def find_codon_lesson6(codon, seq):
    """Find a specified codon with a given sequence.""" 
    
    i = 0
    # Scan sequence until we hit the start codon or the end of the sequence
    while seq[i:i+3] != codon and i < len(seq):
        i += 1
    
    if i == len(seq):
        return -1
    
    return i

Let's use the the coding and protein sequence for an isoform of the fruit fly (D. melanogaster)'s synaptobrevin, a super important part of the SNARE complex in cells! See here for more detail. These sequences are directly from the NIH's National Center for Biotechnology Information, the largest repository of biological information around.

http://www.ncbi.nlm.nih.gov/nuccore/665400015?from=147&to=476&report=fasta

>gi|665400015:147-476 Drosophila melanogaster synaptobrevin, transcript variant D (Syb), mRNA
ATGGAGAACAACGAAGCCCCCTCCCCCTCGGGATCCAACAACAACGAGAACAACAATGCAGCCCAGAAGA
AGCTGCAGCAGACCCAAGCCAAGGTGGACGAGGTGGTCGGGATTATGCGTGTGAACGTGGAGAAGGTCCT
GGAGCGGGACCAGAAGCTATCGGAACTGGGCGAGCGTGCGGATCAGCTGGAGCAGGGAGCATCCCAGTTC
GAGCAGCAGGCCGGCAAGCTGAAGCGCAAGCAATGGTGGGCCAACATGAAGATGATGATCATTCTGGGCG
TGATAGCCGTTGTGCTGCTCATCATCGTTCTGGTGTCGCTTTTCAATTGA

http://www.ncbi.nlm.nih.gov/protein/221330128?report=fasta

>gi|221330128|ref|NP_001137634.1| synaptobrevin, isoform D [Drosophila melanogaster]
MENNEAPSPSGSNNNENNNAAQKKLQQTQAKVDEVVGIMRVNVEKVLERDQKLSELGERADQLEQGASQF
EQQAGKLKRKQWWANMKMMIILGVIAVVLLIIVLVSLFN

a) Does find_codon_lesson6() work as you would like it to? To answer this question think about the following questions:

  1. Where does it find ATG in the nucleotide sequence provided above? Is the result correct?
  2. How about AAT?
  3. How about the codons TGT and TGT? What residue do these last two codons encode (for your convenience, standard genetic code)?
  4. Does this residue exist in the protein sequence above?

If you'd like, go ahead and use another tool to translate the nucleotide sequence to confirm yourself.

b) Unfortunately, at this point, you will have found a bug! Formalize this test by writing a new function test_find_codon(find_codon). It takes a single argument the function that is being tested. Write assert statements to check if find_codon returns the expected value for the 4 codons discussed above (ATG, AAT, TGT, TGC).

Remember, assert statements will raise an exception if they evaluate to False.

c) Code up a new function to find a specified codon within a given sequence by revising find_codon_lesson6. Of course, you should test it with test_find_codon() after it's written.

Practice 2: A chemical equilibrium calculator

In biochemistry, we often measure dissociation constants ($K_\mathrm{d}$) using various methods, usually involving titration. A dissociation constant is the equilibrium constant for the chemical reaction

\begin{align} \mathrm{ab} \rightleftharpoons \mathrm{a} + \mathrm{b}. \end{align}

It is related to the equilibrium concentration of the species a, b, and ab by

\begin{align} K_\mathrm{d} = \frac{c_\mathrm{a}\,c_\mathrm{b}}{c_{\mathrm{ab}}}. \end{align}

To analyze a titration curve, we often have to compute the equilibrium concentrations of the respective species given the parameters $K_\mathrm{d}$, $c_\mathrm{a}^0$, and $c_\mathrm{b}^0$, where the latter two parameters are the total concentrations of species a and b, respectively. This is typically known, as this is what the experimenter pipetted into the reaction solution.

As you probably worked out in your undergraduate general chemistry course, the respective equilibrium concentrations can be solved to be

\begin{align} c_\mathrm{ab} &= \frac{1}{2}\left(c_\mathrm{a}^0 + c_\mathrm{b}^0 + K_\mathrm{d} - \sqrt{\left(c_\mathrm{a}^0 + c_\mathrm{b}^0 + K_\mathrm{d}\right)^2 - 4c_\mathrm{a}^0 c_\mathrm{b}^0}\right), \\[1em] c_\mathrm{a} &= c_\mathrm{a}^0 - c_\mathrm{ab} \\[1em] c_\mathrm{b} &= c_\mathrm{b}^0 - c_\mathrm{ab} \end{align}

Your job is to write a function that computes the three concentrations, $c_\mathrm{a}$, $c_\mathrm{b}$, and $c_\mathrm{ab}$ given inputs $K_\mathrm{d}$, $c_\mathrm{a}^0$, and $c_\mathrm{b}^0$. Use the principles of TDD to develop it from the start. That is, define the call signature, write your first test, and make it fail. Then incrementally build your tests with your function.

There is one thing interesting about the way you define your tests for this function. You can check that the solution satisfies the equations describing chemical equilibrium and conservation of mass.

\begin{align} K_\mathrm{d} &= \frac{c_\mathrm{a}\,c_\mathrm{b}}{c_{\mathrm{ab}}}, \\[1em] c_\mathrm{a}^0 &= c_\mathrm{a} + c_\mathrm{ab}, \\[1em] c_\mathrm{b}^0 &= c_\mathrm{b} + c_\mathrm{ab}. \end{align}

Sometimes you have this luxury, a mathematical statement that must hold for the output of your function. This makes writing tests a bit easier and more intuitive. As a hint, for checking closeness of NumPy arrays, you can use the np.allclose() function.