Exercise 2

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

Exercises 2.2 and 2.3 were inspired by Libeskind-Hadas and Bush, Computing for Biologists, Cambridge University Press, 2014.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This exercise was generated from a Jupyter notebook. You can download the notebook here.



For all problems except for practice with TDD and refactoring, we will work with real data from the Salmonella enterica genome. The section of the genome we will work with is in the file ~git/bootcamp/data/salmonella_spi1_region.fna. I cut it out of the full genome. It contains Salmonella pathogenicity island I (SPI1), which is contains genes for surface receptors for host-pathogen interactions.


Exercise 2.1: Parsing a FASTA file

There are packages, like Biopython and scikit-bio for processing files you encounter in bioinformatics. In this problem, though, we will work on our file I/O skills.

a) Use command line tools to investigate the FASTA file, which is located at ~git/bootcamp/data/salmonella_spi1_region.fna. You will notice that the first line begins with a >, signifying that the line contains information about the sequence. The remainder of the lines are the sequence itself.

b) Use the file I/O skill you have learned to read in the sequence and store it as a single string with no gaps.


Exercise 2.2: Pathogenicity islands

Pathogenicity islands are often marked by different GC content than the rest of the genome. We will try to locate the pathogenicity island(s) in our section of the Salmonella genome by computing GC content.

a) Use principles of TDD to write a function that divides a sequence into blocks and computes the GC content for each block, returning a tuple. The function signature should look like

gc_blocks(seq, block_size)

To be clear, if seq = 'ATGACTACGT' and block_size = 4, the blocks to be considered are

ATGA
CTAC

and the function should return (0.25, 0.5). Note that the blocks are non-overlapping and that we don't bother with the fact that end of the sequence that does not fit completely in a block.

b) Write a function that takes as input a sequence, block size, and a threshold GC content, and returns the original sequence where every base in a block with GC content above threshold is capitalized and every base below the threshold is lowercase. You would call the function like this:

mapped_seq = gc_map(seq, block_size, gc_thresh)

For example,

gc_map('ATGACTACGT', 4, 0.4)

returns 'atgaCTAC'. Note that bases not included in GC blocks are not included in the output sequence. Again, use principles of TDD.

c) Use the gc_map() function to generate a GC content map for the Salmonella sequence with block_size = 1000 and gc_thresh = 0.45. Where do you think the pathogenicity island is?

d) Write the GC-mapped sequence (with upper and lower characters) to a new FASTA file. Use the same description line (which began with a > in the original FASTA file), and have line breaks every 60 characters in the sequence.


Exercise 2.3: ORF detection

For this problem, again use principles of TDD to help you write your functions.

a) Write a function, longest_orf(), that takes a DNA sequence as input and finds the longest open reading frame (ORF) in the sequence (we will not consider reverse complements). A sequence fragment constitutes an ORF if the following are all true.

  1. It begins with ATG.
  2. It ends with any of TGA, TAG, or TAA.
  3. The total number of bases is a multiple of 3.

Note that the sequence ATG may appear in the middle of an ORF. So, for example,

GGATGATGATGTAAAAC

has two ORFs, ATGATGATGTAA and ATGATGTAA. You would return the first one, since it is longer of these two.

Hint: The statement for this problem is a bit ambiguous as it is written. What other specification might you need for this function?

b) Use your function to find the longest ORF from the section of the Salmonella genome we are investigating.

c) Write a function that converts a DNA sequence into a protein sequence. You can of course use the bootcamp_utils module.

d) Translate the longest ORF you generated in part (b) into a protein sequence and perform a BLAST search. Search for the protein sequence (a blastp query). What gene is it?

e) [Bonus challenge] Modify your function to return the n longest ORFs. Compute the five longest ORFs for the Salmonella genome section we are working with. Perform BLAST searches on them. What are they?


Exercise 2.4: Code refactoring challenge

Refer back to the exercises on TDD. You wrote some functions in the file seq_features.py that compute the number of positively and negatively charged residues in a protein.

Let's say we want to add some functionality. First, we want to compute the net charge for a sequence. That is the sum of the positively charged residues minus the sum o the negatively charged residues. Second, we want to include pH dependence on the protonation state of the residues, and therefore the charge. We will approximate the pH dependence as follows. All residues have their respective positive of negative charge, except as indicated below.

  • Below pH 4, glutamic acid and aspartic acid have no charge.
  • Above pH 8, histidine has no charge.
  • Above pH 10, arginine and lysine have no charge.

a) You want a function with signature net_charge(seq, pH=7) that returns the equilibrium net charge of the sequence. Your task it to add this functionality to the seq_features.py module using principle of TDD. Think carefully. You have some decisions to make. How will you write net_charge(seq, pH=7)? How will you refactor number_positives() and number_negatives()?

As you get started, the current state as of the completion of the TDD lessons of the two .py files are as follows.

seq_features.py

import bootcamp_utils

def is_valid_sequence(seq):
    for aa in seq:
        if aa not in bootcamp_utils.aa.keys():
            raise RuntimeError(aa + ' is not a valid amino acid.')

def number_negatives(seq):
    """Number of negative residues a protein sequence"""
    # Convert sequence to upper case
    seq = seq.upper()

    # Check for a valid sequence
    is_valid_sequence(seq)

    # Count E's and D's, since these are the negative residues
    return seq.count('E') + seq.count('D')

def number_positives(seq):
    """Number of positive residues a protein sequence"""
    # Convert sequence to upper case
    seq = seq.upper()

    # Check for a valid sequence
    is_valid_sequence(seq)

    return seq.count('R') + seq.count('K') + seq.count('H')

test_seq_features.py

import pytest
import seq_features

def test_number_negatives_single_E_or_D():
    """Perform unit tests on number_negative for single AA"""
    assert seq_features.number_negatives('E') == 1
    assert seq_features.number_negatives('D') == 1

def test_number_negatives_for_empty():
    """Perform unit tests on number_negative for empty entry"""
    assert seq_features.number_negatives('') == 0

def test_number_negatives_for_short_sequences():
    """Perform unit tests on number_negative for short sequence"""
    assert seq_features.number_negatives('ACKLWTTAE') == 1
    assert seq_features.number_negatives('DDDDEEEE') == 8

def test_number_negatives_for_lowercase():
    """Perform unit tests on number_negative for lowercase"""
    assert seq_features.number_negatives('acklwttae') == 1

def test_number_negatives_for_invalid_amino_acid():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.number_negatives('Z')
    excinfo.match("Z is not a valid amino acid")

def test_number_positives_single_R_K_or_H():
    """Perform unit tests on number_positives for single AA"""
    assert seq_features.number_positives('R') == 1
    assert seq_features.number_positives('K') == 1
    assert seq_features.number_positives('H') == 1

def test_number_positives_for_empty():
    """Perform unit tests on number_positives for empty entry"""
    assert seq_features.number_positives('') == 0

def test_number_positives_for_short_sequences():
    """Perform unit tests on number_positives for short sequence"""
    assert seq_features.number_positives('RCKLWTTRE') == 3
    assert seq_features.number_positives('DDDDEEEE') == 0

def test_number_positives_for_lowercase():
    """Perform unit tests on number_positives for lowercase"""
    assert seq_features.number_positives('rcklwttre') == 3

def test_number_positives_for_invalid_amino_acid():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.number_positives('Z')
    excinfo.match("Z is not a valid amino acid")

def test_number_negatives_for_invalid_amino_acid_anywhere():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.number_negatives('AZK')
    excinfo.match("Z is not a valid amino acid")

def test_number_positives_for_invalid_amino_acid_anywhere():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.number_positives('AZK')
    excinfo.match("Z is not a valid amino acid")

def test_is_valid_sequence_for_invalid_amino_acid():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.is_valid_sequence('Z')
    excinfo.match("Z is not a valid amino acid")    

def test_is_valid_sequence_for_invalid_amino_acid_anywhere():
    with pytest.raises(RuntimeError) as excinfo:
        seq_features.is_valid_sequence('AZK')
    excinfo.match("Z is not a valid amino acid")

def test_is_valid_sequence_for_other_invalid_amino_acid_anywhere():
    assert seq_features.is_valid_sequence('ALKSAYGS') is None

    with pytest.raises(RuntimeError) as excinfo:
        seq_features.is_valid_sequence('AZLL')
    excinfo.match("Z is not a valid amino acid")

    with pytest.raises(RuntimeError) as excinfo:
        seq_features.is_valid_sequence('ALLBJ')
    excinfo.match("B is not a valid amino acid")

    with pytest.raises(RuntimeError) as excinfo:
        seq_features.is_valid_sequence('AL%J')
    excinfo.match("% is not a valid amino acid")

b) If we want to be more precise, equilibrium charge of a residue is a fractional quantity. I.e., some fraction of the time it can be changed, and others not. Working out the chemical equilibria, if a given residue has a dissociation constant $K_a$ and in a solution with a hydrogen ion concentration of [H$^+$], then the fraction $f$ of residues that are deprotonated is

\begin{align} f = \frac{K_a/[\mathrm{H}^+]}{1 + K_a/[\mathrm{H}^+]}. \end{align}

So if we want to compute the equilibrium charge, this is a quantitative property, not just a number of charged residues, as we have been computing in number_positives() and number_negatives(). Now that we're using fractional charges, it is important to remember that you can't test for exact equality of floats, so you will have to test for approximate equality in some of your tests. Here are a couple more specifications/useful information for this task.

  • Treat all residues except aspartic acid, glutamic acid, arginine, lysine, and histidine as having zero charge.
  • The pKₐ of the respective residues are in the table below.
Residue shorthand pKₐ
aspartic acid D 3.9
glutamic acid E 4.3
arginine R 12.0
lysine K 10.5
histidine H 6.08

Think about the problem again. How will you write net_charge(seq, pH=7)? What refactoring of other functions is necessary? Go ahead and attack this problem using principles of TDD.