Exercise 2

This exercise was generated from a Jupyter notebook. You can download the notebook here.

In [116]:
# Just need the regexes
import re

Problems 2.2 and 2.3 were inspired by Libeskind-Hadas and Bush, Computing for Biologists, Cambridge University Press, 2014.

For all problems, we will use our functions on real data from the Salmonella enterica genome. You can download the section of the genome we will work with here. I cut if out of the full genome. It contains Salmonella pathogenicity island I (SPI1), which is contains genes for surface receptors for host-pathogen interactions.

Problem 2.1: Parsing a FASTA file

In later tutorials, we will use some tools that facilitate parsing files in bioinformatics. In this problem, though, we will work on our file I/O skills. First, download the FASTA file containing the section of the Salmonella genome we will be working with.

a) Use command line tools to put the FASTA file in a directory where you can work with it. Use command line tools to investigate the file. You will notice that the first line begins with a >, signifying that the line contains information about the sequence. The remainder of the lines are the sequence itself.

b) Use the file I/O skill you have learned to read in the sequence and store it as a single string with no gaps.

Problem 2.1: solution

On my machine, the FASTA file with the portion of the Salmonella genome is located in ../data/salmonella_spi1_region.fna.

a) A quick look using less indicates that this is avalid FASTA file. I did a quick check to see how many sequences there are by searching for the > symbol using grep.

In [2]:
!grep ">" ../data/salmonella_spi1_region.fna
>gi|821161554|gb|CP011428.1| Salmonella enterica subsp. enterica strain YU39, complete genome, subsequence 3000000 to 3200000

We see that there is a single record. So, when we read it in, we just need to skip the first line and read on until we get the full record.

b) Because the file is not too big, we can just read it all in at once. We can then find the first newline character to trim off the first line, which is the description text. Finally, we just delete all newline characters to get our sequence. (This is just one of many ways to do this.)

In [72]:
# Read all the lines into a list
with open('../data/salmonella_spi1_region.fna', 'r') as f:
    file_str = f.read()

# Cut off first line, but keep descriptor
descriptor = file_str[:file_str.find('\n')]
file_str = file_str[file_str.find('\n')+1:]

# Eliminate newlines
seq = file_str.replace('\n', '')

# Take a look at the first 500 bases to make sure we got it right.

Looks good!

Problem 2.2: Pathogenicity islands

Pathogenicity islands are often marked by different GC content than the rest of the genome. We will try to locate the pathogenicity island(s) in our section of the Salmonella genome by computing GC content.

a) Write a function that divides a sequence into blocks and computes the GC content for each block, returning a tuple. The function signature should look like

gc_blocks(seq, block_size)

To be clear, if seq = 'ATGACTACGT and block_size = 4, the blocks to be considered are


and the function should return (0.25, 0.5). Note that the blocks are non-overlapping and that we don't bother with the end of the sequence that does not fit completely in a block.

b) Write a function that takes as input a sequence, block size, and a theshold GC content, and returns the original sequence where every base in a block with GC content above threshold is capitalized and every base below the threshold is lowercase. You would call the function like this:

mapped_seq = gc_map(seq, block_size, gc_thresh)

For example,

gc_map('ATGACTACGT', 4, 0.4)

returns 'atgaCTAC'. Note that bases not included in GC blocks are truncated.

c) Use the gc_map() function to generate a GC content map for the Salmonella sequence with block_size = 1000 and gc_thresh = 0.45. Where do you think the pathogenicity island is?

d) Write the GC-mapped sequence (with upper and lower characters) to a new FASTA file. Use the same description line (which began with a > in the original FASTA file), and have line breaks every 60 characters in the sequence.

Problem 2.3: Solution

a) The function is fairly simple. We loop through the sequence with a stride equal to the block size, computing the GC content for each subsequence of that length. We start with a function to compute GC content for a sequence.

In [89]:
def gc_content(seq):
    """GC content of a given sequence"""
    seq = seq.upper()
    return (seq.count('G') + seq.count('C')) / len(seq)

Next, we write the looping function.

In [90]:
def gc_blocks(seq, block_size):
    Divide sequence into non-overlapping blocks
    and compute GC content of each block.
    gc_blocks = []
    for i in range(0, len(seq), block_size):

b) We just use our already-written gc_content() function to decide how to modify the string of the sequence.

In [91]:
def gc_map(seq, block_size, gc_thresh):
    """Give back seq with lowercase letters where GC content is low."""

    out_seq = ''

    # Determine GC content of each block and change string accordingly
    for i in range(0, len(seq), block_size):
        if gc_content(seq[i:i+block_size]) < gc_thresh:
            out_seq += seq[i:i+block_size].lower()
            out_seq += seq[i:i+block_size].upper()

    return out_seq

c) Let's do it for Salmonella!

In [92]:
sal_gcmap = gc_map(seq, 1000, 0.45)

To save on display space, we will not display the sequence here. Scrolling through the GC map file generated in the next part, the pathogenicity island appears to occur about a quarter of the way into this subsequence.

d) To write the file out, we use the fact that we conveniently kept the description text when we parsed the Salmonella FASTA file in the first place. We then just write the sal_gcmap string in blocks of 60. We have to make sure to ge t the last few bases as well.

In [93]:
# Write the result
with open('salmonella_spi1_region_gc_map.fna', 'w') as f:
    # Write description text
    f.write(descriptor + '\n')

    # Write sequence in blocks of 60
    i = 0
    while i < len(sal_gcmap) - 59:
        f.write(sal_gcmap[i:i+60] + '\n')
        i += 60
    # Write last line

We'll take a quick look to see it worked out ok.

In [95]:
!head salmonella_spi1_region_gc_map.fna
!tail salmonella_spi1_region_gc_map.fna
>gi|821161554|gb|CP011428.1| Salmonella enterica subsp. enterica strain YU39, complete genome, subsequence 3000000 to 3200000

Problem 2.3: ORF detection

a) Write a function, longest_orf(), that takes a DNA sequence as input and finds the longest open reading frame (ORF) in the sequence (we will not consider reverse complements). A sequence fragment constitutes an ORF if the following are true.

  1. It begins with ATG.
  2. It ends with any of TGA, TAG, or TAA.
  3. The total number of bases is a multiple of 3.

Note that the sequence ATG may appear in the middle of an ORF. So, for example,


has two ORFs, ATGATGATGTAA and ATGATGTAA. You would return the first one, since it is longer of these two.

Use TDD principles. That is, be sure to write unit tests for this function.

Hint: The statement for this problem is a bit ambiguous as it is written. What other specification might need for this function. If you can't think of this right away, start writing tests, and it might become apparent to you.

b) Use your function to find the longest ORF from the section of the Salmonella genome we are investigating.

c) Write a function that converts a DNA sequence into a protein sequence. The dictionary we developed in Lesson 11 will help you.

d) Translate the longest ORF you generated in part (b) and perform a BLAST search. Search for the protein sequence (a blastp query). What gene is it?

e) [extra credit] Modify your function to return the $n$ longest ORFs. Compute the five longest ORFs for the Salmonella genome section we are working with. Perform BLAST searches on them. What are they?

Problem 2.3: solution

a) Something was missing in the specification. Namely, what do we do if there are two ORFs of the same longest length? Do we return the first one, second one, or both? I am arbitrarily choosing to return the one with the 3$'$-most starting index.

Looking ahead to part (d), I will first write a function to return all ORFs that are not entirely included in a longer ORFs. For ease of storage and comparison, I will simply store the ORFS as the index of the start of the ORF and the noninclusive index of the last.

Let's now discuss the algorithm we'll use. There are more efficient ways of finding ORFs, but I will choose a very clear way. We'll first find all start codons. For each start codon, we will find the first in-register stop codon. If there is an in-register stop codon, we store this start-stop pair. At the end, we sort them, longest to shortest.

So, we really have three functions we'll use. find_all_starts(seq) will return the indices of all start codons in a sequence. find_next_in_register_stop(seq) will scan sew and return the index of the next in register stop codon. If there is no such codon, it returns -1. Finally, all_orfs(seq) returns the sorted tuple of 2-tuples containing the start/stop pairs of the ORFs.

So, we should write tests for all of these functions.

In [3]:
def test_find_all_starts():
    assert find_all_starts('') == tuple()
    assert find_all_starts('GGAGACGACGCAAAAC'.lower()) == tuple()
    assert find_all_starts('AAAAAAATGAAATGAGGGGGGTATG'.lower()) == (6, 11, 22)
    assert find_all_starts('GGATGATGATGTAAAAC'.lower()) == (2, 5, 8)
    assert find_all_starts('GGATGCATGATGTAGAAC'.lower()) == (2, 6, 9)
    assert find_all_starts('GGGATGATGATGGGATGGTGAGTAGGGTAAG'.lower()) == \
                                                               (3, 6, 9, 14)
    assert find_all_starts('GGGatgatgatgGGatgGtgaGtagGGACtaaG'.lower()) == \
                                                               (3, 6, 9, 14)    
def test_find_first_in_register_stop():
    assert find_first_in_register_stop('') == -1
    assert find_first_in_register_stop('GTAATAGTGA'.lower()) == -1
    assert find_first_in_register_stop('AAAAAAAAAAAAAAATAAGGGTAA'.lower()) == 18
    assert find_first_in_register_stop('AAAAAACACCGCGTGTACTGA'.lower()) == 21
def test_all_orfs():
    assert all_orfs('') == tuple()
    assert all_orfs('GGAGACGACGCAAAAC') == tuple()
    assert all_orfs('AAAAAAATGAAATGAGGGGGGTATG') == ((6, 15),)
    assert all_orfs('GGATGATGATGTAAAAC') == ((2, 14),)
    assert all_orfs('GGATGCATGATGTAGAAC') == ((6, 15),)
    assert all_orfs('GGGATGATGATGGGATGGTGAGTAGGGTAAG') == ((3, 21),)
    assert all_orfs('GGGatgatgatgGGatgGtgaGtagGGACtaaG') == ((14, 32), (3, 21))

We'll start with the find_all_starts() function.

In [4]:
def find_all_starts(seq):
    """Find all start codons in sequence"""
    return None

And now we'll fail the test.

In [6]:
AssertionError                            Traceback (most recent call last)
<ipython-input-6-55021aff17ec> in <module>()
----> 1 test_find_all_starts()

<ipython-input-4-15abcef1717c> in test_find_all_starts()
      1 def test_find_all_starts():
----> 2     assert find_all_starts('') == tuple()
      3     assert find_all_starts('GGAGACGACGCAAAAC'.lower()) == tuple()
      4     assert find_all_starts('AAAAAAATGAAATGAGGGGGGTATG'.lower()) == (6, 11, 22)
      5     assert find_all_starts('GGATGATGATGTAAAAC'.lower()) == (2, 5, 8)


Now we'll write the function.

In [7]:
def find_all_starts(seq):
    """Find the starting index of all start codons in a lowercase seq"""
    # Compile regex for start codons
    regex_start = re.compile('atg')
    # Find the indices of all start codons
    starts = []
    for match in regex_start.finditer(seq):
    return tuple(starts)

And let's see if it passes the tests.

In [8]:

Yay! We passed! Now, let's move on to the next function, which finds the first in-register stop codon. Again, we fail, and then write the function.

In [9]:
def find_first_in_register_stop(seq):
    return None

AssertionError                            Traceback (most recent call last)
<ipython-input-9-c857d333ce02> in <module>()
      2     return None
----> 4 test_find_first_in_register_stop()

<ipython-input-4-15abcef1717c> in test_find_first_in_register_stop()
     11 def test_find_first_in_register_stop():
---> 12     assert find_first_in_register_stop('') == -1
     13     assert find_first_in_register_stop('GTAATAGTGA'.lower()) == -1
     14     assert find_first_in_register_stop('AAAAAAAAAAAAAAATAAGGGTAA'.lower()) == 18


Nice, beautiful failure. Now, we'll write the function and test it.

In [10]:
def find_first_in_register_stop(seq):
    Find first stop codon on lowercase seq that starts at an index
    that is divisible by three
    # Compile regexes for stop codons
    regex_stop = re.compile('(taa|tag|tga)')
    # Stop codon iterator
    stop_iterator = regex_stop.finditer(seq)

    # Find next stop codon that is in register
    for stop in stop_iterator:
        if stop.end() % 3 == 0:
            return stop.end()
    # Return -1 if we failed to find a stop codon
    return -1

And the test...

In [11]:

Passage! Finally, we apply TDD to write all_orfs().

In [103]:
def all_orfs(seq):
    """Return all ORFs of a sequence."""
    # Make sure sequence is all lower case
    seq = seq.lower()
    # Find the indices of all start codons
    starts = find_all_starts(seq)
    # Keep track of stops
    stops = []
    # Initialze ORFs.  Each entry in list is [ORF length, ORF start, ORF stop]
    orfs = []
    # For each start codon, find the next stop codon in register
    for start in starts:
        relative_stop = find_first_in_register_stop(seq[start:])
        if relative_stop != -1:
            # Index of stop codon
            stop = start + relative_stop
            # It already had stop, a longer ORF contains this one
            if stop not in stops:
                orfs.append((relative_stop, start, stop))
    # Get sorted list of ORF length
    orfs = sorted(orfs, reverse=True)
    # Remove lengths
    for i, orf in enumerate(orfs):
        orfs[i] = (orf[1], orf[2])
    return tuple(orfs)

Now for the tests....

In [104]:

Passage! We have succeed in generating an ordered list of the ORFs. Now, let's get what the problem specified, the longest ORF. Of course, we start with writing tests.

In [105]:
def test_longest_orf():
    assert longest_orf('') == ''
    assert longest_orf('GGAGACGACGCAAAAC') == ''
    assert longest_orf('GGATGATGATGTAAAAC') == 'ATGATGATGTAA'
    assert longest_orf('GGATGCATGATGTAGAAC') == 'ATGATGTAG'
    assert longest_orf('GGGatgatgatgGGatgGtgaGtagGGACtaaG') == \

We'll fail them....

In [106]:
def longest_orf(seq):
    return None

AssertionError                            Traceback (most recent call last)
<ipython-input-106-75f52821fc25> in <module>()
      2     return None
----> 4 test_longest_orf()

<ipython-input-105-ddc3d3850231> in test_longest_orf()
      1 def test_longest_orf():
----> 2     assert longest_orf('') == ''
      3     assert longest_orf('GGAGACGACGCAAAAC') == ''
      4     assert longest_orf('AAAAAAATGAAATGAGGGGGGTATG') == 'ATGAAATGA'
      5     assert longest_orf('GGATGATGATGTAAAAC') == 'ATGATGATGTAA'


We'll write our function, and then test it.

In [107]:
def longest_orf(seq):
    """Longest ORF of a sequence."""
    orfs = all_orfs(seq)
    if len(orfs) == 0:
        return ''
        return seq[orfs[0][0]:orfs[0][1]]
In [108]:

Passage! Success! We now have a reliable function for computing the longest ORF.

b) We simple use our new function to find the longest ORF of the Salmonella sequence.

In [109]:
# Compute it
salmonella_orf = longest_orf(seq)

# Look at it

c) We first borrow the code from Lesson 11 to get our dictionary of codons and amino acids. With this in hand, we can write our translate() function. We will scan the DNA sequence, generate a list of amino acids, and then join them into a protein sequence.

In [111]:
def translate(seq):
    """Translate a DNA sequence into protein"""
    # The set of DNA bases
    bases = ['T', 'C', 'A', 'G']

    # Build list of codons
    codons = []
    for first_base in bases:
        for second_base in bases:
            for third_base in bases:
                codons += [first_base + second_base + third_base]

    # The amino acids that are coded for (* = STOP codon)

    # Build dictionary from tuple of 2-tuples
    codon_dict = dict(zip(codons, amino_acids))

    # Make sure sequence is upper case
    seq = seq.upper()
    # Find start codon
    i = 0
    while i < len(seq) + 2 and seq[i:i+3] != 'ATG':
        i += 1

    # Translate until the stop codon or end of string
    prot = []
    while i < len(seq) - 2 and seq[i:i+3] not in ('TAA', 'TGA', 'TAG'):
        i += 3

    return ''.join(prot)

d) We can now translate the protein

In [112]:

Doing a BLAST search on this protein indicates that it is a histidine kinase involved in signaling.

e) We already are computing all of the ORFs. We an therefore just add a kwarg to our longest_orf() function to get the n longest ones.

In [113]:
def longest_orf(seq, n=1):
    """Longest ORF of a sequence."""
    orfs = all_orfs(seq)
    if len(orfs) == 0:
        return ''
    elif n == 1 or len(orfs) == 1:
        return seq[orfs[0][0]:orfs[0][1]]
        return_list = []
        for i in range(min(n, len(orfs))):
        return tuple(return_list)

Now, we'll compute the ORFs, translate them, and make a FASTA file to submit for a BLAST search.

In [114]:
# Compute ORFs
orfs = longest_orf(seq, n=5)

# Translate them
prots = []
for orf in orfs:
# Make a FASTA file
with open('sal_seqs.faa', 'w') as f:
    for i, prot in enumerate(prots):
        f.write('> {0:d}\n'.format(i))
        f.write(prot + '\n')

We can take a look at it to see what I did with the above code.

In [115]:
!cat sal_seqs.faa
> 0
> 1
> 2
> 3
> 4

Upon doing the BLAST search, I found that the genes are:

Length rank Description
1 histine kinase
2 DNA rapair protein MutS
3 formate hydrogenlyase transcriptional activator
4 L-fucose isomerase
5 transcriptional regulator HilA (invasion regulator)