Lesson 17: Review of exercise 2¶

Exercises 2.3 and 2.4 were inspired by Libeskind-Hadas and Bush, Computing for Biologists, Cambridge University Press, 2014.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This lesson was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
import re
import numpy as np
import bootcamp_utils


Exercise 2.1: Parsing a FASTA file¶

There are packages, like Biopython and scikit-bio for processing files you encounter in bioinformatics. In this problem, though, we will work on our file I/O skills.

a) Use command line tools to investigate the FASTA file located at ~git/bootcamp/data/salmonella_spi1_region.fna. This file contains a portion of the Salmonella genome (described in Exercise 2.3).

You will notice that the first line begins with a >, signifying that the line contains information about the sequence. The remainder of the lines are the sequence itself.

b) The format of the Salmonella SPI1 region FASTA file is a common format for such files (though oftentimes FASTA files contain multiple sequences). Use the file I/O skills you have learned to write a function to read in a sequence from a FASTA file containing a single sequence (but possibly having the first line in the file beginning with >). Your function should take as input the name of the FASTA file and return two strings. First, it should return the descriptor string (which starts with >). Second, it should return a string with no gaps containing the sequence.

Test your function on the Salmonella sequence.

Exercise 2.1: solution¶

a) A quick look using less indicates that this is a valid FASTA file. I did a quick check to see how many sequences there are by searching for the > symbol using grep.

In [2]:
!grep ">" data/salmonella_spi1_region.fna

>gi|821161554|gb|CP011428.1| Salmonella enterica subsp. enterica strain YU39, complete genome, subsequence 3000000 to 3200000

In [3]:
!grep ">" data/lambdaDNA.fasta

>Lambda_NEB


We see that in both there is a single record. So, when we read it in, we just need to skip the first line and read on until we get the full record.

b) We will read in the files line by line, saving the first line as the descriptor, and then building the sequence as we go along.

In [4]:
def read_fasta(filename):
"""Read a sequence in from a FASTA file containing a single sequence.

We assume that the first line of the file is the descriptor and all
subsequent lines are sequence.
"""
with open(filename, 'r') as f:

# Read in sequence, stripping the whitespace from each line
seq = ''
while line != '':
seq += line

return descriptor, seq


Of course, when writing this function, we should be taking a test-driven development approach, but here we are focusing on our file I/O and string handling skills.

Let's take the function for a drive!

In [5]:
descriptor, seq = read_fasta('data/salmonella_spi1_region.fna')

# Take a look at the first 500 bases to make sure we got it right.
seq[:500]

Out[5]:
'AAAACCTTAGTAACTGGACTGCTGGGATTTTTCAGCCTGGATACGCTGGTAGATCTCTTCACGATGGACAGAAACTTCTTTCGGGGCGTTCACGCCAATACGCACCTGGTTGCCCTTCACCCCTAAAACTGTCACGGTGACCTCATCGCCAATCATGAGGGTCTCACCAACTCGACGAGTCAGAATCAGCATTCTTTGCTCCTTGAAAGATTAAAAGAGTCGGGTCTCTCTGTATCCCGGCATTATCCATCATATAACGCCAAAAAGTAAGCGATGACAAACACCTTAGGTGTAAGCAGTCATGGCATTACATTCTGTTAAACCTAAGTTTAGCCGATATACAAAACTTCAACCTGACTTTATCGTTGTCGATAGCGTTGACGTAAACGCCGCAGCACGGGCTGCGGCGCCAACGAACGCTTATAATTATTGCAATTTTGCGCTGACCCAGCCTTGTACACTGGCTAACGCTGCAGGCAGAGCTGCCGCATCCGTACCAC'

Looks good!

And for the λ-DNA....

In [6]:
descriptor_lambda, seq_lambda = read_fasta('data/lambdaDNA.fasta')

# Take a look at the first 500 bases to make sure we got it right.
seq_lambda[:500]

Out[6]:
'GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTCGTCATAACTTAATGTTTTTATTTAAAATACCCTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCGAGGCTTTTTGGCCTCTGTCGTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAACAAAAAGCAGCTGGCTGACATTTTCGGTGCGAGTATCCGTACCATTCAGAACTGGCAGGAACAGGGAATGCCCGTTCTGCGAGGCGGTGGCAAGGGTAATGAGGTGCTTTATGACTCTGCCGCCGTCATAAAATGGTATGCCGAAAGGGATGCTGAAATTGAGAACGAAAAGCTGCGCCGGGAGGTTGAAGAACTGCGGCAGGCCAGCGAGGCAGATCTCCAGCCAGGAACTATTGAGTACGAACGCCATCGACTTACGCGTGCGCAGGCCGACGCACAGGAACTGAAGAATGCCAGAG'

Exercise 2.2: Restriction cut sites¶

Restriction enzymes cut DNA at specific locations called restriction sites. The sequence at a restriction site is called a recognition sequence. Here are the recognition sequences of some commonly used restriction enzymes.

Restriction enzyme Recognition sequence
HindIII AAGCTT
EcoRI GAATTC
KpnI GGTACC

a) Download the FASTA file (New England Biosystems) from containing the genome of λ-phage, a bacteriophage that infect E. coli, here. Use the function you wrote in Exercise 2.1 to extract the sequence.

b) Write a function with call signature

restriction_sites(seq, recoq_seq)


that takes as arguments a sequence and the recognition sequence of a restriction enzyme sites and returns the indices of the first base or each of the restriction sites in the sequence. Use this function to find the indices of the restriction sites of λ-DNA for HindIII, EcoRI, and KpnI. Compare your results to those reported on the New England Biosystems datasheet.

Exercise 2.2: solution¶

a) I have downloaded the FASTA file to data/lambdaDNA.fasta. Let's load it in.

In [7]:
_, seq = read_fasta('data/lambdaDNA.fasta')


We can check to make sure we got the whole sequence, which should be about 48.5 kbp.

In [8]:
len(seq)

Out[8]:
48502

Looks good!

b) Our goal is to find all locations of the substring given by the recognition sequence in the genome. This is most easily accomplished using the re module that uses regular expressions. We covered this in bootcamp 2015, but have since cut it from the course. Specifically, we use re.finditer() to automatically find all occurrences of the recognition sequence in the sequence, and then use the start() method to get the first index.

In [9]:
def restriction_sites_with_re(seq, recog_seq):
"""Find the indices of all restriction sites in a sequence."""
sites = []
for site in re.finditer(recog_seq, seq):
sites.append(site.start())

return sites


Let's check our restriction sites against the know sites.

In [10]:
print('HindIII:', restriction_sites_with_re(seq, 'AAGCTT'))
print('EcoRI:  ', restriction_sites_with_re(seq, 'GAATTC'))
print('KpnI:   ', restriction_sites_with_re(seq, 'GGTACC'))

HindIII: [23129, 25156, 27478, 36894, 37458, 37583, 44140]
EcoRI:   [21225, 26103, 31746, 39167, 44971]
KpnI:    [17052, 18555]


Mostly, these match exactly the listed values, except each index reported by our function is one less than the listed values because we are indexing starting at zero. The one example that is different from the data sheet is that the data sheet does not contain the cut site at index 37583. This could be because NEB made a mistake either in the sequencing or in determining their cut cits, or because they validated the cut sites experimentally, and for some reason there is some strange structure around that cut site that precludes cutting.

I will now write a function do to the same calculation without using regular expressions. We will simply make a pass through the sequence and store the indices where we match the recognition sequence.

In [11]:
def restriction_sites(seq, recog_seq):
"""Find the indices of all restriction sites in a sequence."""
# Initialize list of restriction sites
sites = []

# Check every substring for a match
for i in range(len(seq) - len(recog_seq)):
if seq[i:i+len(recog_seq)] == recog_seq:
sites.append(i)

return sites


And let's use this function.

In [12]:
print('HindIII:', restriction_sites(seq, 'AAGCTT'))
print('EcoRI:  ', restriction_sites(seq, 'GAATTC'))
print('KpnI:   ', restriction_sites(seq, 'GGTACC'))

HindIII: [23129, 25156, 27478, 36894, 37458, 37583, 44140]
EcoRI:   [21225, 26103, 31746, 39167, 44971]
KpnI:    [17052, 18555]


For fun, we can compare the speeds of the two functions using the magic function %timeit. This function performs a calculation many times and computes a mean execution time. It can be used to check speed of your functions.

In [13]:
%timeit restriction_sites(seq, 'AAGCTT')
%timeit restriction_sites_with_re(seq, 'AAGCTT')

10.6 ms ± 76.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
166 µs ± 3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


The regular expression-based method is nearly 100 times faster! This is often the case; hand-coding something in pure Python can be slow compared to using packages that use pre-compiled code like re. However, as Donald Knuth said, "The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming." Get your code working, even if it is slow, and optimize only if you have to.

Exercise 2.3: Pathogenicity islands¶

For this and the next problem, we will work with real data from the Salmonella enterica genome. The section of the genome we will work with is in the file ~git/bootcamp/data/salmonella_spi1_region.fna. I cut it out of the full genome. It contains Salmonella pathogenicity island I (SPI1), which contains genes for surface receptors for host-pathogen interactions.

Pathogenicity islands are often marked by different GC content than the rest of the genome. We will try to locate the pathogenicity island(s) in our section of the Salmonella genome by computing GC content.

a) Use principles of TDD to write a function that divides a sequence into blocks and computes the GC content for each block, returning a tuple. The function signature should look like

gc_blocks(seq, block_size)



To be clear, if seq = 'ATGACTACGT' and block_size = 4, the blocks to be considered are

ATGA
CTAC



and the function should return (0.25, 0.5). Note that the blocks are non-overlapping and that we don't bother with the fact that end of the sequence that does not fit completely in a block.

b) Write a function that takes as input a sequence, block size, and a threshold GC content, and returns the original sequence where every base in a block with GC content above threshold is capitalized and every base below the threshold is lowercase. You would call the function like this:

mapped_seq = gc_map(seq, block_size, gc_thresh)



For example,

gc_map('ATGACTACGT', 4, 0.4)



returns 'atgaCTAC'. Note that bases not included in GC blocks are not included in the output sequence. Again, use principles of TDD.

c) Use the gc_map() function to generate a GC content map for the Salmonella sequence with block_size = 1000 and gc_thresh = 0.45. Where do you think the pathogenicity island is?

d) Write the GC-mapped sequence (with upper and lower characters) to a new FASTA file. Use the same description line (which began with a > in the original FASTA file), and have line breaks every 60 characters in the sequence.

Exercise 2.3: Solution¶

a) First let's write our tests. In writing the tests, I am making the design decision that I will count the characters G, g, C, and c as contributing to GC content, and that I will not check to make sure the sequence is valid. I also make the design decision that an empty sequence has zero GC content.

In [14]:
def test_gc_content():
assert gc_content('') == 0.0
assert gc_content('G') == 1.0
assert gc_content('g') == 1.0
assert gc_content('C') == 1.0
assert gc_content('c') == 1.0
assert gc_content('gcgcgc') == 1.0
assert gc_content('aaatatata') == 0.0
assert np.isclose(gc_content('ggatcggcga'), 0.7)
assert np.isclose(gc_content('attgggggcaatta'), 3/7)


The function is fairly simple. We loop through the sequence with a stride equal to the block size, computing the GC content for each subsequence of that length. We start with a function to compute GC content for a sequence.

In [15]:
def gc_content(seq):
"""GC content of a given sequence"""
if seq == '':
return 0.0

seq = seq.upper()
return (seq.count('G') + seq.count('C')) / len(seq)


Now let's test it.

In [16]:
test_gc_content()


Passage! Next, we write the looping function, starting with its tests.

In [17]:
def test_gc_blocks():
assert gc_blocks('', 10) == tuple()
assert gc_blocks('gcgcgcgcg', 10) == tuple()
assert gc_blocks('gcgcgcg', 4) == (1.0,)
assert gc_blocks('gcgcgcgc', 4) == (1.0, 1.0)
assert gc_blocks('gcgcgcgcat', 4) == (1.0, 1.0)

test_tuple = gc_blocks('gcgagcgcat', 4)
assert np.isclose(test_tuple[0], 0.75) and test_tuple[1] == 1.0

assert gc_blocks('gcgtagagc', 1) == (1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0)
assert gc_blocks('gcgtagagc', 2) == (1.0, 0.5, 0.5, 0.5)
assert np.isclose(gc_blocks('gcgtagagc', 3), (1.0, 1/3, 2/3)).sum() == 3


Now let's write our function.

In [18]:
def gc_blocks(seq, block_size):
"""
Divide sequence into non-overlapping blocks
and compute GC content of each block.
"""
blocks = []
for i in range(0, len(seq) - (len(seq) % block_size), block_size):
blocks.append(gc_content(seq[i:i+block_size]))
return tuple(blocks)


And the tests....

In [19]:
test_gc_blocks()


Success! Let's take this function for a spin, looking at 1000-base blocks.

In [20]:
gc_blocks(seq, 1000)

Out[20]:
(0.516,
0.543,
0.551,
0.584,
0.604,
0.594,
0.55,
0.578,
0.554,
0.572,
0.58,
0.582,
0.59,
0.574,
0.585,
0.562,
0.584,
0.589,
0.535,
0.55,
0.589,
0.515,
0.424,
0.308,
0.406,
0.364,
0.351,
0.397,
0.447,
0.434,
0.474,
0.502,
0.512,
0.429,
0.403,
0.456,
0.41,
0.435,
0.451,
0.522,
0.5,
0.475,
0.496,
0.461,
0.498,
0.492,
0.46,
0.375)

We get a tuple of GC content, which is hard to look at on screen, but this is useful for plotting GC content over the course of a sequence. We will learn how to plot later in the bootcamp.

b) We just use our already-written gc_content() function to decide how to modify the string of the sequence. First, the tests. We make the design decision that we will truncate the sequence if the 3'-most end is shorter than the block length.

In [21]:
def test_gc_map():
assert gc_map('', 10, 0.5) == ''
assert gc_map('ATATATATA', 4, 0.5) == 'atatatat'
assert gc_map('GCGCGCGCG', 4, 0.5) == 'GCGCGCGC'
assert gc_map('GATCGATCC', 4, 0.5) == 'GATCGATC'
assert gc_map('GATCGATCC', 4, 0.51) == 'gatcgatc'
assert gc_map('GATCGATCC', 3, 0.5) == 'gatCGATCC'
assert gc_map('GATCGATCC', 3, 0.75) == 'gatcgatcc'
assert gc_map('GATCGATCC', 3, 0.25) == 'GATCGATCC'


Now the function....

In [22]:
def gc_map(seq, block_size, gc_thresh):
"""Give back seq with lowercase letters where GC content is low."""
out_seq = ''

# Determine GC content of each block and change string accordingly
for i in range(0, len(seq) - (len(seq) % block_size), block_size):
if gc_content(seq[i:i+block_size]) < gc_thresh:
out_seq += seq[i:i+block_size].lower()
else:
out_seq += seq[i:i+block_size].upper()

return out_seq


And the tests.

In [23]:
test_gc_map()


Passage! We can now use these functions to analyze sequences of interest.

c) Let's do it for Salmonella!

In [24]:
sal_gcmap = gc_map(seq, 1000, 0.45)


To save on display space, we will not display the sequence here. Scrolling through the GC map file generated in the next part, the pathogenicity island appears to occur about a quarter of the way into this subsequence.

d) To write the file out, we use the fact that we conveniently kept the description text when we parsed the Salmonella FASTA file in the first place. We then just write the sal_gcmap string in blocks of 60. We have to make sure to get the last few bases as well.

In [25]:
# Write the result
with open('salmonella_spi1_region_gc_map.fna', 'w') as f:
# Write description text
f.write(descriptor + '\n')

# Write sequence in blocks of 60
i = 0
while i < len(sal_gcmap) - 59:
f.write(sal_gcmap[i:i+60] + '\n')
i += 60

# Write last line
f.write(sal_gcmap[i:] + '\n')


We'll take a quick look to see it worked out ok.

In [26]:
!head salmonella_spi1_region_gc_map.fna
print('...')
!tail salmonella_spi1_region_gc_map.fna

>gi|821161554|gb|CP011428.1| Salmonella enterica subsp. enterica strain YU39, complete genome, subsequence 3000000 to 3200000
GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCG
TTCTTCTTCGTCATAACTTAATGTTTTTATTTAAAATACCCTCTGAAAAGAAAGGAAACG
ACAGGTGCTGAAAGCGAGGCTTTTTGGCCTCTGTCGTTTCCTTTCTCTGTTTTTGTCCGT
GGAATGAACAATGGAAGTCAACAAAAAGCAGCTGGCTGACATTTTCGGTGCGAGTATCCG
TACCATTCAGAACTGGCAGGAACAGGGAATGCCCGTTCTGCGAGGCGGTGGCAAGGGTAA
TGAGGTGCTTTATGACTCTGCCGCCGTCATAAAATGGTATGCCGAAAGGGATGCTGAAAT
TGAGAACGAAAAGCTGCGCCGGGAGGTTGAAGAACTGCGGCAGGCCAGCGAGGCAGATCT
CCAGCCAGGAACTATTGAGTACGAACGCCATCGACTTACGCGTGCGCAGGCCGACGCACA
GGAACTGAAGAATGCCAGAGACTCCGCTGAAGTGGTGGAAACCGCATTCTGTACTTTCGT
...
ccttccatgtgatacgagggcgcgtagtttgcattatcgtttttatcgtttcaatctggt
ctgacctccttgtgttttgttgatgatttatgtcaaatattaggaatgttttcacttaat
agtattggttgcgtaacaaagtgcggtcctgctggcattctggagggaaatacaaccgac
agatgtatgtaaggccaacgtgctcaaatcttcatacagaaagatttgaagtaatatttt
aaccgctagatgaagagcaagcgcatggagcgacaaaatgaataaagaacaatctgctga
tgatccctccgtggatctgattcgtgtaaaaaatatgcttaatagcaccatttctatgag
ttaccctgatgttgtaattgcatgtatagaacataaggtgtctctggaagcattcagagc
aattgaggcagcgttggtgaagcacgataataatatgaaggattattccctggtggttga
ctgatcaccataactgctaatcattcaaactatttagtctgtgacagagccaacacgcag



Looks good!

Exercise 2.4: ORF detection¶

a) Write a function, longest_orf(), that takes a DNA sequence as input and finds the longest open reading frame (ORF) in the sequence (we will not consider reverse complements). A sequence fragment constitutes an ORF if the following are all true.

1. It begins with ATG.
2. It ends with any of TGA, TAG, or TAA.
3. The total number of bases is a multiple of 3.

Note that the sequence ATG may appear in the middle of an ORF. So, for example,

GGATGATGATGTAAAAC



has two ORFs, ATGATGATGTAA and ATGATGTAA. You would return the first one, since it is longer of these two.

Hint: The statement for this problem is a bit ambiguous as it is written. What other specification might you need for this function?

b) Use your function to find the longest ORF from the section of the Salmonella genome we are investigating.

c) Write a function that converts a DNA sequence into a protein sequence. You can of course use the bootcamp_utils module.

d) Translate the longest ORF you generated in part (b) into a protein sequence and perform a BLAST search. Search for the protein sequence (a blastp query). What gene is it?

e) [Bonus challenge] Modify your function to return the n longest ORFs. Compute the five longest ORFs for the Salmonella genome section we are working with. Perform BLAST searches on them. What are they?

Exercise 2.4: solution¶

a) Something was missing in the specification. Namely, what do we do if there are two ORFs of the same longest length? Do we return the first one, second one, or both? I am arbitrarily choosing to return the one with the 3$'$-most starting index.

Looking ahead to part (e), I will first write a function to return all ORFs that are not entirely included in a longer ORF. For ease of storage and comparison, I will simply store the ORFS as the index of the start of the ORF and the noninclusive index of the last.

Let's now discuss the algorithm we'll use. There are more efficient ways of finding ORFs, but I will choose a very clear way. We'll first find all start codons. For each start codon, we will find the first in-register stop codon. If there is an in-register stop codon, we store this start-stop pair. At the end, we sort them, longest to shortest.

So, we really have three functions we'll use. find_all_starts(seq) will return the indices of all start codons in a sequence. find_next_in_register_stop(seq) will scan a sequence that starts with ATG and return the exclusive final index of the next in register stop codon. In other words, and ORF starting at index start is given by

seq[start:start+find_next_in_register_stop(seq[start:])]



If there is no such codon, find_next_in_register_stop() returns -1. Finally, all_orfs(seq) returns the sorted tuple of 2-tuples containing the start/stop pairs of the ORFs.

I will use TDD principles for designing these functions, writing the test cases first.

In [27]:
def test_find_all_starts():
assert find_all_starts('') == tuple()
assert find_all_starts('GGAGACGACGCAAAAC'.lower()) == tuple()
assert find_all_starts('AAAAAAATGAAATGAGGGGGGTATG'.lower()) == (6, 11, 22)
assert find_all_starts('GGATGATGATGTAAAAC'.lower()) == (2, 5, 8)
assert find_all_starts('GGATGCATGATGTAGAAC'.lower()) == (2, 6, 9)
assert find_all_starts('GGGATGATGATGGGATGGTGAGTAGGGTAAG'.lower()) == \
(3, 6, 9, 14)
assert find_all_starts('GGGatgatgatgGGatgGtgaGtagGGACtaaG'.lower()) == \
(3, 6, 9, 14)

def test_find_first_in_register_stop():
assert find_first_in_register_stop('') == -1
assert find_first_in_register_stop('GTAATAGTGA'.lower()) == -1
assert find_first_in_register_stop('AAAAAAAAAAAAAAATAAGGGTAA'.lower()) == 18
assert find_first_in_register_stop('AAAAAACACCGCGTGTACTGA'.lower()) == 21

def test_all_orfs():
assert all_orfs('') == tuple()
assert all_orfs('GGAGACGACGCAAAAC') == tuple()
assert all_orfs('AAAAAAATGAAATGAGGGGGGTATG') == ((6, 15),)
assert all_orfs('GGATGATGATGTAAAAC') == ((2, 14),)
assert all_orfs('GGATGCATGATGTAGAAC') == ((6, 15),)
assert all_orfs('GGGATGATGATGGGATGGTGAGTAGGGTAAG') == ((3, 21),)
assert all_orfs('GGGatgatgatgGGatgGtgaGtagGGACtaaG') == ((14, 32), (3, 21))


We'll start with the find_all_starts() function.

In [28]:
def find_all_starts(seq):
"""Find all start codons in sequence"""
return None


And now we'll fail the test.

In [29]:
test_find_all_starts()

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
----> 1 test_find_all_starts()

<ipython-input-27-5c76a7a3caff> in test_find_all_starts()
1 def test_find_all_starts():
----> 2     assert find_all_starts('') == tuple()
3     assert find_all_starts('GGAGACGACGCAAAAC'.lower()) == tuple()
4     assert find_all_starts('AAAAAAATGAAATGAGGGGGGTATG'.lower()) == (6, 11, 22)
5     assert find_all_starts('GGATGATGATGTAAAAC'.lower()) == (2, 5, 8)

AssertionError: 

Now we'll write the function. I'll use regular expressions first, but will also code up the function without them.

In [30]:
def find_all_starts(seq):
"""Find the starting index of all start codons in a lowercase seq"""
# Compile regex for start codons
regex_start = re.compile('atg')

# Find the indices of all start codons
starts = []
for match in regex_start.finditer(seq):
starts.append(match.start())

return tuple(starts)


And let's see if it passes the tests.

In [31]:
test_find_all_starts()


Yay! We passed! However, since we did not learn regular expressions this year, I will write a function that finds all start codons that does not use them.

In [32]:
def find_all_starts(seq):
"""Find the starting index of all start codons in a lowercase seq"""
# Initialize array of indices of start codons
starts = []

# Find index of first start codon (remember, find() returns -1 if not found)
i = seq.find('atg')

# Keep looking for subsequence incrementing starting point of search
while i >= 0:
starts.append(i)
i = seq.find('atg', i + 1)

return tuple(starts)


Now let's test this new find_all_starts() function

In [33]:
test_find_all_starts()


It passed! Yay! Note how useful TDD is here. Whenever we try new ways of doing things, we can use the tests to make sure we didn't break anything.

Now, let's move on to the next function, which finds the first in-register stop codon. Again, we fail, and then write the function.

In [34]:
def find_first_in_register_stop(seq):
return None

test_find_first_in_register_stop()

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-34-99ed189bf75f> in <module>
2     return None
3
----> 4 test_find_first_in_register_stop()

<ipython-input-27-5c76a7a3caff> in test_find_first_in_register_stop()
12
13 def test_find_first_in_register_stop():
---> 14     assert find_first_in_register_stop('') == -1
15     assert find_first_in_register_stop('GTAATAGTGA'.lower()) == -1
16     assert find_first_in_register_stop('AAAAAAAAAAAAAAATAAGGGTAA'.lower()) == 18

AssertionError: 

Nice, beautiful failure. Now, we'll write the function and test it. Again, I'll demonstrate the power of the re module.

In [35]:
def find_first_in_register_stop(seq):
"""
Find first stop codon on lowercase seq that starts at an index
that is divisible by three
"""
# Compile regexes for stop codons
regex_stop = re.compile('(taa|tag|tga)')

# Stop codon iterator
stop_iterator = regex_stop.finditer(seq)

# Find next stop codon that is in register
for stop in stop_iterator:
if stop.end() % 3 == 0:
return stop.end()

# Return -1 if we failed to find a stop codon
return -1


And the test...

In [36]:
test_find_first_in_register_stop()


Great! It passes. Now, I'll write it without regular expressions.

In [37]:
def find_first_in_register_stop(seq):
"""
Find first stop codon on seq that starts at an index
that is divisible by three
"""

seq = seq.lower()

# Scan sequence for stop codon
i = 0
while i < len(seq) - 2 and seq[i:i+3] not in ('taa', 'tag', 'tga'):
i += 3

# If before end, found codon, return end of codon
if i < len(seq) - 2:
return i + 3
else: # Failed to find stop codon
return -1


Let's test this function to make sure it works.

In [38]:
test_find_first_in_register_stop()


Passage! Finally, we apply TDD to write all_orfs().

In [39]:
def all_orfs(seq):
"""Return all ORFs of a sequence."""
# Make sure sequence is all lower case
seq = seq.lower()

# Find the indices of all start codons
start_inds = find_all_starts(seq)

# Keep track of stops
stop_inds = []

# Initialze ORFs.  Each entry in list is [ORF length, ORF start, ORF stop]
orfs = []

# For each start codon, find the next stop codon in register
for start in start_inds:
relative_stop = find_first_in_register_stop(seq[start:])

if relative_stop != -1:
# Index of stop codon
stop = start + relative_stop

if stop not in stop_inds:
orfs.append((relative_stop, start, stop))
stop_inds.append(stop)

# Get sorted list of ORF length
orfs = sorted(orfs, reverse=True)

# Remove lengths
for i, orf in enumerate(orfs):
orfs[i] = (orf[1], orf[2])

return tuple(orfs)


Now for the tests....

In [40]:
test_all_orfs()


Passage! We have succeed in generating an ordered list of the ORFs. Now, let's get what the problem specified, the longest ORF. Of course, we start with writing tests.

In [41]:
def test_longest_orf():
assert longest_orf('') == ''
assert longest_orf('GGAGACGACGCAAAAC') == ''
assert longest_orf('AAAAAAATGAAATGAGGGGGGTATG') == 'ATGAAATGA'
assert longest_orf('GGATGATGATGTAAAAC') == 'ATGATGATGTAA'
assert longest_orf('GGATGCATGATGTAGAAC') == 'ATGATGTAG'
assert longest_orf('GGGATGATGATGGGATGGTGAGTAGGGTAAG') == 'ATGATGATGGGATGGTGA'
assert (longest_orf('GGGatgatgatgGGatgGtgaGtagGGACtaaG')
== 'atgGtgaGtagGGACtaa')


We'll fail them....

In [42]:
def longest_orf(seq):
return None

test_longest_orf()

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-42-5408c7075a5c> in <module>
2     return None
3
----> 4 test_longest_orf()

<ipython-input-41-ea5298a7be36> in test_longest_orf()
1 def test_longest_orf():
----> 2     assert longest_orf('') == ''
3     assert longest_orf('GGAGACGACGCAAAAC') == ''
4     assert longest_orf('AAAAAAATGAAATGAGGGGGGTATG') == 'ATGAAATGA'
5     assert longest_orf('GGATGATGATGTAAAAC') == 'ATGATGATGTAA'

AssertionError: 

We'll write our function, and then test it.

In [43]:
def longest_orf(seq):
"""Longest ORF of a sequence."""
orfs = all_orfs(seq)

if len(orfs) == 0:
return ''
else:
return seq[orfs[0][0]:orfs[0][1]]

In [44]:
test_longest_orf()


Passage! Success! We now have a reliable function for computing the longest ORF.

b) We simply use our new function to find the longest ORF of the Salmonella sequence.

In [45]:
# Compute it
salmonella_orf = longest_orf(seq)

# Look at it
salmonella_orf

Out[45]:
'ATGGGTAAAGGAAGCAGTAAGGGGCATACCCCGCGCGAAGCGAAGGACAACCTGAAGTCCACGCAGTTGCTGAGTGTGATCGATGCCATCAGCGAAGGGCCGATTGAAGGTCCGGTGGATGGCTTAAAAAGCGTGCTGCTGAACAGTACGCCGGTGCTGGACACTGAGGGGAATACCAACATATCCGGTGTCACGGTGGTGTTCCGGGCTGGTGAGCAGGAGCAGACTCCGCCGGAGGGATTTGAATCCTCCGGCTCCGAGACGGTGCTGGGTACGGAAGTGAAATATGACACGCCGATCACCCGCACCATTACGTCTGCAAACATCGACCGTCTGCGCTTTACCTTCGGTGTACAGGCACTGGTGGAAACCACCTCAAAGGGTGACAGGAATCCGTCGGAAGTCCGCCTGCTGGTTCAGATACAACGTAACGGTGGCTGGGTGACGGAAAAAGACATCACCATTAAGGGCAAAACCACCTCGCAGTATCTGGCCTCGGTGGTGATGGGTAACCTGCCGCCGCGCCCGTTTAATATCCGGATGCGCAGGATGACGCCGGACAGCACCACAGACCAGCTGCAGAACAAAACGCTCTGGTCGTCATACACTGAAATCATCGATGTGAAACAGTGCTACCCGAACACGGCACTGGTCGGCGTGCAGGTGGACTCGGAGCAGTTCGGCAGCCAGCAGGTGAGCCGTAATTATCATCTGCGCGGGCGTATTCTGCAGGTGCCGTCGAACTATAACCCGCAGACGCGGCAATACAGCGGTATCTGGGACGGAACGTTTAAACCGGCATACAGCAACAACATGGCCTGGTGTCTGTGGGATATGCTGACCCATCCGCGCTACGGCATGGGGAAACGTCTTGGTGCGGCGGATGTGGATAAATGGGCGCTGTATGTCATCGGCCAGTACTGCGACCAGTCAGTGCCGGACGGCTTTGGCGGCACGGAGCCGCGCATCACCTGTAATGCGTACCTGACCACACAGCGTAAGGCGTGGGATGTGCTCAGCGATTTCTGCTCGGCGATGCGCTGTATGCCGGTATGGAACGGGCAGACGCTGACGTTCGTGCAGGACCGACCGTCGGATAAGACGTGGACCTATAACCGCAGTAATGTGGTGATGCCGGATGATGGCGCGCCGTTCCGCTACAGCTTCAGCGCCCTGAAGGACCGCCATAATGCCGTTGAGGTGAACTGGATTGACCCGAACAACGGCTGGGAGACGGCGACAGAGCTTGTTGAAGATACGCAGGCCATTGCCCGTTACGGTCGTAATGTTACGAAGATGGATGCCTTTGGCTGTACCAGCCGGGGGCAGGCACACCGCGCCGGGCTGTGGCTGATTAAAACAGAACTGCTGGAAACGCAGACCGTGGATTTCAGCGTCGGCGCAGAAGGGCTTCGCCATGTACCGGGCGATGTTATTGAAATCTGCGATGATGACTATGCCGGTATCAGCACCGGTGGTCGTGTGCTGGCGGTGAACAGCCAGACCCGGACGCTGACGCTCGACCGTGAAATCACGCTGCCATCCTCCGGTACCGCGCTGATAAGCCTGGTTGACGGAAGTGGCAATCCGGTCAGCGTGGAGGTTCAGTCCGTCACCGACGGCGTGAAGGTAAAAGTGAGCCGTGTTCCTGACGGTGTTGCTGAATACAGCGTATGGGAGCTGAAGCTGCCGACGCTGCGCCAGCGACTGTTCCGCTGCGTGAGTATCCGTGAGAACGACGACGGCACGTATGCCATCACCGCCGTGCAGCATGTGCCGGAAAAAGAGGCCATCGTGGATAACGGGGCGCACTTTGACGGCGAACAGAGTGGCACGGTGAATGGTGTCACGCCGCCAGCGGTGCAGCACCTGACCGCAGAAGTCACTGCAGACAGCGGGGAATATCAGGTGCTGGCGCGATGGGACACACCGAAGGTGGTGAAGGGCGTGAGTTTCCTGCTCCGTCTGACCGTAACAGCGGACGACGGCAGTGAGCGGCTGGTCAGCACGGCCCGGACGACGGAAACCACATACCGCTTCACGCAACTGGCGCTGGGGAACTACAGGCTGACAGTCCGGGCGGTAAATGCGTGGGGGCAGCAGGGCGATCCGGCGTCGGTATCGTTCCGGATTGCCGCACCGGCAGCACCGTCGAGGATTGAGCTGACGCCGGGCTATTTTCAGATAACCGCCACGCCGCATCTTGCCGTTTATGACCCGACGGTACAGTTTGAGTTCTGGTTCTCGGAAAAGCAGATTGCGGATATCAGACAGGTTGAAACCAGCACGCGTTATCTTGGTACGGCGCTGTACTGGATAGCCGCCAGTATCAATATCAAACCGGGCCATGATTATTACTTTTATATCCGCAGTGTGAACACCGTTGGCAAATCGGCATTCGTGGAGGCCGTCGGTCGGGCGAGCGATGATGCGGAAGGTTACCTGGATTTTTTCAAAGGCAAGATAACCGAATCCCATCTCGGCAAGGAGCTGCTGGAAAAAGTCGAGCTGACGGAGGATAACGCCAGCAGACTGGAGGAGTTTTCGAAAGAGTGGAAGGATGCCAGTGATAAGTGGAATGCCATGTGGGCTGTCAAAATTGAGCAGACCAAAGACGGCAAACATTATGTCGCGGGTATTGGCCTCAGCATGGAGGACACGGAGGAAGGCAAACTGAGCCAGTTTCTGGTTGCCGCCAATCGTATCGCATTTATTGACCCGGCAAACGGGAATGAAACGCCGATGTTTGTGGCGCAGGGCAACCAGATATTCATGAACGACGTGTTCCTGAAGCGCCTGACGGCCCCCACCATTACCAGCGGCGGCAATCCTCCGGCCTTTTCCCTGACACCGGACGGAAAGCTGACCGCTAAAAATGCGGATATCAGTGGCAGTGTGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACGGTACGCTGAGGGCGGAAAAAATCGTCGGGGACATTGTAAAGGCGGCGAGCGCGGCTTTTCCGCGCCAGCGTGAAAGCAGTGTGGACTGGCCGTCAGGTACCCGTACTGTCACCGTGACCGATGACCATCCTTTTGATCGCCAGATAGTGGTGCTTCCGCTGACGTTTCGCGGAAGTAAGCGTACTGTCAGCGGCAGGACAACGTATTCGATGTGTTATCTGAAAGTACTGATGAACGGTGCGGTGATTTATGATGGCGCGGCGAACGAGGCGGTACAGGTGTTCTCCCGTATTGTTGACATGCCAGCGGGTCGGGGAAACGTGATCCTGACGTTCACGCTTACGTCCACACGGCATTCGGCAGATATTCCGCCGTATACGTTTGCCAGCGATGTGCAGGTTATGGTGATTAAGAAACAGGCGCTGGGCATCAGCGTGGTCTGA'

c) We can use the codons dictionary in the bootcamp_utils package to do the translation. With this in hand, we can write our translate() function. We will scan the DNA sequence, generate a list of amino acids, and then join them into a protein sequence.

In [46]:
def translate(seq):
"""Translate a DNA sequence into protein"""
# Make sure sequence is upper case
seq = seq.upper()

# Find start codon
i = 0
while i < len(seq) + 2 and seq[i:i+3] != 'ATG':
i += 1

# Translate until the stop codon or end of string
prot = []
while i < len(seq) - 2 and seq[i:i+3] not in ('TAA', 'TGA', 'TAG'):
prot.append(bootcamp_utils.codons[seq[i:i+3]])
i += 3

return ''.join(prot)


d) We can now translate the protein

In [47]:
translate(salmonella_orf)

Out[47]:
'MGKGSSKGHTPREAKDNLKSTQLLSVIDAISEGPIEGPVDGLKSVLLNSTPVLDTEGNTNISGVTVVFRAGEQEQTPPEGFESSGSETVLGTEVKYDTPITRTITSANIDRLRFTFGVQALVETTSKGDRNPSEVRLLVQIQRNGGWVTEKDITIKGKTTSQYLASVVMGNLPPRPFNIRMRRMTPDSTTDQLQNKTLWSSYTEIIDVKQCYPNTALVGVQVDSEQFGSQQVSRNYHLRGRILQVPSNYNPQTRQYSGIWDGTFKPAYSNNMAWCLWDMLTHPRYGMGKRLGAADVDKWALYVIGQYCDQSVPDGFGGTEPRITCNAYLTTQRKAWDVLSDFCSAMRCMPVWNGQTLTFVQDRPSDKTWTYNRSNVVMPDDGAPFRYSFSALKDRHNAVEVNWIDPNNGWETATELVEDTQAIARYGRNVTKMDAFGCTSRGQAHRAGLWLIKTELLETQTVDFSVGAEGLRHVPGDVIEICDDDYAGISTGGRVLAVNSQTRTLTLDREITLPSSGTALISLVDGSGNPVSVEVQSVTDGVKVKVSRVPDGVAEYSVWELKLPTLRQRLFRCVSIRENDDGTYAITAVQHVPEKEAIVDNGAHFDGEQSGTVNGVTPPAVQHLTAEVTADSGEYQVLARWDTPKVVKGVSFLLRLTVTADDGSERLVSTARTTETTYRFTQLALGNYRLTVRAVNAWGQQGDPASVSFRIAAPAAPSRIELTPGYFQITATPHLAVYDPTVQFEFWFSEKQIADIRQVETSTRYLGTALYWIAASINIKPGHDYYFYIRSVNTVGKSAFVEAVGRASDDAEGYLDFFKGKITESHLGKELLEKVELTEDNASRLEEFSKEWKDASDKWNAMWAVKIEQTKDGKHYVAGIGLSMEDTEEGKLSQFLVAANRIAFIDPANGNETPMFVAQGNQIFMNDVFLKRLTAPTITSGGNPPAFSLTPDGKLTAKNADISGSVNANSGTLSNVTIAENCTINGTLRAEKIVGDIVKAASAAFPRQRESSVDWPSGTRTVTVTDDHPFDRQIVVLPLTFRGSKRTVSGRTTYSMCYLKVLMNGAVIYDGAANEAVQVFSRIVDMPAGRGNVILTFTLTSTRHSADIPPYTFASDVQVMVIKKQALGISVV'

Doing a BLAST search on this protein indicates that it is a histidine kinase involved in signaling.

e) We already are computing all of the ORFs. We an therefore just add a kwarg to our longest_orf() function to get the n longest ones.

In [48]:
def longest_orf(seq, n=1):
"""Longest ORF of a sequence."""
orfs = all_orfs(seq)

if len(orfs) == 0:
return ''
elif n == 1 or len(orfs) == 1:
return seq[orfs[0][0]:orfs[0][1]]
else:
return_list = []
for i in range(min(n, len(orfs))):
return_list.append(seq[orfs[i][0]:orfs[i][1]])
return tuple(return_list)


Now, we'll compute the ORFs, translate them, and make a FASTA file to submit for a BLAST search.

In [49]:
# Compute ORFs
orfs = longest_orf(seq, n=5)

# Translate them
prots = []
for orf in orfs:
prots.append(translate(orf))

# Make a FASTA file
with open('sal_seqs.faa', 'w') as f:
for i, prot in enumerate(prots):
f.write('> {0:d}\n'.format(i))
f.write(prot + '\n')


We can take a look at it to see what I did with the above code.

In [50]:
!cat sal_seqs.faa

> 0
> 1
> 2
> 3
> 4


Upon doing the BLAST search, I found that the genes are:

Length rank Description
1 histine kinase
2 DNA repair protein MutS
3 formate hydrogenlyase transcriptional activator
4 L-fucose isomerase
5 transcriptional regulator HilA (invasion regulator)

Computing environment¶

In [51]:
%load_ext watermark
%watermark -v -p re,numpy,jupyterlab

CPython 3.7.3
IPython 7.1.1

re 2.2.1
numpy 1.16.4
jupyterlab 0.35.5