Exercise 4.2: ORF detection

(This exercise was inspired by Libeskind-Hadas and Bush, Computing for Biologists, Cambridge University Press, 2014.


For this exercise, again use principles of TDD to help you write your functions.

a) Write a function, longest_orf(), that takes a DNA sequence as input and finds the longest open reading frame (ORF) in the sequence (we will not consider reverse complements). A sequence fragment constitutes an ORF if the following are all true.

  1. It begins with ATG.

  2. It ends with any of TGA, TAG, or TAA.

  3. The total number of bases is a multiple of 3.

Note that the sequence ATG may appear in the middle of an ORF. So, for example,

GGATGATGATGTAAAAC

has two ORFs, ATGATGATGTAA and ATGATGTAA. You would return the first one, since it is longer of these two.

Hint: The statement for this problem is a bit ambiguous as it is written. What other specification might you need for this function?

b) Use your function to find the longest ORF from the section of the Salmonella genome we are investigating.

c) Write a function that converts a DNA sequence into a protein sequence. You can of course use the bootcamp_utils module.

d) Translate the longest ORF you generated in part (b) into a protein sequence and perform a BLAST search. Search for the protein sequence (a blastp query). What gene is it?

e) [Bonus challenge] Modify your function to return the n longest ORFs. Compute the five longest ORFs for the Salmonella genome section we are working with. Perform BLAST searches on them. What are they?