Exercise 2

This exercise was generated from a Jupyter notebook. You can download the notebook here.

Problems 2.2 and 2.3 were inspired by Libeskind-Hadas and Bush, Computing for Biologists, Cambridge University Press, 2014.

For all problems, we will use our functions on real data from the Salmonella enterica genome. You can download the section of the genome we will work with here. I cut if out of the full genome. It contains Salmonella pathogenicity island I (SPI1), which is contains genes for surface receptors for host-pathogen interactions.

Problem 2.1: Parsing a FASTA file

In later tutorials, we will use some tools that facilitate parsing files in bioinformatics. In this problem, though, we will work on our file I/O skills. First, download the FASTA file containing the section of the Salmonella genome we will be working with.

a) Use command line tools to put the FASTA file in a directory where you can work with it. Use command line tools to investigate the file. You will notice that the first line begins with a >, signifying that the line contains information about the sequence. The remainder of the lines are the sequence itself.

b) Use the file I/O skill you have learned to read in the sequence and store it as a single string with no gaps.

Problem 2.2: Pathogenicity islands

Pathogenicity islands are often marked by different GC content than the rest of the genome. We will try to locate the pathogenicity island(s) in our section of the Salmonella genome by computing GC content.

a) Write a function that divides a sequence into blocks and computes the GC content for each block, returning a tuple. The function signature should look like

gc_blocks(seq, block_size)

To be clear, if seq = 'ATGACTACGT and block_size = 4, the blocks to be considered are

ATGA
CTAC

and the function should return (0.25, 0.5). Note that the blocks are non-overlapping and that we don't bother with the end of the sequence that does not fit completely in a block.

b) Write a function that takes as input a sequence, block size, and a theshold GC content, and returns the original sequence where every base in a block with GC content above threshold is capitalized and every base below the threshold is lowercase. You would call the function like this:

mapped_seq = gc_map(seq, block_size, gc_thresh)

For example,

gc_map('ATGACTACGT', 4, 0.4)

returns 'atgaCTAC'. Note that bases not included in GC blocks are truncated.

c) Use the gc_map() function to generate a GC content map for the Salmonella sequence with block_size = 1000 and gc_thresh = 0.45. Where do you think the pathogenicity island is?

d) Write the GC-mapped sequence (with upper and lower characters) to a new FASTA file. Use the same description line (which began with a > in the original FASTA file) and have line breaks every 60 characters in the sequence.

Problem 2.3: ORF detection

a) Write a function, longest_orf(), that takes a DNA sequence as input and finds the longest open reading frame (ORF) in the sequence (we will not consider reverse complements). A sequence fragment constitutes an ORF if the following are true.

  1. It begins with ATG.
  2. It ends with any of TGA, TAG, or TAA.
  3. The total number of bases is a multiple of 3.

Note that the sequence ATG may appear in the middle of an ORF. So, for example,

GGATGATGATGTAAAAC

has two ORFs, ATGATGATGTAA and ATGATGTAA. You would return the first one, since it is longer of these two.

Use TDD principles. That is, be sure to write unit tests for this function.

Hint: The statement for this problem is a bit ambiguous as it is written. What other specification might need for this function. If you can't think of this right away, start writing tests, and it might become apparent to you.

b) Use your function to find the longest ORF from the section of the Salmonella genome we are investigating.

c) Write a function that converts a DNA sequence into a protein sequence. The dictionary we developed in Lesson 11 will help you.

d) Translate the longest ORF you generated in part (b) and perform a BLAST search. Search for the protein sequence (a blastp query). What gene is it?

e) [extra credit] Modify your function to return the $n$ longest ORFs. Compute the five longest ORFs for the Salmonella genome section we are working with. Perform BLAST searches on them. What are they?