Exercise 4.1: Pathogenicity islands¶
This exercise was inspired by Libeskind-Hadas and Bush, Computing for Biologists, Cambridge University Press, 2014.
For this and the next problem, we will work with real data from the Salmonella enterica genome. The section of the genome we will work with is in the file ~git/bootcamp/data/salmonella_spi1_region.fna
. I cut it out of the full genome. It contains Salmonella pathogenicity island I (SPI1), which contains genes for surface receptors for host-pathogen interactions.
Pathogenicity islands are often marked by different GC content than the rest of the genome. We will try to locate the pathogenicity island(s) in our section of the Salmonella genome by computing GC content.
a) Use principles of TDD to write a function that divides a sequence into blocks and computes the GC content for each block, returning a tuple. The function signature should look like
To be clear, if seq = 'ATGACTACGT'
and block_size = 4
, the blocks to be considered are
and the function should return (0.25, 0.5)
. Note that the blocks are non-overlapping and that we don’t bother with the fact that end of the sequence that does not fit completely in a block.
b) Write a function that takes as input a sequence, block size, and a threshold GC content, and returns the original sequence where every base in a block with GC content above threshold is capitalized and every base below the threshold is lowercase. You would call the function like this:
For example,
returns 'atgaCTAC'
. Note that bases not included in GC blocks (in this case the GT
at the end of the sequence) are not included in the output sequence. Again, use principles of TDD.
c) Use the gc_map()
function to generate a GC content map for the Salmonella sequence with block_size = 1000
and gc_thresh = 0.45
. Where do you think the pathogenicity island is?
d) Write the GC-mapped sequence (with upper and lower characters) to a new FASTA file. Use the same description line (which began with a >
in the original FASTA file), and have line breaks every 60 characters in the sequence.