Exercise 10.1: Pathogenicity islands

This exercise was inspired by Libeskind-Hadas and Bush, Computing for Biologists, Cambridge University Press, 2014.


For this problem, we will work with real data from the Salmonella enterica genome. The section of the genome we will use, which I cut out of the full genome, is in the file ~git/bootcamp/data/salmonella_spi1_region.fna. It contains Salmonella pathogenicity island I (SPI1), which carries genes for surface receptors involved in host-pathogen interactions.
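To get the sequence into Python as a single string, something like the following may be a reasonable starting point. It is a minimal sketch that assumes the file is in FASTA format (suggested by the .fna extension) and holds a single record. The function name read_single_fasta is hypothetical, and converting to upper case is my own choice, not something the exercise requires.

```python
def read_single_fasta(path):
    """Return the sequence from a single-record FASTA file as one string.

    Assumption: lines starting with '>' are descriptor lines and every
    other non-empty line is part of the one sequence in the file.
    """
    pieces = []
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and the FASTA descriptor line
            if not line or line.startswith(">"):
                continue
            pieces.append(line)

    # Upper-casing is a convenience so later base counting is case-insensitive
    return "".join(pieces).upper()
```

With the file from this exercise, the call would look like seq = read_single_fasta(path), where path points to salmonella_spi1_region.fna on your machine.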

Pathogenicity islands are often marked by a GC content that differs from that of the rest of the genome. We will try to locate the pathogenicity island(s) in our section of the Salmonella genome by computing GC content.

a) Use principles of test-driven development (TDD) to write a function with call signature gc_content(seq) that takes in a sequence and computes the GC content. It should return the fraction of bases in the sequence that are either G or C.
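One way the test-first workflow for part (a) might look is sketched below: the tests are written before the function, and the function is then filled in until they pass. This is only one possible approach, not the intended solution. Treating lower-case input as valid and raising a ValueError for an empty sequence are assumptions of mine that the problem statement does not specify.

```python
# Tests come first in TDD; they describe what gc_content() should do.
def test_gc_content_known_values():
    assert gc_content("GCGC") == 1.0
    assert gc_content("ATAT") == 0.0
    assert gc_content("ATGC") == 0.5
    # Assumption: lower-case bases count the same as upper-case ones
    assert gc_content("atgc") == 0.5


def test_gc_content_rejects_empty():
    # Assumption: an empty sequence is an error rather than GC content 0
    try:
        gc_content("")
    except ValueError:
        return
    raise AssertionError("expected ValueError for an empty sequence")


# Then comes the function, written to make the tests above pass.
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    if len(seq) == 0:
        raise ValueError("seq must contain at least one base")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)
```

The test functions follow the test_ naming convention, so pytest should discover them if this code lives in a file it collects; they can also be called directly.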

b) Again using principles of TDD, write a function with call signature gc_blocks(seq, block_size) that takes as input a sequence and a block size. Your function should include error checking to ensure that len(seq) >= block_size. The function returns a NumPy array of length len(seq) - block_size + 1 where entry i is the GC content of subsequence seq[i:i+block_size]. Hint: When testing floating point results, the np.allclose() and np.isclose() functions are useful.
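As an illustration of part (b), here is a sketch of gc_blocks() together with a test that uses np.allclose(). It does not call gc_content() on every slice; instead it uses a cumulative sum of G/C indicators, which is a substitution of mine to keep the example self-contained and fast. Rejecting block sizes smaller than 1 and raising ValueError are likewise assumptions beyond what the problem states.

```python
import numpy as np


def test_gc_blocks_small_example():
    # Blocks of "ATGC" with size 2 are AT, TG, GC
    assert np.allclose(gc_blocks("ATGC", 2), [0.0, 0.5, 1.0])
    # Output length should be len(seq) - block_size + 1
    assert len(gc_blocks("ATGCATGC", 3)) == 6


def gc_blocks(seq, block_size):
    """GC content of every length-block_size window of seq.

    Entry i of the returned array is the GC content of
    seq[i:i+block_size].
    """
    # Assumption: non-positive block sizes are treated as errors too
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    if len(seq) < block_size:
        raise ValueError("block_size cannot exceed len(seq)")

    # 1.0 where the base is G or C, 0.0 elsewhere
    is_gc = np.array([base in "GC" for base in seq.upper()], dtype=float)

    # csum[k] is the number of G/C bases in seq[:k]
    csum = np.concatenate(([0.0], np.cumsum(is_gc)))

    # G/C count in seq[i:i+block_size] is csum[i+block_size] - csum[i]
    counts = csum[block_size:] - csum[:-block_size]

    return counts / block_size
```

A more direct version would loop over i and call gc_content(seq[i:i+block_size]); for a block size of 1000 that repeats a lot of counting, which is why the cumulative sum is used here.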

c) Use the gc_blocks() function to compute the GC content of the SPI1 sequence with a block size of 1000 bases. Then, plot the GC content as a function of index in the sequence. Where do you think the pathogenicity islands are?
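For the plot in part (c), a small helper along these lines may be enough. It is a sketch that assumes Matplotlib is the plotting library. The name plot_gc, its ax argument, and the axis label text are all hypothetical choices of mine.

```python
def plot_gc(gc, ax=None):
    """Plot GC content per block against the block's starting index.

    gc is the array returned by gc_blocks(); ax is an optional
    Matplotlib axes object to draw on.
    """
    if ax is None:
        # Assumption: Matplotlib is installed; imported here so it is
        # only required when no axes object is supplied
        import matplotlib.pyplot as plt

        _, ax = plt.subplots()

    # Entry i of gc belongs to the block starting at index i
    ax.plot(range(len(gc)), gc)
    ax.set_xlabel("start index of block")
    ax.set_ylabel("GC content")

    return ax
```

Calling plot_gc(gc_blocks(seq, 1000)) on the loaded sequence should then give the curve to inspect for regions whose GC content stands apart from the rest.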