Exercise 4.1: Pathogenicity islands¶
This exercise was inspired by Libeskind-Hadas and Bush, Computing for Biologists, Cambridge University Press, 2014.
For this and the next problem, we will work with real data from the Salmonella enterica genome. The section of the genome we will work with is in the file ~git/bootcamp/data/salmonella_spi1_region.fna
. I cut it out of the full genome. It contains Salmonella pathogenicity island I (SPI1), which contains genes for surface receptors for host-pathogen interactions.
Pathogenicity islands are often marked by different GC content than the rest of the genome. We will try to locate the pathogenicity island(s) in our section of the Salmonella genome by computing GC content.
a) Use principles of TDD to write a function that divides a sequence into blocks and computes the GC content for each block, returning a tuple. The function signature should look like
gc_blocks(seq, block_size)
To be clear, if seq = 'ATGACTACGT'
and block_size = 4
, the blocks to be considered are
ATGA
CTAC
and the function should return (0.25, 0.5)
. Note that the blocks are non-overlapping and that we don’t bother with the fact that end of the sequence that does not fit completely in a block.
b) Write a function that takes as input a sequence, block size, and a threshold GC content, and returns the original sequence where every base in a block with GC content above threshold is capitalized and every base below the threshold is lowercase. You would call the function like this:
mapped_seq = gc_map(seq, block_size, gc_thresh)
For example,
gc_map('ATGACTACGT', 4, 0.4)
returns 'atgaCTAC'
. Note that bases not included in GC blocks (in this case the GT
at the end of the sequence) are not included in the output sequence. Again, use principles of TDD.
c) Use the gc_map()
function to generate a GC content map for the Salmonella sequence with block_size = 1000
and gc_thresh = 0.45
. Where do you think the pathogenicity island is?
d) Write the GC-mapped sequence (with upper and lower characters) to a new FASTA file. Use the same description line (which began with a >
in the original FASTA file), and have line breaks every 60 characters in the sequence.
Solution¶
[1]:
import numpy as np
a) First let’s write our tests. In writing the tests, I am making the design decision that I will count the characters G
, g
, C
, and c
as contributing to GC content, and that I will not check to make sure the sequence is valid. I also make the design decision that an empty sequence has zero GC content.
[2]:
def test_gc_content():
assert gc_content('') == 0.0
assert gc_content('G') == 1.0
assert gc_content('g') == 1.0
assert gc_content('C') == 1.0
assert gc_content('c') == 1.0
assert gc_content('gcgcgc') == 1.0
assert gc_content('aaatatata') == 0.0
assert np.isclose(gc_content('ggatcggcga'), 0.7)
assert np.isclose(gc_content('attgggggcaatta'), 3/7)
The function is fairly simple. We loop through the sequence with a stride equal to the block size, computing the GC content for each subsequence of that length. We start with a function to compute GC content for a sequence.
[3]:
def gc_content(seq):
"""GC content of a given sequence"""
if seq == '':
return 0.0
seq = seq.upper()
return (seq.count('G') + seq.count('C')) / len(seq)
Now let’s test it.
[4]:
test_gc_content()
Passage! Next, we write the looping function, starting with its tests.
[5]:
def test_gc_blocks():
assert gc_blocks('', 10) == tuple()
assert gc_blocks('gcgcgcgcg', 10) == tuple()
assert gc_blocks('gcgcgcg', 4) == (1.0,)
assert gc_blocks('gcgcgcgc', 4) == (1.0, 1.0)
assert gc_blocks('gcgcgcgcat', 4) == (1.0, 1.0)
test_tuple = gc_blocks('gcgagcgcat', 4)
assert np.isclose(test_tuple[0], 0.75) and test_tuple[1] == 1.0
assert gc_blocks('gcgtagagc', 1) == (1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0)
assert gc_blocks('gcgtagagc', 2) == (1.0, 0.5, 0.5, 0.5)
assert np.isclose(gc_blocks('gcgtagagc', 3), (1.0, 1/3, 2/3)).sum() == 3
Now let’s write our function.
[6]:
def gc_blocks(seq, block_size):
"""
Divide sequence into non-overlapping blocks
and compute GC content of each block.
"""
blocks = []
for i in range(0, len(seq) - (len(seq) % block_size), block_size):
blocks.append(gc_content(seq[i:i+block_size]))
return tuple(blocks)
And the tests….
[7]:
test_gc_blocks()
Success! Let’s take this function for a spin, looking at 1000-base blocks. We will use the FASTA reader function from a previous exercise to read in the Salmonella genome fragment.
[8]:
def read_fasta(filename):
"""Read a sequence in from a FASTA file containing a single sequence.
We assume that the first line of the file is the descriptor and all
subsequent lines are sequence.
"""
with open(filename, 'r') as f:
# Read in descriptor
descriptor = f.readline().rstrip()
# Read in sequence, stripping the whitespace from each line
seq = ''
line = f.readline().rstrip()
while line != '':
seq += line
line = f.readline().rstrip()
return descriptor, seq
descriptor, seq = read_fasta('data/salmonella_spi1_region.fna')
gc_blocks(seq, 1000)
[8]:
(0.521,
0.556,
0.54,
0.498,
0.551,
0.508,
0.563,
0.484,
0.58,
0.557,
0.523,
0.524,
0.621,
0.556,
0.481,
0.57,
0.581,
0.614,
0.603,
0.526,
0.524,
0.591,
0.563,
0.596,
0.563,
0.6,
0.613,
0.594,
0.486,
0.554,
0.566,
0.592,
0.563,
0.537,
0.575,
0.501,
0.54,
0.555,
0.487,
0.416,
0.423,
0.371,
0.394,
0.48,
0.454,
0.474,
0.434,
0.396,
0.37,
0.456,
0.409,
0.457,
0.4,
0.405,
0.475,
0.47,
0.479,
0.494,
0.497,
0.516,
0.444,
0.433,
0.471,
0.458,
0.53,
0.458,
0.56,
0.427,
0.47,
0.438,
0.465,
0.473,
0.46,
0.399,
0.426,
0.359,
0.469,
0.433,
0.425,
0.504,
0.578,
0.576,
0.553,
0.531,
0.57,
0.599,
0.562,
0.555,
0.595,
0.586,
0.55,
0.56,
0.545,
0.553,
0.537,
0.519,
0.519,
0.567,
0.551,
0.548,
0.559,
0.527,
0.559,
0.529,
0.49,
0.533,
0.58,
0.545,
0.558,
0.575,
0.555,
0.49,
0.567,
0.515,
0.518,
0.485,
0.38,
0.461,
0.568,
0.575,
0.567,
0.57,
0.472,
0.513,
0.582,
0.476,
0.505,
0.524,
0.51,
0.512,
0.391,
0.463,
0.57,
0.546,
0.535,
0.525,
0.525,
0.529,
0.58,
0.555,
0.558,
0.563,
0.525,
0.505,
0.557,
0.554,
0.484,
0.525,
0.567,
0.467,
0.527,
0.55,
0.577,
0.554,
0.538,
0.429,
0.507,
0.557,
0.592,
0.595,
0.554,
0.521,
0.539,
0.521,
0.45,
0.608,
0.489,
0.477,
0.552,
0.508,
0.544,
0.495,
0.543,
0.56,
0.596,
0.547,
0.581,
0.548,
0.537,
0.529,
0.513,
0.499,
0.545,
0.567,
0.52,
0.545,
0.548,
0.522,
0.533,
0.558,
0.586,
0.469,
0.516,
0.509,
0.511,
0.569,
0.575,
0.559,
0.545,
0.502)
We get a tuple of GC content, which is hard to look at on screen, but this is useful for plotting GC content over the course of a sequence. We will learn how to plot later in the bootcamp.
b) We just use our already-written gc_content()
function to decide how to modify the string of the sequence. First, the tests. We make the design decision that we will truncate the sequence if the 3’-most end is shorter than the block length.
[9]:
def test_gc_map():
assert gc_map('', 10, 0.5) == ''
assert gc_map('ATATATATA', 4, 0.5) == 'atatatat'
assert gc_map('GCGCGCGCG', 4, 0.5) == 'GCGCGCGC'
assert gc_map('GATCGATCC', 4, 0.5) == 'GATCGATC'
assert gc_map('GATCGATCC', 4, 0.51) == 'gatcgatc'
assert gc_map('GATCGATCC', 3, 0.5) == 'gatCGATCC'
assert gc_map('GATCGATCC', 3, 0.75) == 'gatcgatcc'
assert gc_map('GATCGATCC', 3, 0.25) == 'GATCGATCC'
Now the function….
[10]:
def gc_map(seq, block_size, gc_thresh):
"""Give back seq with lowercase letters where GC content is low."""
out_seq = ''
# Determine GC content of each block and change string accordingly
for i in range(0, len(seq) - (len(seq) % block_size), block_size):
if gc_content(seq[i:i+block_size]) < gc_thresh:
out_seq += seq[i:i+block_size].lower()
else:
out_seq += seq[i:i+block_size].upper()
return out_seq
And the tests.
[11]:
test_gc_map()
Passage! We can now use these functions to analyze sequences of interest.
c) Let’s do it for Salmonella!
[12]:
sal_gcmap = gc_map(seq, 1000, 0.45)
To save on display space, we will not display the sequence here. Scrolling through the GC map file generated in the next part, the pathogenicity island appears to occur about a quarter of the way into this subsequence.
d) To write the file out, we use the fact that we conveniently kept the description text when we parsed the Salmonella FASTA file in the first place. We then just write the sal_gcmap
string in blocks of 60. We have to make sure to get the last few bases as well.
[13]:
# Write the result
with open('salmonella_spi1_region_gc_map.fna', 'w') as f:
# Write description text
f.write(descriptor + '\n')
# Write sequence in blocks of 60
i = 0
while i < len(sal_gcmap) - 59:
f.write(sal_gcmap[i:i+60] + '\n')
i += 60
# Write last line
f.write(sal_gcmap[i:] + '\n')
We’ll take a quick look to see it worked out ok.
[14]:
!head salmonella_spi1_region_gc_map.fna
print('...')
!tail salmonella_spi1_region_gc_map.fna
>gi|821161554|gb|CP011428.1| Salmonella enterica subsp. enterica strain YU39, complete genome, subsequence 3000000 to 3200000
AAAACCTTAGTAACTGGACTGCTGGGATTTTTCAGCCTGGATACGCTGGTAGATCTCTTC
ACGATGGACAGAAACTTCTTTCGGGGCGTTCACGCCAATACGCACCTGGTTGCCCTTCAC
CCCTAAAACTGTCACGGTGACCTCATCGCCAATCATGAGGGTCTCACCAACTCGACGAGT
CAGAATCAGCATTCTTTGCTCCTTGAAAGATTAAAAGAGTCGGGTCTCTCTGTATCCCGG
CATTATCCATCATATAACGCCAAAAAGTAAGCGATGACAAACACCTTAGGTGTAAGCAGT
CATGGCATTACATTCTGTTAAACCTAAGTTTAGCCGATATACAAAACTTCAACCTGACTT
TATCGTTGTCGATAGCGTTGACGTAAACGCCGCAGCACGGGCTGCGGCGCCAACGAACGC
TTATAATTATTGCAATTTTGCGCTGACCCAGCCTTGTACACTGGCTAACGCTGCAGGCAG
AGCTGCCGCATCCGTACCACCGGCTTGCGCCATGTCCGGACGACCGCCACCCTTACCGCC
...
ACGCATTTCTCCCGTGCAGGTCACATTTGCCCGACACGGCGGGGCAAGAGGCTTGAACAG
ACGTTCATTTTCCGTAAAACTGGCGTAATGTAAGCGTTTACCCACTATAGGTATTATCAT
GGCGACCATAAAAGATGTAGCCCGACTGGCCGGTGTTTCAGTCGCCACCGTTTCTCGCGT
TATTAACGATTCGCCAAAAGCCAGCGAAGCGTCCCGGCTGGCGGTAACCAGCGCAATGGA
GTCCCTGAGCTATCACCCTAACGCCAACGCGCGCGCGCTGGCACAGCAGGCAACGGAAAC
CCTCGGTCTGGTGGTCGGCGACGTTTCCGATCCTTTTTTCGGCGCGATGGTGAAAGCCGT
TGAACAGGTGGCGTATCACACCGGCAATTTTTTACTGATTGGCAACGGGTATCATAACGA
ACAAAAAGAGCGTCAGGCTATTGAACAGTTGATTCGTCATCGTTGCGCAGCGTTAGTGGT
GCACGCCAAAATGATTCCGGATGCGGACCTGGCCTCATTAATGAAGCAAATCCCCGGCAT
GGTGCTGATTAACCGCATTT
Looks good!
Computing environment¶
[15]:
%load_ext watermark
%watermark -v -p numpy,jupyterlab
Python implementation: CPython
Python version : 3.8.10
IPython version : 7.22.0
numpy : 1.20.2
jupyterlab: 3.0.14