Exercise 10.1: Pathogenicity islands

This exercise was inspired by Libeskind-Hadas and Bush, Computing for Biologists, Cambridge University Press, 2014.


For this problem, we will work with real data from the Salmonella enterica genome. The section of the genome we will work with is in the file ~git/bootcamp/data/salmonella_spi1_region.fna. I cut it out of the full genome. It contains Salmonella pathogenicity island I (SPI1), which contains genes for surface receptors for host-pathogen interactions.

Pathogenicity islands are often marked by different GC content than the rest of the genome. We will try to locate the pathogenicity island(s) in our section of the Salmonella genome by computing GC content.

a) Use principles of TDD to write a function with call signature gc_content(seq) that takes in a sequence and computes the GC content. It should return the fraction of bases in the sequence that are either G or C.

b) Again using principles of TDD, write a function with call signature gc_blocks(seq, block_size) that takes as input a sequence and a block size. Your function should have error checking to make sure len(seq) >= block_size. The function returns a Numpy array of length len(seq) - block_size + 1 where entry i is the GC content of subsequence seq[i:i+block_size]. Hint: When doing tests on floating point results, the np.allclose() and np.isclose() functions are useful.

c) Use the gc_blocks() function to compute the GC content of the SPI1 sequence with a block size of 1000 bases. Then, plot the GC content as a function of index in the sequence. Where do you think the pathogenicity islands are?

Solution


[1]:
import pytest

import numpy as np

import bokeh.plotting
import bokeh.io

bokeh.io.output_notebook()
Loading BokehJS ...

a) First let’s write our tests. In writing the tests, I am making the design decision that I will count the characters G, g, C, and c as contributing to GC content, and that I will not check to make sure the sequence is valid. I also make the design decision that an empty sequence has zero GC content.

[2]:
def test_gc_content():
    assert gc_content('') == 0.0
    assert gc_content('G') == 1.0
    assert gc_content('g') == 1.0
    assert gc_content('C') == 1.0
    assert gc_content('c') == 1.0
    assert gc_content('gcgcgc') == 1.0
    assert gc_content('aaatatata') == 0.0
    assert np.isclose(gc_content('ggatcggcga'), 0.7)
    assert np.isclose(gc_content('attgggggcaatta'), 3/7)

The function is fairly simple. We loop through the sequence with a stride equal to the block size, computing the GC content for each subsequence of that length. We start with a function to compute GC content for a sequence.

[3]:
def gc_content(seq):
    """GC content of a given sequence"""
    if seq == '':
        return 0.0

    seq = seq.upper()
    return (seq.count('G') + seq.count('C')) / len(seq)

Now let’s test it.

[4]:
test_gc_content()

Passage!

b) Next, we write the looping function, starting with its tests.

[5]:
def test_gc_blocks():
    with pytest.raises(RuntimeError) as excinfo:
        gc_blocks("", 1)
    excinfo.match("Sequence length less than block size.")

    with pytest.raises(RuntimeError) as excinfo:
        gc_blocks("", 0)
    excinfo.match("Block size cannot be less than 1.")

    with pytest.raises(RuntimeError) as excinfo:
        gc_blocks("gcgcgcgcg", 10)
    excinfo.match("Sequence length less than block size.")

    assert (
        np.allclose(gc_blocks("gcgcgcg", 4), 1.0)
        and len(gc_blocks("gcgcgcg", 4)) == 4
    )
    assert (
        np.allclose(gc_blocks("gcgcgcgc", 4), 1.0)
        and len(gc_blocks("gcgcgcgc", 4)) == 5
    )
    assert np.allclose(
        gc_blocks("gcgcgcgcat", 4),
        np.array([1.0, 1.0, 1.0, 1.0, 1.0, 0.75, 0.5]),
    )

    assert np.allclose(
        gc_blocks("gcgtagagc", 1),
        np.array([1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0]),
    )
    assert np.allclose(
        gc_blocks("gcgtagagc", 2),
        np.array([1.0, 1.0, 0.5, 0.0, 0.5, 0.5, 0.5, 1.0]),
    )
    assert np.allclose(
        gc_blocks("gcgtagagc", 3),
        np.array([1.0, 2 / 3, 1 / 3, 1 / 3, 1 / 3, 2 / 3, 2 / 3]),
    )

Now let’s write our function.

[6]:
def gc_blocks(seq, block_size):
    """
    Compute rolling GC content over a sequence.
    """
    if len(seq) < block_size:
        raise RuntimeError("Sequence length less than block size.")

    if block_size <= 0:
        raise RuntimeError("Block size cannot be less than 1.")

    # Loop through and compute GC content for each block of sequence
    gc = np.empty(len(seq) - block_size + 1)
    for i in range(len(seq) - block_size + 1):
        gc[i] = gc_content(seq[i:i+block_size])

    return gc

And the tests….

[7]:
test_gc_blocks()

Success! We now have a working function!

The problem with the function we have written, though, is that is it inefficient. It recomputes the GC content of each block_size-base block of bases. This is a lot of repeated calculation. Instead, we note that we could do the following:

  1. Compute the number of G’s and C’s in seq[0:block_size].

  2. The number of G’s and C’s in seq[1:block_size+1] is given by the number of bases in seq[0:block_size], which we already computed, plus 1 if seq[block_size] is a G or a C, minus 1 if seq[0] is a G or a C.

  3. We continue like this for each successive block, and divide by block_size at the end to get the GC content for each block.

This amounts to doing a rolling sum of the G-C content. A rolling sum computes the sum in every window (or block size) of an array. Let’s write a function to compute a rolling sum, starting, of course, with tests.

[8]:
def test_rolling_sum():
    assert np.allclose(rolling_sum(np.arange(10), 1), np.arange(10))
    assert np.allclose(rolling_sum(np.arange(10), 5), np.array([10, 15, 20, 25, 30, 35]))
    assert np.allclose(rolling_sum(np.array([1]), 1), np.array([1]))
    assert np.allclose(rolling_sum(np.array([1, -1, 1, -1, 1, -1]), 2), np.zeros(5))

Now, we can write the function.

[9]:
def rolling_sum(a, window):
    if len(a) < window:
        raise RuntimeError("Array length less than block size.")

    if window <= 0:
        raise RuntimeError("window cannot be less than 1.")

    # Initialize rolling sum
    rsum = np.empty(len(a) - window + 1)
    rsum[0] = np.sum(a[:window])

    # Populate sums
    for i in range(1, len(a) - window + 1):
        rsum[i] = rsum[i - 1] + a[i + window - 1] - a[i - 1]

    return rsum

And let’s test it!

[10]:
test_rolling_sum()

Very good! Now, to calculate GC content, we need to convert the sequence to an array of one’s and zeros to then apply a rolling sum. So, let’s re-write our gc_blocks() function using this technique.

[11]:
def gc_blocks(seq, block_size):
    """
    Compute rolling GC content over a sequence.
    """
    if len(seq) < block_size:
        raise RuntimeError("Sequence length less than block size.")

    if block_size <= 0:
        raise RuntimeError("Block size cannot be less than 1.")

    # Convert sequence to 1's and 0's
    a = np.empty(len(seq))
    for i, base in enumerate(seq):
        a[i] = gc_content(base)

    # Return rolling average
    return rolling_sum(a, block_size) / block_size

A great feature of the TDD approach is that you can run the same tests on your optimized function as you did in the first function you wrote. Let’s test this new, faster version.

[12]:
test_gc_blocks()

Success!

c) Let’s take this function for a spin, looking at 1000-base blocks. We will use the FASTA reader function from a previous exercise to read in the Salmonella genome fragment.

[13]:
def read_fasta(filename):
    """Read a sequence in from a FASTA file containing a single sequence.

    We assume that the first line of the file is the descriptor and all
    subsequent lines are sequence.
    """
    with open(filename, 'r') as f:
        # Read in descriptor
        descriptor = f.readline().rstrip()

        # Read in sequence, stripping the whitespace from each line
        seq = ''
        line = f.readline().rstrip()
        while line != '':
            seq += line
            line = f.readline().rstrip()

    return descriptor, seq


descriptor, seq = read_fasta('data/salmonella_spi1_region.fna')

gc = gc_blocks(seq, 1000)

Now that we have our GC content, we can plot it.

[14]:
p = bokeh.plotting.figure(
    frame_height=150,
    frame_width=500,
    x_axis_label='index (kb)',
    y_axis_label='GC content',
)

x = np.arange(len(gc)) / 1000
p.line(x, gc)

bokeh.io.show(p)

We can quite clearly see a pathogenicity island that begins at about 40 kb from the start of the sequence.

Computing environment

[15]:
%load_ext watermark
%watermark -v -p numpy,jupyterlab
CPython 3.7.7
IPython 7.16.1

numpy 1.18.5
jupyterlab 2.1.5