Exercise 4.2: Restriction enzyme cut sites


Restriction enzymes cut DNA at specific locations called restriction sites. The sequence at a restriction site is called a recognition sequence. Here are the recognition sequences of some commonly used restriction enzymes.

Restriction enzyme    Recognition sequence
HindIII               AAGCTT
EcoRI                 GAATTC
KpnI                  GGTACC

a) New England Biolabs sells purified DNA of the genome of λ-phage, a bacteriophage that infects E. coli. You can download the FASTA file containing the sequence here. Use the function you wrote in Exercise 3.1 to extract the sequence.

b) Write a function with call signature

restriction_sites(seq, recog_seq)

that takes as arguments a sequence and the recognition sequence of a restriction enzyme and returns the indices of the first base of each of the restriction sites in the sequence. Use this function to find the indices of the restriction sites of λ-DNA for HindIII, EcoRI, and KpnI. Compare your results to those reported on the New England Biolabs datasheet.

Solution


[1]:
import re

a) I have downloaded the FASTA file to data/lambdaDNA.fasta. Let’s load it in.

[2]:
def read_fasta(filename):
    """Read a sequence in from a FASTA file containing a single sequence.

    We assume that the first line of the file is the descriptor and all
    subsequent lines are sequence.
    """
    with open(filename, 'r') as f:
        # Read in descriptor
        descriptor = f.readline().rstrip()

        # Read in sequence, stripping the whitespace from each line
        seq = ''
        line = f.readline().rstrip()
        while line != '':
            seq += line
            line = f.readline().rstrip()

    return descriptor, seq


_, seq = read_fasta('data/lambdaDNA.fasta')

We can check to make sure we got the whole sequence, which should be about 48.5 kbp.

[3]:
len(seq)
[3]:
48502

Looks good!

b) Our goal is to find all locations of the substring given by the recognition sequence in the genome. This is most easily accomplished with the re module, which uses regular expressions (a topic we will not cover in the bootcamp). Specifically, we use re.finditer() to find all occurrences of the recognition sequence in the sequence, and then use the start() method of each match to get the index of its first base.

[4]:
def restriction_sites_with_re(seq, recog_seq):
    """Find the indices of all restriction sites in a sequence."""
    sites = []
    for site in re.finditer(recog_seq, seq):
        sites.append(site.start())

    return sites
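To see how re.finditer() behaves, here is a quick check on a toy sequence (the sequence is made up for illustration). Note that re.finditer() returns only non-overlapping matches; none of the recognition sequences here can overlap themselves, so that is safe, but a self-overlapping pattern would require a lookahead.

```python
import re

# Toy sequence containing two EcoRI sites (GAATTC)
toy = "ACGAATTCGGGAATTCTT"
sites = [m.start() for m in re.finditer("GAATTC", toy)]
print(sites)  # [2, 10]

# A self-overlapping pattern needs a lookahead to catch every hit
overlapping = [m.start() for m in re.finditer("(?=AA)", "AAAA")]
print(overlapping)  # [0, 1, 2]
```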

Let’s check our restriction sites against the known sites.

[5]:
print('HindIII:', restriction_sites_with_re(seq, 'AAGCTT'))
print('EcoRI:  ', restriction_sites_with_re(seq, 'GAATTC'))
print('KpnI:   ', restriction_sites_with_re(seq, 'GGTACC'))
HindIII: [23129, 25156, 27478, 36894, 37458, 37583, 44140]
EcoRI:   [21225, 26103, 31746, 39167, 44971]
KpnI:    [17052, 18555]

These mostly match the listed values exactly, except that each index reported by our function is one less than the listed value because we index starting at zero. The one genuine discrepancy is that the datasheet does not contain the cut site at index 37583. This could be because NEB made a mistake, either in the sequencing or in determining the cut sites, or because they validated the cut sites experimentally and some unusual structure around that site precludes cutting. It could also be that the short fragment of DNA between the cut site at 37458 and the one they missed at 37583 is simply missed in a gel electrophoresis assay.
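To compare directly against the one-based positions on the datasheet, we can simply add one to each index; a quick sketch using the HindIII indices found above:

```python
# Zero-based HindIII indices found above
zero_based = [23129, 25156, 27478, 36894, 37458, 37583, 44140]

# Shift to one-based positions for comparison with the datasheet
one_based = [i + 1 for i in zero_based]
print(one_based)
```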

I will now write a function to do the same calculation without using regular expressions. We will simply make a pass through the sequence and store the indices where we match the recognition sequence.

[6]:
def restriction_sites(seq, recog_seq):
    """Find the indices of all restriction sites in a sequence."""
    # Initialize list of restriction sites
    sites = []

    # Check every substring for a match
    for i in range(len(seq) - len(recog_seq) + 1):
        if seq[i:i+len(recog_seq)] == recog_seq:
            sites.append(i)

    return sites

And let’s use this function.

[7]:
print('HindIII:', restriction_sites(seq, 'AAGCTT'))
print('EcoRI:  ', restriction_sites(seq, 'GAATTC'))
print('KpnI:   ', restriction_sites(seq, 'GGTACC'))
HindIII: [23129, 25156, 27478, 36894, 37458, 37583, 44140]
EcoRI:   [21225, 26103, 31746, 39167, 44971]
KpnI:    [17052, 18555]

For fun, we can compare the speeds of the two functions using the magic function %timeit, which runs a computation many times and reports the mean execution time. It is a handy way to check the speed of your functions.

[8]:
%timeit restriction_sites(seq, 'AAGCTT')
%timeit restriction_sites_with_re(seq, 'AAGCTT')
9.03 ms ± 44.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
165 µs ± 2.75 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The regular expression-based method is about 50 times faster! This is often the case; hand-coding something in pure Python can be slow compared to using packages that use pre-compiled code like re. However, as Donald Knuth said, “The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.” Get your code working, even if it is slow, and optimize only if you have to.
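As an aside, if you want to avoid regular expressions but still benefit from compiled code, Python’s built-in str.find() method (implemented in C) can be used in a loop. This is a sketch of an alternative, not the approach used above:

```python
def restriction_sites_with_find(seq, recog_seq):
    """Find the indices of all restriction sites using str.find()."""
    sites = []
    i = seq.find(recog_seq)
    while i != -1:
        sites.append(i)
        # Resume the search one base past the last hit
        i = seq.find(recog_seq, i + 1)
    return sites

# Quick check on a toy sequence with two EcoRI sites
print(restriction_sites_with_find("ACGAATTCGGGAATTCTT", "GAATTC"))
```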

Computing environment

[9]:
%load_ext watermark
%watermark -v -p re,jupyterlab
CPython 3.7.7
IPython 7.16.1

re 2.2.1
jupyterlab 2.1.5