Exercise 2.2: Restriction enzyme cut sites


Restriction enzymes cut DNA at specific locations called restriction sites. The sequence at a restriction site is called a recognition sequence. Here are the recognition sequences of some commonly used restriction enzymes.

Restriction enzyme    Recognition sequence
HindIII               AAGCTT
EcoRI                 GAATTC
KpnI                  GGTACC

a) New England Biolabs sells purified DNA of the genome of λ-phage, a bacteriophage that infects E. coli. You can download the FASTA file containing the sequence here. Use the function you wrote in Exercise 2.1 to extract the sequence.

b) Write a function with call signature

restriction_sites(seq, recog_seq)

that takes as arguments a sequence and the recognition sequence of a restriction enzyme and returns the indices of the first base of each of the restriction sites in the sequence. Use this function to find the indices of the restriction sites of λ-DNA for HindIII, EcoRI, and KpnI. Compare your results with those given here, which contain a comprehensive list of locations of restriction sites for a variety of enzymes.

Solution


[1]:
import re

a) I have downloaded the FASTA file to data/lambdaDNA.fasta. Let’s load it in.

[2]:
def read_fasta(filename):
    """Read a sequence in from a FASTA file containing a single sequence.

    We assume that the first line of the file is the descriptor and all
    subsequent lines are sequence.
    """
    with open(filename, 'r') as f:
        # Read in descriptor
        descriptor = f.readline().rstrip()

        # Read in sequence, stripping the whitespace from each line
        seq = ''
        line = f.readline().rstrip()
        while line != '':
            seq += line
            line = f.readline().rstrip()

    return descriptor, seq


_, seq = read_fasta('data/lambdaDNA.fasta')

We can check to make sure we got the whole sequence, which should be about 48.5 kbp.

[3]:
len(seq)
[3]:
48502

Looks good!

b) Our goal is to find all locations of the substring given by the recognition sequence in the genome. This is most easily accomplished using the re module, which provides regular expressions (we will cover them in an auxiliary lesson). Specifically, we use re.finditer() to automatically find all occurrences of the recognition sequence in the sequence, and then use the start() method of each match to get the index of its first base.

[4]:
def restriction_sites_with_re(seq, recog_seq):
    """Find the indices of all restriction sites in a sequence."""
    sites = []
    for site in re.finditer(recog_seq, seq):
        sites.append(site.start())

    return sites

Let’s check our restriction sites against the known sites.

[5]:
print('HindIII:', restriction_sites_with_re(seq, 'AAGCTT'))
print('EcoRI:  ', restriction_sites_with_re(seq, 'GAATTC'))
print('KpnI:   ', restriction_sites_with_re(seq, 'GGTACC'))
HindIII: [23129, 25156, 27478, 36894, 37458, 37583, 44140]
EcoRI:   [21225, 26103, 31746, 39167, 44971]
KpnI:    [17052, 18555]

These mostly match the listed values exactly, except that each index reported by our function is one less than the listed value because we index starting at zero. The one discrepancy is that the file we are comparing to does not contain the cut site at index 37583. According to the restriction map from NEB, NEB sells λ DNA that is a derivative of the strain λ cI857ind1 Sam7. This particular strain has a point mutation in the cI gene. Coincidentally, the point mutation results in an extra HindIII site at 37583. Note that this site is very close to the site at 37458, which means that you might not see the resulting short fragment of DNA between those two sites in a gel, since you might run that fragment off the gel.
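To see just how short that fragment is, we can compute the spacing between consecutive HindIII sites. This is a quick sketch that is not part of the exercise; it simply reuses the restriction_sites_with_re() function defined above.

# Illustrative check (not part of the exercise): spacings between
# consecutive HindIII sites, i.e., approximate fragment lengths in bp
hind_sites = restriction_sites_with_re(seq, 'AAGCTT')
spacings = [j - i for i, j in zip(hind_sites[:-1], hind_sites[1:])]
print(spacings)

From the indices printed above, the fragment between the sites at 37458 and 37583 is only 125 bp long, which is why it could easily be lost off the end of a gel.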

I will now write a function to do the same calculation without using regular expressions. We will simply make a pass through the sequence and store the indices where we match the recognition sequence.

[6]:
def restriction_sites(seq, recog_seq):
    """Find the indices of all restriction sites in a sequence."""
    # Initialize list of restriction sites
    sites = []

    # Check every substring for a match; the + 1 ensures a site ending
    # at the very last base of the sequence is not missed
    for i in range(len(seq) - len(recog_seq) + 1):
        if seq[i:i+len(recog_seq)] == recog_seq:
            sites.append(i)

    return sites

And let’s use this function.

[7]:
print('HindIII:', restriction_sites(seq, 'AAGCTT'))
print('EcoRI:  ', restriction_sites(seq, 'GAATTC'))
print('KpnI:   ', restriction_sites(seq, 'GGTACC'))
HindIII: [23129, 25156, 27478, 36894, 37458, 37583, 44140]
EcoRI:   [21225, 26103, 31746, 39167, 44971]
KpnI:    [17052, 18555]

For fun, we can compare the speeds of the two functions using the magic command %timeit, which runs a statement many times and reports the mean execution time. It is a handy way to check the speed of your functions.

[8]:
%timeit restriction_sites(seq, 'AAGCTT')
%timeit restriction_sites_with_re(seq, 'AAGCTT')
9.44 ms ± 39.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
146 µs ± 235 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

The regular expression-based method is about 65 times faster! This is often the case; hand-coding something in pure Python can be slow compared to using packages like re that run pre-compiled code. However, as Donald Knuth said, “The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.” Get your code working, even if it is slow, and optimize only if you have to.
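As an aside (this sketch is not part of the original solution), much of the speed difference comes from doing the search loop in pre-compiled code rather than in pure Python. The built-in str.find() method gives a similar benefit without regular expressions; an implementation along those lines, here called restriction_sites_with_find() for illustration, might look like this:

def restriction_sites_with_find(seq, recog_seq):
    """Find the indices of all restriction sites using str.find()."""
    sites = []

    # str.find() returns -1 when there are no more matches
    i = seq.find(recog_seq)
    while i != -1:
        sites.append(i)
        # Resume the search just past the start of the last match
        i = seq.find(recog_seq, i + 1)

    return sites

For the recognition sequences used here, this returns the same indices as the two functions above.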

Computing environment

[9]:
%load_ext watermark
%watermark -v -p re,jupyterlab
Python implementation: CPython
Python version       : 3.9.12
IPython version      : 8.3.0

re        : 2.2.1
jupyterlab: 3.3.2