Lesson 37: Algorithmic complexity
[1]:
import itertools
import numpy as np
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
So far, we have written some of our own simple programs with loops where necessary. Most of the tasks we’ve coded up have been pretty straightforward and there is a more or less obvious way to approach things.
For some problems, and many important ones, the algorithm, or procedure, used to do the desired computation is important. A calculation done with one algorithm may be much, much faster than when done with another.
Unless you are in the business of coming up with novel algorithms to solve your problem of interest, most of the algorithms you use in your work in biology will be those developed by someone else. Nonetheless, it is important for you to understand what goes into algorithms and that there can be huge differences in computational time depending on the algorithm used.
The longest common subsequence
As an illustration of this idea, we will work out an important algorithm in bioinformatics. Our goal is to compute the length of the longest common subsequence (LCS) between two sequences. (This is not the longest common substring.) A subsequence is a sequence that can be generated by deleting entries in the original sequence, but leaving the order unchanged. For example, the subsequences of length three of ATGC
are:
ATG
ATC
AGC
TGC
Each of these subsequences contains one deletion. There are four deletions we could choose, so there are four subsequences. The LCS between two sequences is the longest subsequence that both share.
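As a quick check of this enumeration, we can generate these subsequences directly with itertools.combinations(), which we imported above and will meet again shortly (a sketch, separate from the algorithm we will develop):

# combinations() deletes entries but preserves order, which is
# exactly the definition of a subsequence
sorted("".join(tup) for tup in itertools.combinations("ATGC", 3))

This gives ['AGC', 'ATC', 'ATG', 'TGC'], the four subsequences listed above.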
The LCS is useful for comparing two sequences. Deletions and insertions are common throughout evolution, so the LCS gives a good indication of how similar two sequences are, even when insertions or deletions are present.
Tests!
As we develop an algorithm to compute the LCS between two sequences, we should write tests to make sure our algorithm gives the right answer. Let’s write some! Whatever we call our function (we’ll call a generic one lcslen), it should be called as lcslen(seq1, seq2) and return an integer, the length of the longest common subsequence.
[2]:
def test_lcslen(lcslen):
    """Test a given function for computing LCS length."""
    # Define test sequences
    seq1 = 'abcd'
    seq2 = 'abcdefg'
    seq3 = 'adef'

    assert lcslen('', '') == 0
    assert lcslen('', seq1) == 0
    assert lcslen(seq1, seq1) == 4
    assert lcslen(seq1, seq2) == 4
    assert lcslen(seq1, seq3) == 2
    assert lcslen(seq2, seq3) == 4
Let’s fail the test
Remember, under the principles of TDD, we need to fail the test first. Our first attempt will be a naive implementation, so we’ll call it naive_lcslen().
[3]:
def naive_lcslen(seq1, seq2):
    """
    Return the length of the longest common subsequence
    between two sequences.
    """
    return 0
And now we’ll fail the test.
[4]:
test_lcslen(naive_lcslen)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-4-da85baf6f332> in <module>
----> 1 test_lcslen(naive_lcslen)
<ipython-input-2-00e4ebc51190> in test_lcslen(lcslen)
8 assert lcslen('', '') == 0
9 assert lcslen('', seq1) == 0
---> 10 assert lcslen(seq1, seq1) == 4
11 assert lcslen(seq1, seq2) == 4
12 assert lcslen(seq1, seq3) == 2
AssertionError:
Naive strategy
Thinking off the top of my head, the following seems like a reasonable algorithm. Let \(m\) be the length of seq1 and \(n\) be the length of seq2, with \(m \ge n\).

1. Generate all subsequences of seq1 that are of length \(n\).
2. See if any of these subsequences matches seq2. If so, seq2 itself is the LCS.
3. Generate all subsequences of seq1 and of seq2 of length \(q = n - 1\).
4. Compare these two sets of subsequences and see if any of the subsequences match. If they do, return them.
5. Go to step 3, decrementing \(q\) by one each cycle until we get to \(q = 1\).
6. If no common subsequence is found when \(q = 1\), return an empty string.
Finding all subsequences
To implement this strategy, we need a function to generate all subsequences of a given length. Its call signature will be allsubseqs(seq, n), and it will return a set of all of the subsequences. A set is a built-in Python data type that is like a list, but contains only unique elements. It is unordered, and its elements must be hashable.
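For example (a quick demonstration of set behavior, separate from the LCS problem):

# Duplicate entries are dropped and order is not preserved
set(['a', 'b', 'a', 'c'])    # {'a', 'b', 'c'}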
It turns out that we don’t need to write this function! The itertools.combinations() function in the standard library gives this functionality. We don’t need to test it (the standard library is trustworthy and carefully unit tested), so we can just use it.
[5]:
def allsubseqs(seq, n):
    """Return a set containing all subsequences of length n"""
    return set(itertools.combinations(seq, n))
Let’s give it a try.
[6]:
allsubseqs('abcd', 3)
[6]:
{('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'c', 'd'), ('b', 'c', 'd')}
The resulting set has tuples as elements, but that’s ok, since we can just compare tuples to other tuples.
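For example (a quick check):

# Tuples compare elementwise, so set intersection of
# tuples behaves as we need
('a', 'b', 'c') == tuple('abc')    # True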
Implementation of naive strategy
Now we’ll code up the main LCS function according to the algorithm we laid out in words.
[7]:
def naive_lcslen(seq1, seq2, return_lcs=False):
    """Naive implementation of LCS length calculator."""
    # For convenience, make sure seq1 is longer than seq2
    if len(seq1) < len(seq2):
        seq1, seq2 = seq2, seq1

    # Initialize length of subsequence to consider
    q = len(seq2)

    # Compare subsequences for a match
    while q > 0:
        # Generate subsequences
        seq1_subseqs = allsubseqs(seq1, q)
        seq2_subseqs = allsubseqs(seq2, q)

        # Compute the intersection (common subsequences)
        lcs_set = seq1_subseqs.intersection(seq2_subseqs)

        # If the intersection is nonempty, we've found the LCS
        if len(lcs_set) > 0:
            break

        # Decrement length of subsequence
        q -= 1

    # If we want to return the LCS, compute it
    if return_lcs:
        return q, tuple(["".join(lcs) for lcs in lcs_set])
    else:
        return q
Notice that since we end up computing the actual LCSs, and not just their length, we can return all of the LCSs (the LCS is not necessarily unique) if we like. I therefore put in the return_lcs kwarg and return the LCSs as well. Now, let’s run our test on it to make sure it works.
[8]:
test_lcslen(naive_lcslen)
Hurray! It passed. Now, let’s try it with some biological-type sequences. We typically compare protein sequences, since we don’t want frame shifts. We’ll first code up a random protein sequence generator.
[9]:
rg = np.random.default_rng(3252)

def random_protein_sequence(n):
    """Generate a random protein sequence."""
    return "".join(rg.choice(list("ACDEFGHIKLMNPQRSTVWY"), size=n))
Let’s use that to generate some test sequences and find the LCS. Note that the LCS is not guaranteed to be unique; our naive implementation returns all of the LCSs it finds.
[10]:
# Generate random sequences and compute LCS
seq1 = random_protein_sequence(20)
seq2 = random_protein_sequence(24)
len_lcs, lcs = naive_lcslen(seq1, seq2, return_lcs=True)
print("""
seq1: {seq1:s}
seq2: {seq2:s}
LCS length: {len_lcs:d}
""".format(seq1=seq1, seq2=seq2, len_lcs=len_lcs))
print("The LCS(s) is/are:", lcs)
seq1: SEAARAQFEKLDGKFKFPHH
seq2: IEHMSYSMKHRMNWLPSKVAMQHI
LCS length: 5
The LCS(s) is/are: ('ERLPH', 'ERAQH', 'EKLPH', 'SRLKH', 'ERLKH', 'SKLPH', 'SRLPH', 'SKLKH', 'EKLKH', 'SRAQH')
The naive implementation is sloooooow
Wow. That took a while, and we were only comparing sequences of 20-some residues. Why?
Let’s think about what the algorithm is doing, and let’s think about it in the worst-case scenario in which it does not find an LCS at all, meaning it has to go all the way through the while loop. At each iteration of the while loop, the allsubseqs function computes all subsequences of a given length. As you might guess from the name of the itertools.combinations() function, there are
\begin{align} \begin{pmatrix} n\\k \end{pmatrix} = \frac{n!}{(n-k)! k!} \end{align}
subsequences of length \(k\) for a sequence of length \(n\). You can also see this by counting how many deletions you need to make to get a subsequence. There are
\begin{align} \begin{pmatrix} n\\n-k \end{pmatrix} = \begin{pmatrix} n\\k \end{pmatrix} \end{align}
such deletions. This number of combinations can be large. For example, to find all subsequences of length 10 in a sequence of length 20, we would have to generate
\begin{align} \begin{pmatrix} 20\\10 \end{pmatrix} = 184,756 \end{align}
subsequences (twice) and then compare them. In the worst case, the total number of subsequences you would have to generate is
\begin{align} 2 \sum_{k=0}^n \begin{pmatrix} n\\k \end{pmatrix} = 2^{n+1}. \end{align}
This can be huge! For typical protein sequences of length \(n = 300\), this amounts to about \(10^{90}\) generated subsequences! The biggest set of these has about \(10^{88}\) subsequences in it. This is more than the number of atoms in the observable universe. We could never even store the subsequences!
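We can verify these counts numerically. Here is a quick sketch using math.comb from the standard library (available in Python 3.8+, which this notebook uses):

import math

# Worst-case total number of generated subsequences for n = 20;
# the sum of the binomial coefficients equals 2**n
n = 20
total = 2 * sum(math.comb(n, k) for k in range(n + 1))
print(total == 2 ** (n + 1))    # True

# The biggest single set of subsequences for the n = 300 example
print(math.comb(300, 150))      # approximately 9.4e88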
Time complexity
Because the size of the problem, \(n\), appears in the exponent in the number of operations required for our naive algorithm, it is said to have exponential time complexity. Exponential time algorithms are sometimes necessary, but many problems can be solved much more efficiently with more clever algorithms.
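To get a feel for how quickly exponential complexity outpaces polynomial complexity, here is a quick illustration (the counts are exact, since Python integers have arbitrary precision):

# Exponential growth rapidly dwarfs polynomial growth
for n in [10, 20, 30, 40]:
    print(n, 2**n, n**2)

# 10           1024    100
# 20        1048576    400
# 30     1073741824    900
# 40  1099511627776   1600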
Dynamic programming solution
Let’s take a new strategy: divide and conquer. We’ll start small, by considering the tiniest substrings of each sequence and comparing them. Note the difference: a substring does not have any deletions; a subsequence does. We’ll then keep comparing longer and longer substrings, computing the length of the LCS as we go, until we get to the whole of each sequence. The hope is that we can reuse the information from smaller substrings as we build up, to avoid re-computing things. Let’s map out our algorithm.
We have an \((m+1) \times (n+1)\) matrix called lcs_mat. Entry lcs_mat[i, j] stores the length of the LCS of the substrings seq1[0:i] and seq2[0:j].
Let’s say we are looking at entry \((i, j)\), and we have already computed the entries for all rows less than \(i\) and all columns less than \(j\).
Now, if seq1[i-1] == seq2[j-1], we know that

lcs_mat[i, j] = lcs_mat[i-1, j-1] + 1.

Why? Because we just matched another character, so we extend the common subsequence length by one.
But if seq1[i-1] != seq2[j-1], we cannot have a longer common subsequence than we have already detected. Either the longest common subsequence of the substrings (seq1[0:i-1], seq2[0:j]) or that of (seq1[0:i], seq2[0:j-1]) is still the longest.
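Putting the two cases together, the recurrence is (a restatement of the rules above in equation form):

\begin{align}
\text{lcs\_mat}[i, j] = \begin{cases}
\text{lcs\_mat}[i-1, j-1] + 1 & \text{if } \text{seq1}[i-1] = \text{seq2}[j-1], \\[0.5em]
\max\left(\text{lcs\_mat}[i-1, j],\; \text{lcs\_mat}[i, j-1]\right) & \text{otherwise}.
\end{cases}
\end{align}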
The cool thing is that we can trivially compute the first row and first column of lcs_mat. They’re all zeros, since the LCS of any string with the empty string has length zero. So, we just keep moving i and j forward, checking to see if there is a character match at position (i, j) or using results we already computed.

An example lcs_mat is shown below, for seq1 = CCA and seq2 = GCAT (their LCS is CA, which has length 2). We first fill out the first row and first column, and then proceed left to right, one row at a time.
|   | Ø | G | C | A | T |
|---|---|---|---|---|---|
| Ø | 0 | 0 | 0 | 0 | 0 |
| C | 0 | 0 | 1 | 1 | 1 |
| C | 0 | 0 | 1 | 1 | 1 |
| A | 0 | 0 | 1 | 2 | 2 |
This style of building up a solution to a big problem from smaller subproblems is called dynamic programming. The name is an odd historical artifact. It’s not especially “dynamic,” it’s just a name. Let’s code this up in Python!
[11]:
def dp_lcslen(seq1, seq2):
    """Length of LCS using a dynamic program."""
    # Initialize matrix of substring LCS lengths
    lcs_mat = np.zeros((len(seq1) + 1, len(seq2) + 1), dtype=int)

    # Build matrix, left to right, row by row
    for i in range(1, len(seq1) + 1):
        for j in range(1, len(seq2) + 1):
            if seq1[i - 1] == seq2[j - 1]:
                lcs_mat[i, j] = lcs_mat[i - 1, j - 1] + 1
            else:
                lcs_mat[i, j] = max(lcs_mat[i - 1, j], lcs_mat[i, j - 1])

    return lcs_mat[-1, -1]
Let’s run it through the unit test!
[12]:
test_lcslen(dp_lcslen)
Excellent! It tests out! Now, let’s try it out on our protein sequences.
[13]:
dp_lcslen(seq1, seq2)
[13]:
5
We got the correct answer, and it was way faster!
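To put rough numbers on the speed difference, we can time the two implementations on the same sequences (a quick sketch using the standard library’s time module; exact timings will vary from machine to machine, and the %timeit magic is a more careful way to do this in a notebook):

import time

start = time.time()
naive_lcslen(seq1, seq2)
print("naive: {0:.2f} s".format(time.time() - start))

start = time.time()
dp_lcslen(seq1, seq2)
print("dynamic program: {0:.6f} s".format(time.time() - start))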
Time complexity of the dynamic program
To get the time complexity of the dynamic program, we again think about how many operations we need to do. We build a matrix which has about \(n^2\) entries (where we assume \(m \approx n\)). Each entry requires a simple comparison of two characters and then either an addition of 1 or fetching another value. So, we really only have roughly \(n^2\) operations to achieve the result, meaning the dynamic programming algorithm has a time complexity of \(n^2\). Let’s compare how long things will take with the respective algorithms as a function of the size of the problem, \(n\).
[14]:
n = np.arange(2, 1001)

# Use floats for 2**n to avoid integer overflow at large n
expon_complexity = 2.0 ** n
polynomial_complexity = n ** 2

p = bokeh.plotting.figure(
    x_axis_label="n",
    y_axis_label="time to compute (a.u.)",
    x_axis_type="log",
    y_axis_type="log",
    frame_height=225,
    frame_width=325,
    x_range=[n.min(), n.max()],
    y_range=[0.1, 1e11],
)

p.line(n, expon_complexity, line_width=2, legend_label="naive")
p.line(
    n,
    polynomial_complexity,
    line_width=2,
    color="orange",
    legend_label="dynamic program",
)
p.legend.location = "bottom_right"

bokeh.io.show(p)
We have no prayer of computing the LCS of large sequences using the naive approach. The dynamic program makes the problem feasible.
Getting the LCS
Our dynamic program is really fast for finding the length of the LCS, but what about the LCS itself? In many dynamic programs, we can do backtracking, where we go back through the completed calculation and extract information. We can do the same with the LCS algorithm. We start at the bottom right of lcs_mat and work backwards, adding a character whenever one was matched in the LCS length algorithm.
[15]:
def dp_lcslen(seq1, seq2, return_lcs=False):
    """Length of LCS using a dynamic program."""
    # Initialize matrix of substring LCS lengths
    lcs_mat = np.zeros((len(seq1) + 1, len(seq2) + 1), dtype=int)

    # Build matrix, left to right, row by row
    for i in range(1, len(seq1) + 1):
        for j in range(1, len(seq2) + 1):
            if seq1[i - 1] == seq2[j - 1]:
                lcs_mat[i, j] = lcs_mat[i - 1, j - 1] + 1
            else:
                lcs_mat[i, j] = max(lcs_mat[i - 1, j], lcs_mat[i, j - 1])

    if return_lcs:
        # Backtrack from the bottom right corner of the matrix
        lcs = ""
        i = len(seq1)
        j = len(seq2)
        while i != 0 and j != 0:
            if lcs_mat[i, j] == lcs_mat[i - 1, j]:
                i -= 1
            elif lcs_mat[i, j] == lcs_mat[i, j - 1]:
                j -= 1
            else:
                lcs = seq1[i - 1] + lcs
                i -= 1
                j -= 1

        return lcs_mat[-1, -1], lcs
    else:
        return lcs_mat[-1, -1]
Let’s give it a whirl!
[16]:
dp_lcslen(seq1, seq2, True)
[16]:
(5, 'SRAQH')
The dynamic program only returns one of the LCSs, since we did not enumerate them all.
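We can check that the single LCS from the backtrack is among the full set that the naive implementation enumerated (a quick consistency check; it re-runs the slow naive algorithm):

# The backtracked LCS must be one of the LCSs found
# by the exhaustive naive search
_, all_lcs = naive_lcslen(seq1, seq2, return_lcs=True)
dp_lcslen(seq1, seq2, return_lcs=True)[1] in all_lcs    # True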
Examples of dynamic programs
Dynamic programming, wherein we build up a solution to a big problem from smaller subproblems, is pervasive in biology. Here are some examples of problems that use dynamic programming.

- Global alignment (Needleman-Wunsch)
- Local alignment (Smith-Waterman)
- Some multiple sequence alignment algorithms (though most use other methods)
- RNA structure prediction
- Structural alignment (such as SSAP)
A word of warning about algorithmic development
In the LCS example, we found that practical calculations of the LCS length were impossible with our first algorithm because of time and storage limitations. However, the algorithm works. In general, when you are writing code, first get something that works, that is, something that passes your unit tests and does what you want it to do. Then decide if it needs optimization. It is quite possible that a “naive” algorithm might be just fine for your purposes. If this is the case, don’t optimize. I think it’s important to remember what Donald Knuth, one of the great computer scientists of all time, said.
Premature optimization is the root of all evil.
The same goes for the language you write your code in. Coding something up in Python or another high-level interpreted language is typically much easier than in a compiled language like C or Fortran, and your code is more portable. Only when you really need the speed should you reach for one of those languages.
Computing environment
[17]:
%load_ext watermark
%watermark -v -p numpy,bokeh,jupyterlab
Python implementation: CPython
Python version : 3.8.10
IPython version : 7.22.0
numpy : 1.20.2
bokeh : 2.3.2
jupyterlab: 3.0.14