Exercise 1.5: Longest common substring
Write a function that takes two sequences and returns the longest common substring. A substring is a contiguous portion of a string. For example:
Substrings of ATGCATAT
:
TGCA
T
TAT
Not substrings of ATGCATAT
:
AGCA # Skipped T
CCATA # Added another C
Hello, world. # Has nothing to do with the input sequence
There may be more than one longest common substring; you only need to return one of them.
The call signature of the function should be
longest_common_substring(s1, s2)
Here are some return values you should get.
Function call |
Result |
---|---|
|
|
|
|
|
|
Solution
This is actually an important problem, and there are clever algorithms to solve it. We will take a more brute force approach. Let \(n\) be the length of the shorter of the two strings. We will start with the entirety of the shorter string and see if it is in the longer. We then will take both substrings of length \(n - 1\) in the shorter string and check to see if they are in the longer string. We then take all three substrings of length \(n - 2\) and see if they are in the longer string. We continue like this until we get a hit, which will necessarily be one of the longest common substrings.
[1]:
def longest_common_substring(s1, s2):
"""Return one of the longest common substrings"""
# Make sure s1 is the shorter
if len(s1) > len(s2):
s1, s2 = s2, s1
# Start with the entire sequence and shorten
substr_len = len(s1)
while substr_len > 0:
# Try all substrings
for i in range(len(s1) - substr_len + 1):
if s1[i:i+substr_len] in s2:
return s1[i:i+substr_len]
substr_len -= 1
# If we haven't returned, there is no common substring
return ''
Let’s try our function out.
[2]:
longest_common_substring('ACGTGGAAAGCCA', 'GTACACACGTTTTGAGAGACAC')
[2]:
'ACGT'
Computing environment
[3]:
%load_ext watermark
%watermark -v -p jupyterlab
Python implementation: CPython
Python version : 3.9.12
IPython version : 8.3.0
jupyterlab: 3.3.2