Exercise 2.2: Longest common substring


Write a function that takes two sequences and returns the longest common substring. A substring is a contiguous portion of a string. For example:

Substrings of ATGCATAT:

TGCA
T
TAT

Not substrings of ATGCATAT:

AGCA              # Skipped T
CCATA             # Added another C
Hello, world.     # Has nothing to do with the input sequence

There may be more than one longest common substring; you only need to return one of them.

The call signature of the function should be

longest_common_substring(s1, s2)

Here are some return values you should get.

Function call

Result

longest_common_substring('ATGC', 'ATGCA')

'ATGC'

longest_common_substring('GATGCCATGCA', 'ATGCC')

'ATGCC'

longest_common_substring('ACGTGGAAAGCCA', 'GTACACACGTTTTGAGAGACAC')

'ACGT'

Solution


This is actually an important problem, and there are clever algorithms to solve it. We will take a more brute force approach. Let \(n\) be the length of the shorter of the two strings. We will start with the entirety of the shorter string and see if it is in the longer. We then will take both substrings of length \(n - 1\) in the shorter string and check to see if they are in the longer string. We then take all three substrings of length \(n - 2\) and see if they are in the longer string. We continue like this until we get a hit, which will necessarily be one of the longest common substrings.

[1]:
def longest_common_substring(s1, s2):
    """Return one of the longest common substrings"""
    # Make sure s1 is the shorter
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    # Start with the entire sequence and shorten
    substr_len = len(s1)
    while substr_len > 0:
        # Try all substrings
        for i in range(len(s1) - substr_len + 1):
            if s1[i:i+substr_len] in s2:
                return s1[i:i+substr_len]

        substr_len -= 1

    # If we haven't returned, there is no common substring
    return ''

Let’s try our function out.

[2]:
longest_common_substring('ACGTGGAAAGCCA', 'GTACACACGTTTTGAGAGACAC')
[2]:
'ACGT'

Computing environment

[3]:
%load_ext watermark
%watermark -v -p jupyterlab
Python implementation: CPython
Python version       : 3.8.10
IPython version      : 7.22.0

jupyterlab: 3.0.14