{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 3.2: Longest common substring\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write a function that takes two sequences and returns the longest common substring. A substring is a contiguous portion of a string. For example:\n", "\n", "Substrings of `ATGCATAT`:\n", "\n", " TGCA\n", " T\n", " TAT\n", " \n", "Not substrings of `ATGCATAT`:\n", "\n", " AGCA # Skipped T\n", " CCATA # Added another C\n", " Hello, world. # Has nothing to do with the input sequence\n", " \n", "There may be more than one longest common substring; you only need to return one of them.\n", "\n", "The call signature of the function should be\n", "\n", "```python\n", "longest_common_substring(s1, s2)\n", "```\n", "\n", "Here are some return values you should get.\n", "\n", "|Function call|Result |\n", "|:---|---:|\n", "|`longest_common_substring('ATGC', 'ATGCA')` | `'ATGC'`|\n", "|`longest_common_substring('GATGCCATGCA', 'ATGCC')` | `'ATGCC'`|\n", "|`longest_common_substring('ACGTGGAAAGCCA', 'GTACACACGTTTTGAGAGACAC') `|`'ACGT'`|" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution\n", "\n", "This is actually an [important problem](https://en.wikipedia.org/wiki/Longest_common_substring_problem), and there are clever algorithms to solve it. We will take a more brute force approach. Let $n$ be the length of the shorter of the two strings. We will start with the entirety of the shorter string and see if it is in the longer. We then will take both substrings of length $n - 1$ in the shorter string and check to see if they are in the longer string. We then take all three substrings of length $n - 2$ and see if they are in the longer string. We continue like this until we get a hit, which will necessarily be one of the longest common substrings." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "def longest_common_substring(s1, s2):\n", " \"\"\"Return one of the longest common substrings\"\"\"\n", " # Make sure s1 is the shorter\n", " if len(s1) > len(s2):\n", " s1, s2 = s2, s1\n", " \n", " # Start with the entire sequence and shorten\n", " substr_len = len(s1)\n", " while substr_len > 0: \n", " # Try all substrings\n", " for i in range(len(s1) - substr_len + 1):\n", " if s1[i:i+substr_len] in s2:\n", " return s1[i:i+substr_len]\n", "\n", " substr_len -= 1\n", " \n", " # If we haven't returned, there is no common substring\n", " return ''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try our function out." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ACGT'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "longest_common_substring('ACGTGGAAAGCCA', 'GTACACACGTTTTGAGAGACAC')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing environment" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPython 3.7.7\n", "IPython 7.13.0\n", "\n", "jupyterlab 1.2.6\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -v -p jupyterlab" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }