{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 2.2: Restriction enzyme cut sites\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**[Restriction enzymes](https://en.wikipedia.org/wiki/Restriction_enzyme)** cut DNA at specific locations called **restriction sites**. The sequence at a restriction site is called a **recognition sequence**. Here are the recognition sequences of some commonly used restriction enzymes.\n",
    "\n",
    "|Restriction enzyme | Recognition sequence|\n",
    "|:----|:----|\n",
    "|[HindIII](https://en.wikipedia.org/wiki/HindIII) | `AAGCTT` |\n",
    "|[EcoRI](https://en.wikipedia.org/wiki/EcoRI)| `GAATTC` |\n",
    "|KpnI| `GGTACC` |\n",
    "\n",
    "\n",
    "**a)** [New England Biosystems](https://www.neb.com/products/n3011-lambda-dna#Product%20Information) sells purified DNA of the genome of λ-phage, a bacteriophage that infect _E. coli_. You can download the FASTA file containing the sequence [here](https://www.neb.com/-/media/nebus/page-images/tools-and-resources/interactive-tools/dna-sequences-and-maps/text-documents/lambdafsa.txt). Use the function you wrote in [Exercise 2.1](exercise_2.1.ipynb) to extract the sequence.\n",
    "\n",
    "**b)** Write a function with call signature\n",
    "\n",
    "```python\n",
    "restriction_sites(seq, recoq_seq)\n",
    "```\n",
    "\n",
    "that takes as arguments a sequence and the recognition sequence of a restriction enzyme sites and returns the indices of the first base or each of the restriction sites in the sequence. Use this function to find the indices of the restriction sites of λ-DNA for HindIII, EcoRI, and KpnI. Compare your results with those given [here](https://www.bioinformatics.nl/molbi/SCLResources/LambdaBE_restrct_alphab.htm), which contain a comprehensive list of locations of restriction sites for a variety of enzymes.\n",
    "\n",
    "<!--Compare your results to those reported on the [New England Biosystems datasheet](https://www.neb.com/-/media/nebus/page-images/tools-and-resources/interactive-tools/dna-sequences-and-maps/lambda_sites.pdf).-->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Solution\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**a)** I have downloaded the FASTA file to `data/lambdaDNA.fasta`. Let's load it in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_fasta(filename):\n",
    "    \"\"\"Read a sequence in from a FASTA file containing a single sequence.\n",
    "    \n",
    "    We assume that the first line of the file is the descriptor and all\n",
    "    subsequent lines are sequence.    \n",
    "    \"\"\"\n",
    "    with open(filename, 'r') as f:\n",
    "        # Read in descriptor\n",
    "        descriptor = f.readline().rstrip()\n",
    "\n",
    "        # Read in sequence, stripping the whitespace from each line\n",
    "        seq = ''\n",
    "        line = f.readline().rstrip()\n",
    "        while line != '':\n",
    "            seq += line\n",
    "            line = f.readline().rstrip()\n",
    "            \n",
    "    return descriptor, seq\n",
    "\n",
    "\n",
    "_, seq = read_fasta('data/lambdaDNA.fasta')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can check to make sure we got the whole sequence, which should be about 48.5 kbp."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "48502"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(seq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looks good!\n",
    "\n",
    "**b)** Our goal is to find all locations of the substring given by the recognition sequence in the genome. This is most easily accomplished using the `re` module that uses **[regular expressions](https://en.wikipedia.org/wiki/Regular_expression)**, which we will cover in an auxiliary lesson. Specifically, we use `re.finditer()` to automatically find all occurrences of the recognition sequence in the sequence, and then use the `start()` method to get the first index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def restriction_sites_with_re(seq, recog_seq):\n",
    "    \"\"\"Find the indices of all restriction sites in a sequence.\"\"\"\n",
    "    sites = []\n",
    "    for site in re.finditer(recog_seq, seq):\n",
    "        sites.append(site.start())\n",
    "        \n",
    "    return sites"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check our restriction sites against the know sites."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "HindIII: [23129, 25156, 27478, 36894, 37458, 37583, 44140]\n",
      "EcoRI:   [21225, 26103, 31746, 39167, 44971]\n",
      "KpnI:    [17052, 18555]\n"
     ]
    }
   ],
   "source": [
    "print('HindIII:', restriction_sites_with_re(seq, 'AAGCTT'))\n",
    "print('EcoRI:  ', restriction_sites_with_re(seq, 'GAATTC'))\n",
    "print('KpnI:   ', restriction_sites_with_re(seq, 'GGTACC'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Mostly, these match exactly the listed values, except each index reported by our function is one less than the listed values because we are indexing starting at zero. The one example that is different is that the file we are comparing to does not contain the cut site at index 37583. According to the [restriction map from NEB](https://www.neb.com/-/media/nebus/page-images/tools-and-resources/interactive-tools/dna-sequences-and-maps/lambda_map.pdf), NEB sells λ DNA that is a derivative of the strain λ cI857ind1 Sam7. This particular strain has a point mutation in the cI gene. Coincidentally, the point mutation results in an extra HindIII site at 37583. Note that this site is very close to the site at 37458, which means that you might not see the resulting short fragment of DNA between those two sites in a gel, since you might run that fragment off the gel.\n",
    "\n",
    "I will now write a function do to the same calculation without using regular expressions. We will simply make a pass through the sequence and store the indices where we match the recognition sequence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def restriction_sites(seq, recog_seq):\n",
    "    \"\"\"Find the indices of all restriction sites in a sequence.\"\"\"\n",
    "    # Initialize list of restriction sites\n",
    "    sites = []\n",
    "\n",
    "    # Check every substring for a match\n",
    "    for i in range(len(seq) - len(recog_seq)):\n",
    "        if seq[i:i+len(recog_seq)] == recog_seq:\n",
    "            sites.append(i)\n",
    "        \n",
    "    return sites"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And let's use this function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "HindIII: [23129, 25156, 27478, 36894, 37458, 37583, 44140]\n",
      "EcoRI:   [21225, 26103, 31746, 39167, 44971]\n",
      "KpnI:    [17052, 18555]\n"
     ]
    }
   ],
   "source": [
    "print('HindIII:', restriction_sites(seq, 'AAGCTT'))\n",
    "print('EcoRI:  ', restriction_sites(seq, 'GAATTC'))\n",
    "print('KpnI:   ', restriction_sites(seq, 'GGTACC'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For fun, we can compare the speeds of the two functions using the magic function `%timeit`. This function performs a calculation many times and computes a mean execution time. It can be used to check speed of your functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9.44 ms ± 39.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n",
      "146 µs ± 235 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit restriction_sites(seq, 'AAGCTT')\n",
    "%timeit restriction_sites_with_re(seq, 'AAGCTT')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The regular expression-based method is nearly 100 times faster! This is often the case; hand-coding something in pure Python can be slow compared to using packages that use pre-compiled code like `re`. However, as [Donald Knuth](https://en.wikipedia.org/wiki/Donald_Knuth) said, \"The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; **premature optimization is the root of all evil (or at least most of it) in programming.**\" Get your code working, even if it is slow, and optimize only if you have to."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Python implementation: CPython\n",
      "Python version       : 3.9.12\n",
      "IPython version      : 8.3.0\n",
      "\n",
      "re        : 2.2.1\n",
      "jupyterlab: 3.3.2\n",
      "\n"
     ]
    }
   ],
   "source": [
    "%load_ext watermark\n",
    "%watermark -v -p re,jupyterlab"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}