{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 4.2: Restriction enzyme cut sites\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**[Restriction enzymes](https://en.wikipedia.org/wiki/Restriction_enzyme)** cut DNA at specific locations called **restriction sites**. The sequence at a restriction site is called a **recognition sequence**. Here are the recognition sequences of some commonly used restriction enzymes.\n",
    "\n",
    "|Restriction enzyme | Recognition sequence|\n",
    "|:----|:----|\n",
    "|[HindIII](https://en.wikipedia.org/wiki/HindIII) | `AAGCTT` |\n",
    "|[EcoRI](https://en.wikipedia.org/wiki/EcoRI)| `GAATTC` |\n",
    "|KpnI| `GGTACC` |\n",
    "\n",
    "\n",
    "**a)** [New England Biosystems](https://www.neb.com/products/n3011-lambda-dna#Product%20Information) sells purified DNA of the genome of λ-phage, a bacteriophage that infect _E. coli_. You can download the FASTA file containing the sequence [here](https://www.neb.com/-/media/nebus/page-images/tools-and-resources/interactive-tools/dna-sequences-and-maps/text-documents/lambdafsa.txt). Use the function you wrote in [Exercise 3.1](exercise_4.1.ipynb) to extract the sequence.\n",
    "\n",
    "**b)** Write a function with call signature\n",
    "\n",
    "```python\n",
    "restriction_sites(seq, recoq_seq)\n",
    "```\n",
    "\n",
    "that takes as arguments a sequence and the recognition sequence of a restriction enzyme sites and returns the indices of the first base or each of the restriction sites in the sequence. Use this function to find the indices of the restriction sites of λ-DNA for HindIII, EcoRI, and KpnI. Compare your results to those reported on the [New England Biosystems datasheet](https://www.neb.com/-/media/nebus/page-images/tools-and-resources/interactive-tools/dna-sequences-and-maps/lambda_sites.pdf)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Solution\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**a)** I have downloaded the FASTA file to `data/lambdaDNA.fasta`. Let's load it in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_fasta(filename):\n",
    "    \"\"\"Read a sequence in from a FASTA file containing a single sequence.\n",
    "    \n",
    "    We assume that the first line of the file is the descriptor and all\n",
    "    subsequent lines are sequence.    \n",
    "    \"\"\"\n",
    "    with open(filename, 'r') as f:\n",
    "        # Read in descriptor\n",
    "        descriptor = f.readline().rstrip()\n",
    "\n",
    "        # Read in sequence, stripping the whitespace from each line\n",
    "        seq = ''\n",
    "        line = f.readline().rstrip()\n",
    "        while line != '':\n",
    "            seq += line\n",
    "            line = f.readline().rstrip()\n",
    "            \n",
    "    return descriptor, seq\n",
    "\n",
    "\n",
    "_, seq = read_fasta('data/lambdaDNA.fasta')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can check to make sure we got the whole sequence, which should be about 48.5 kbp."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "48502"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(seq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looks good!\n",
    "\n",
    "**b)** Our goal is to find all locations of the substring given by the recognition sequence in the genome. This is most easily accomplished using the `re` module that uses **[regular expressions](https://en.wikipedia.org/wiki/Regular_expression)**, which we will not cover in the bootcamp. Specifically, we use `re.finditer()` to automatically find all occurrences of the recognition sequence in the sequence, and then use the `start()` method to get the first index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def restriction_sites_with_re(seq, recog_seq):\n",
    "    \"\"\"Find the indices of all restriction sites in a sequence.\"\"\"\n",
    "    sites = []\n",
    "    for site in re.finditer(recog_seq, seq):\n",
    "        sites.append(site.start())\n",
    "        \n",
    "    return sites"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check our restriction sites against the know sites."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "HindIII: [23129, 25156, 27478, 36894, 37458, 37583, 44140]\n",
      "EcoRI:   [21225, 26103, 31746, 39167, 44971]\n",
      "KpnI:    [17052, 18555]\n"
     ]
    }
   ],
   "source": [
    "print('HindIII:', restriction_sites_with_re(seq, 'AAGCTT'))\n",
    "print('EcoRI:  ', restriction_sites_with_re(seq, 'GAATTC'))\n",
    "print('KpnI:   ', restriction_sites_with_re(seq, 'GGTACC'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Mostly, these match exactly the listed values, except each index reported by our function is one less than the listed values because we are indexing starting at zero. The one example that is different from the data sheet is that the data sheet does not contain the cut site at index 37583. This could be because NEB made a mistake either in the sequencing or in determining their cut sites, or because they validated the cut sites experimentally, and for some reason there is some strange structure around that cut site that precludes cutting. It could also be that because there is only a short fragment of DNA between the cut site at 37458 and the one they missed at 37583 that is it missed in a gel electrophoresis assay.\n",
    "\n",
    "I will now write a function do to the same calculation without using regular expressions. We will simply make a pass through the sequence and store the indices where we match the recognition sequence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def restriction_sites(seq, recog_seq):\n",
    "    \"\"\"Find the indices of all restriction sites in a sequence.\"\"\"\n",
    "    # Initialize list of restriction sites\n",
    "    sites = []\n",
    "\n",
    "    # Check every substring for a match\n",
    "    for i in range(len(seq) - len(recog_seq)):\n",
    "        if seq[i:i+len(recog_seq)] == recog_seq:\n",
    "            sites.append(i)\n",
    "        \n",
    "    return sites"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And let's use this function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "HindIII: [23129, 25156, 27478, 36894, 37458, 37583, 44140]\n",
      "EcoRI:   [21225, 26103, 31746, 39167, 44971]\n",
      "KpnI:    [17052, 18555]\n"
     ]
    }
   ],
   "source": [
    "print('HindIII:', restriction_sites(seq, 'AAGCTT'))\n",
    "print('EcoRI:  ', restriction_sites(seq, 'GAATTC'))\n",
    "print('KpnI:   ', restriction_sites(seq, 'GGTACC'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For fun, we can compare the speeds of the two functions using the magic function `%timeit`. This function performs a calculation many times and computes a mean execution time. It can be used to check speed of your functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9.03 ms ± 44.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n",
      "165 µs ± 2.75 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit restriction_sites(seq, 'AAGCTT')\n",
    "%timeit restriction_sites_with_re(seq, 'AAGCTT')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The regular expression-based method is nearly 100 times faster! This is often the case; hand-coding something in pure Python can be slow compared to using packages that use pre-compiled code like `re`. However, as [Donald Knuth](https://en.wikipedia.org/wiki/Donald_Knuth) said, \"The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; **premature optimization is the root of all evil (or at least most of it) in programming.**\" Get your code working, even if it is slow, and optimize only if you have to."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPython 3.7.7\n",
      "IPython 7.16.1\n",
      "\n",
      "re 2.2.1\n",
      "jupyterlab 2.1.5\n"
     ]
    }
   ],
   "source": [
    "%load_ext watermark\n",
    "%watermark -v -p re,jupyterlab"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}