{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise 4.1: Parsing a FASTA file\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are packages, like [Biopython](http://biopython.org/) and [scikit-bio](http://scikit-bio.org) for processing files you encounter in bioinformatics. In this problem, though, we will work on our file I/O skills.  \n",
    "\n",
    "**a)** Use command line tools to investigate the [FASTA file](https://en.wikipedia.org/wiki/FASTA_format) located at `~git/bootcamp/data/salmonella_spi1_region.fna`. This file contains a portion of the _Salmonella_ genome.\n",
    "\n",
    "You will notice that the first line begins with a `>`, signifying that the line contains information about the sequence. The remainder of the lines are the sequence itself.\n",
    "\n",
    "**b)** The format of the _Salmonella_ SPI1 region FASTA file is a common format for such files (though oftentimes FASTA files contain multiple sequences). Use the file I/O skills you have learned to write a function to read in a sequence from a FASTA file containing a single sequence (but possibly having the first line in the file beginning with `>`). Your function should take as input the name of the FASTA file and return two strings. First, it should return the descriptor string (which starts with `>`). Second, it should return a string with no gaps containing the sequence.\n",
    "\n",
    "Test your function on the _Salmonella_ sequence."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<br />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Solution\n",
    "\n",
    "**a)** A quick look using `less` indicates that this is a valid FASTA file. I did a quick check to see how many sequences there are by searching for the `>` symbol using `grep`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">gi|821161554|gb|CP011428.1| Salmonella enterica subsp. enterica strain YU39, complete genome, subsequence 3000000 to 3200000\n"
     ]
    }
   ],
   "source": [
    "!grep \">\" data/salmonella_spi1_region.fna"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can do the same for the λ DNA file we will use in [Exercise 4.2](exercise_4.2.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">Lambda_NEB\n"
     ]
    }
   ],
   "source": [
    "!grep \">\" data/lambdaDNA.fasta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that in both there is a single record. So, when we read it in, we just need to skip the first line and read on until we get the full record."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**b)** We will read in the files line by line, saving the first line as the descriptor, and then building the sequence as we go along."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_fasta(filename):\n",
    "    \"\"\"Read a sequence in from a FASTA file containing a single sequence.\n",
    "    \n",
    "    We assume that the first line of the file is the descriptor and all\n",
    "    subsequent lines are sequence.    \n",
    "    \"\"\"\n",
    "    with open(filename, 'r') as f:\n",
    "        # Read in descriptor\n",
    "        descriptor = f.readline().rstrip()\n",
    "\n",
    "        # Read in sequence, stripping the whitespace from each line\n",
    "        seq = ''\n",
    "        line = f.readline().rstrip()\n",
    "        while line != '':\n",
    "            seq += line\n",
    "            line = f.readline().rstrip()\n",
    "            \n",
    "    return descriptor, seq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course, when writing this function, we should be taking a test-driven development approach (and we will talk about TDD in the last week of the bootcamp), but here we are focusing on our file I/O and string handling skills.\n",
    "\n",
    "Let's take the function for a drive!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'AAAACCTTAGTAACTGGACTGCTGGGATTTTTCAGCCTGGATACGCTGGTAGATCTCTTCACGATGGACAGAAACTTCTTTCGGGGCGTTCACGCCAATACGCACCTGGTTGCCCTTCACCCCTAAAACTGTCACGGTGACCTCATCGCCAATCATGAGGGTCTCACCAACTCGACGAGTCAGAATCAGCATTCTTTGCTCCTTGAAAGATTAAAAGAGTCGGGTCTCTCTGTATCCCGGCATTATCCATCATATAACGCCAAAAAGTAAGCGATGACAAACACCTTAGGTGTAAGCAGTCATGGCATTACATTCTGTTAAACCTAAGTTTAGCCGATATACAAAACTTCAACCTGACTTTATCGTTGTCGATAGCGTTGACGTAAACGCCGCAGCACGGGCTGCGGCGCCAACGAACGCTTATAATTATTGCAATTTTGCGCTGACCCAGCCTTGTACACTGGCTAACGCTGCAGGCAGAGCTGCCGCATCCGTACCAC'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "descriptor, seq = read_fasta('data/salmonella_spi1_region.fna')\n",
    "\n",
    "# Take a look at the first 500 bases to make sure we got it right.\n",
    "seq[:500]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looks good!\n",
    "\n",
    "And for the λ-DNA...."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTCGTCATAACTTAATGTTTTTATTTAAAATACCCTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCGAGGCTTTTTGGCCTCTGTCGTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAACAAAAAGCAGCTGGCTGACATTTTCGGTGCGAGTATCCGTACCATTCAGAACTGGCAGGAACAGGGAATGCCCGTTCTGCGAGGCGGTGGCAAGGGTAATGAGGTGCTTTATGACTCTGCCGCCGTCATAAAATGGTATGCCGAAAGGGATGCTGAAATTGAGAACGAAAAGCTGCGCCGGGAGGTTGAAGAACTGCGGCAGGCCAGCGAGGCAGATCTCCAGCCAGGAACTATTGAGTACGAACGCCATCGACTTACGCGTGCGCAGGCCGACGCACAGGAACTGAAGAATGCCAGAG'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "descriptor_lambda, seq_lambda = read_fasta('data/lambdaDNA.fasta')\n",
    "\n",
    "# Take a look at the first 500 bases to make sure we got it right.\n",
    "seq_lambda[:500]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPython 3.7.7\n",
      "IPython 7.16.1\n",
      "\n",
      "jupyterlab 2.1.5\n"
     ]
    }
   ],
   "source": [
    "%load_ext watermark\n",
    "%watermark -v -p jupyterlab"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}