{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 4.1: Parsing a FASTA file\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are packages, like [Biopython](http://biopython.org/) and [scikit-bio](http://scikit-bio.org) for processing files you encounter in bioinformatics. In this problem, though, we will work on our file I/O skills. \n", "\n", "**a)** Use command line tools to investigate the [FASTA file](https://en.wikipedia.org/wiki/FASTA_format) located at `~git/bootcamp/data/salmonella_spi1_region.fna`. This file contains a portion of the _Salmonella_ genome.\n", "\n", "You will notice that the first line begins with a `>`, signifying that the line contains information about the sequence. The remainder of the lines are the sequence itself.\n", "\n", "**b)** The format of the _Salmonella_ SPI1 region FASTA file is a common format for such files (though oftentimes FASTA files contain multiple sequences). Use the file I/O skills you have learned to write a function to read in a sequence from a FASTA file containing a single sequence (but possibly having the first line in the file beginning with `>`). Your function should take as input the name of the FASTA file and return two strings. First, it should return the descriptor string (which starts with `>`). Second, it should return a string with no gaps containing the sequence.\n", "\n", "Test your function on the _Salmonella_ sequence." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution\n", "\n", "**a)** A quick look using `less` indicates that this is a valid FASTA file. I did a quick check to see how many sequences there are by searching for the `>` symbol using `grep`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">gi|821161554|gb|CP011428.1| Salmonella enterica subsp. enterica strain YU39, complete genome, subsequence 3000000 to 3200000\n" ] } ], "source": [ "!grep \">\" data/salmonella_spi1_region.fna" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do the same for the λ DNA file we will use in [Exercise 4.2](exercise_4.2.ipynb)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">Lambda_NEB\n" ] } ], "source": [ "!grep \">\" data/lambdaDNA.fasta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that in both there is a single record. So, when we read it in, we just need to skip the first line and read on until we get the full record." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**b)** We will read in the files line by line, saving the first line as the descriptor, and then building the sequence as we go along." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def read_fasta(filename):\n", " \"\"\"Read a sequence in from a FASTA file containing a single sequence.\n", " \n", " We assume that the first line of the file is the descriptor and all\n", " subsequent lines are sequence. \n", " \"\"\"\n", " with open(filename, 'r') as f:\n", " # Read in descriptor\n", " descriptor = f.readline().rstrip()\n", "\n", " # Read in sequence, stripping the whitespace from each line\n", " seq = ''\n", " line = f.readline().rstrip()\n", " while line != '':\n", " seq += line\n", " line = f.readline().rstrip()\n", " \n", " return descriptor, seq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, when writing this function, we should be taking a test-driven development approach (and we will talk about TDD in the last week of the bootcamp), but here we are focusing on our file I/O and string handling skills.\n", "\n", "Let's take the function for a drive!" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'AAAACCTTAGTAACTGGACTGCTGGGATTTTTCAGCCTGGATACGCTGGTAGATCTCTTCACGATGGACAGAAACTTCTTTCGGGGCGTTCACGCCAATACGCACCTGGTTGCCCTTCACCCCTAAAACTGTCACGGTGACCTCATCGCCAATCATGAGGGTCTCACCAACTCGACGAGTCAGAATCAGCATTCTTTGCTCCTTGAAAGATTAAAAGAGTCGGGTCTCTCTGTATCCCGGCATTATCCATCATATAACGCCAAAAAGTAAGCGATGACAAACACCTTAGGTGTAAGCAGTCATGGCATTACATTCTGTTAAACCTAAGTTTAGCCGATATACAAAACTTCAACCTGACTTTATCGTTGTCGATAGCGTTGACGTAAACGCCGCAGCACGGGCTGCGGCGCCAACGAACGCTTATAATTATTGCAATTTTGCGCTGACCCAGCCTTGTACACTGGCTAACGCTGCAGGCAGAGCTGCCGCATCCGTACCAC'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "descriptor, seq = read_fasta('data/salmonella_spi1_region.fna')\n", "\n", "# Take a look at the first 500 bases to make sure we got it right.\n", "seq[:500]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks good!\n", "\n", "And for the λ-DNA...." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTCGTCATAACTTAATGTTTTTATTTAAAATACCCTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCGAGGCTTTTTGGCCTCTGTCGTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAACAAAAAGCAGCTGGCTGACATTTTCGGTGCGAGTATCCGTACCATTCAGAACTGGCAGGAACAGGGAATGCCCGTTCTGCGAGGCGGTGGCAAGGGTAATGAGGTGCTTTATGACTCTGCCGCCGTCATAAAATGGTATGCCGAAAGGGATGCTGAAATTGAGAACGAAAAGCTGCGCCGGGAGGTTGAAGAACTGCGGCAGGCCAGCGAGGCAGATCTCCAGCCAGGAACTATTGAGTACGAACGCCATCGACTTACGCGTGCGCAGGCCGACGCACAGGAACTGAAGAATGCCAGAG'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "descriptor_lambda, seq_lambda = read_fasta('data/lambdaDNA.fasta')\n", "\n", "# Take a look at the first 500 bases to make sure we got it right.\n", "seq_lambda[:500]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing environment" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPython 3.7.7\n", "IPython 7.16.1\n", "\n", "jupyterlab 2.1.5\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -v -p jupyterlab" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }