{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Exercise 3.1: Parsing a FASTA file\n", "\n", "
"]}, {"cell_type": "markdown", "metadata": {}, "source": ["There are packages, like [Biopython](http://biopython.org/) and [scikit-bio](http://scikit-bio.org) for processing files you encounter in bioinformatics. In this problem, though, we will work on our file I/O skills. \n", "\n", "**a)** Use command line tools to investigate the [FASTA file](https://en.wikipedia.org/wiki/FASTA_format) located at `~git/bootcamp/data/salmonella_spi1_region.fna`. This file contains a portion of the _Salmonella_ genome (described in [Exercise 4.1](exercise_4.1.ipynb)).\n", "\n", "You will notice that the first line begins with a `>`, signifying that the line contains information about the sequence. The remainder of the lines are the sequence itself.\n", "\n", "**b)** The format of the _Salmonella_ SPI1 region FASTA file is a common format for such files (though oftentimes FASTA files contain multiple sequences). Use the file I/O skills you have learned to write a function to read in a sequence from a FASTA file containing a single sequence (but possibly having the first line in the file beginning with `>`). Your function should take as input the name of the FASTA file and return two strings. First, it should return the descriptor string (which starts with `>`). Second, it should return a string with no gaps containing the sequence.\n", "\n", "Test your function on the _Salmonella_ sequence."]}, {"cell_type": "markdown", "metadata": {}, "source": ["
"]}], "metadata": {"anaconda-cloud": {}, "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7"}}, "nbformat": 4, "nbformat_minor": 4}