Exercise 2.1: Parsing a FASTA file


There are packages, like Biopython and scikit-bio, for processing the files you encounter in bioinformatics. In this problem, though, we will work on our file I/O skills.

a) Use command-line tools to investigate the FASTA file located at ~/git/bootcamp/data/salmonella_spi1_region.fna. This file contains a portion of the Salmonella genome (described in Exercise 4.1).

You will notice that the first line begins with a >, signifying that the line contains information about the sequence. The remainder of the lines are the sequence itself.

b) The Salmonella SPI1 region file follows a common FASTA layout, though FASTA files often contain multiple sequences. Using the file I/O skills you have learned, write a function that reads a sequence from a FASTA file containing a single sequence, where the first line may or may not be a descriptor line beginning with >. Your function should take the name of the FASTA file as input and return two strings: first, the descriptor string (which starts with >); second, the sequence as a single string with no gaps (that is, no newlines or other whitespace).

Test your function on the Salmonella sequence.
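
One possible approach to part (b) is sketched below, as a starting point rather than a reference solution. The function name read_fasta, the choice to return an empty descriptor when the file has no > line, and the toy file contents are all assumptions made for illustration. The toy record is made up and is not taken from the Salmonella file.

```python
import os
import tempfile


def read_fasta(filename):
    """Read a single-sequence FASTA file.

    Returns a (descriptor, sequence) tuple of strings. The descriptor is
    an empty string if no line starts with '>' (an assumption; the
    exercise does not say what to return in that case).
    """
    descriptor = ""
    seq_chunks = []

    with open(filename, "r") as f:
        for line in f:
            line = line.strip()  # drop the trailing newline and stray spaces
            if line.startswith(">"):
                # Descriptor line; assumed to appear at most once
                descriptor = line
            elif line:
                # Non-empty sequence line
                seq_chunks.append(line)

    # Joining once at the end avoids repeated string concatenation
    return descriptor, "".join(seq_chunks)


# Try it on a small, made-up FASTA file written to a temporary directory
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy.fna")
    with open(path, "w") as f:
        f.write(">toy_sequence made-up example\nATGCGT\nTTAGCA\nGG\n")

    descriptor, seq = read_fasta(path)

print(descriptor)
print(seq)
```

To test on the Salmonella sequence, the same function can be called with the path from part (a). Note that Python's open() does not expand ~, so the path should first be passed through os.path.expanduser().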