Exercise 3.1: Parsing a FASTA file


There are packages, like Biopython and scikit-bio for processing files you encounter in bioinformatics. In this problem, though, we will work on our file I/O skills.

a) Use command line tools to investigate the FASTA file located at ~git/bootcamp/data/salmonella_spi1_region.fna. This file contains a portion of the Salmonella genome (described in Exercise 4.1).

You will notice that the first line begins with a >, signifying that the line contains information about the sequence. The remainder of the lines are the sequence itself.

b) The format of the Salmonella SPI1 region FASTA file is a common format for such files (though oftentimes FASTA files contain multiple sequences). Use the file I/O skills you have learned to write a function to read in a sequence from a FASTA file containing a single sequence (but possibly having the first line in the file beginning with >). Your function should take as input the name of the FASTA file and return two strings. First, it should return the descriptor string (which starts with >). Second, it should return a string with no gaps containing the sequence.

Test your function on the Salmonella sequence.

Solution

a) A quick look using less indicates that this is a valid FASTA file. I did a quick check to see how many sequences there are by searching for the > symbol using grep.

[1]:
!grep ">" data/salmonella_spi1_region.fna
>gi|821161554|gb|CP011428.1| Salmonella enterica subsp. enterica strain YU39, complete genome, subsequence 3000000 to 3200000

We can do the same for the λ DNA file we will use in Exercise 4.2.

[2]:
!grep ">" data/lambdaDNA.fasta
>Lambda_NEB

We see that in both there is a single record. So, when we read it in, we just need to skip the first line and read on until we get the full record.

b) We will read in the files line by line, saving the first line as the descriptor, and then building the sequence as we go along.

[3]:
def read_fasta(filename):
    """Read a sequence in from a FASTA file containing a single sequence.

    We assume that the first line of the file is the descriptor and all
    subsequent lines are sequence.
    """
    with open(filename, 'r') as f:
        # Read in descriptor
        descriptor = f.readline().rstrip()

        # Read in sequence, stripping the whitespace from each line
        seq = ''
        line = f.readline().rstrip()
        while line != '':
            seq += line
            line = f.readline().rstrip()

    return descriptor, seq

Of course, when writing this function, we should be taking a test-driven development approach, but here we are focusing on our file I/O and string handling skills.

Let’s take the function for a drive!

[4]:
descriptor, seq = read_fasta('data/salmonella_spi1_region.fna')

# Take a look at the first 500 bases to make sure we got it right.
seq[:500]
[4]:
'AAAACCTTAGTAACTGGACTGCTGGGATTTTTCAGCCTGGATACGCTGGTAGATCTCTTCACGATGGACAGAAACTTCTTTCGGGGCGTTCACGCCAATACGCACCTGGTTGCCCTTCACCCCTAAAACTGTCACGGTGACCTCATCGCCAATCATGAGGGTCTCACCAACTCGACGAGTCAGAATCAGCATTCTTTGCTCCTTGAAAGATTAAAAGAGTCGGGTCTCTCTGTATCCCGGCATTATCCATCATATAACGCCAAAAAGTAAGCGATGACAAACACCTTAGGTGTAAGCAGTCATGGCATTACATTCTGTTAAACCTAAGTTTAGCCGATATACAAAACTTCAACCTGACTTTATCGTTGTCGATAGCGTTGACGTAAACGCCGCAGCACGGGCTGCGGCGCCAACGAACGCTTATAATTATTGCAATTTTGCGCTGACCCAGCCTTGTACACTGGCTAACGCTGCAGGCAGAGCTGCCGCATCCGTACCAC'

Looks good!

And for the λ-DNA….

[5]:
descriptor_lambda, seq_lambda = read_fasta('data/lambdaDNA.fasta')

# Take a look at the first 500 bases to make sure we got it right.
seq_lambda[:500]
[5]:
'GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTCGTCATAACTTAATGTTTTTATTTAAAATACCCTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCGAGGCTTTTTGGCCTCTGTCGTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAACAAAAAGCAGCTGGCTGACATTTTCGGTGCGAGTATCCGTACCATTCAGAACTGGCAGGAACAGGGAATGCCCGTTCTGCGAGGCGGTGGCAAGGGTAATGAGGTGCTTTATGACTCTGCCGCCGTCATAAAATGGTATGCCGAAAGGGATGCTGAAATTGAGAACGAAAAGCTGCGCCGGGAGGTTGAAGAACTGCGGCAGGCCAGCGAGGCAGATCTCCAGCCAGGAACTATTGAGTACGAACGCCATCGACTTACGCGTGCGCAGGCCGACGCACAGGAACTGAAGAATGCCAGAG'

Computing environment

[6]:
%load_ext watermark
%watermark -v -p jupyterlab
CPython 3.7.7
IPython 7.13.0

jupyterlab 1.2.6