Regular expressions

This tutorial was generated from a Jupyter notebook.

In [1]:
# to use regular expressions in Python, we need to import the re module
import re

In this tutorial, we will learn about very useful and pervasive text processing tools called regular expressions, a.k.a. RE, re, regex, regexp, or regex patterns. They are tools for specifying text-matching patterns. The desired patterns can then be quickly and automatically extracted from possibly large amounts of text. This is particularly useful in parsing files in bioinformatics.

Background

The underlying theory of regexes dates back to 1956. It was developed by Stephen Cole Kleene as part of theoretical computer science's attempt to describe regular languages. In the late 1960s and early 1970s regexes got broader attention as they could be used in the lexical analysis stage of compilers and in the pattern matching of text editors such as ed. The latter use is analogous to pressing Ctrl+F or Command+F today to search a document for text.

Regular expressions are in themselves a very specialized and compact programming language that can be employed straight away in a number of command line tools such as awk, sed, and grep. Higher level programming languages such as Perl and Python employ a slightly modified version. To use them in Python we need to import the re module.

Before we move on to see how regexes are used in Python let's first have a look at a simple example of regexes at work:

The file towels.txt contains some relevant information regarding towels. We will use grep on the command line to print lines that match a regular expression. The syntax for grep is:

grep "some search string" some_file.txt

(The name grep is inspired by the ed command g/re/p: global search/regular expression/print.)

First of all a quick look at the text file. For this we use cat. On my system the file is stored in the file ../data/towels.txt.

In [2]:
!cat '../data/towels.txt'
cat: ../data/towels.txt: No such file or directory

Since the text concerns towels we might be interested in getting every line that contains the term "towel".

In [4]:
!grep "towel" ../data/towels.txt
say on the subject of towels.
A towel, it says, is about the most massively useful thing
ravenous); you can wave your towel in emergencies as a
towel has immense psychological value.  For some reason, if
hiker has his towel with him, he will automatically assume
odds, win through, and still knows where his towel is is

Please note that this finds lines that contain the string "towel" as well as the string "towels".

A more interesting problem arises when we want to look up lines that say something about hitchhikers. A quick look at the text above shows us that Douglas Adams enjoyed his poetic liberty as far as the spelling of the word "hitchhiker" was concerned. We find the following versions: Hitch Hiker, hitchhiker, hitch hiker, and Hitchhiker.

Can regexes help us to find all the different spellings?

In [5]:
!grep "\(H\|h\)itch *\(H\|h\)iker" ../data/towels.txt
The Hitch Hiker's Guide to the Galaxy has a few things to
an interstellar hitchhiker can have.  Partly it has great
a strag (strag: non-hitch hiker) discovers that a hitch
Douglas Adams:  "The Hitchhiker's Guide to the Galaxy" 

The escape character \ tells grep to treat the following character differently; here it gives the parentheses and the pipe their special meaning. The pipe character | is a logical OR. Thus, the expression \(H\|h\) is interpreted as "H or h". The Kleene star, *, means that the preceding character occurs zero, one, or more times. In our example this means that the strings hitch and hiker are either separated by one or more spaces, or not separated at all.

Often it is possible to employ a number of different REs to the same end. In more advanced cases choosing the appropriate RE can have a significant impact on performance. In many other cases, however, the time gained by executing the perfect RE won't make up for the time invested in crafting it.

REs in Python

Python's re module enables use of REs within Python programs. This is useful for parsing strings. The standard library documentation is quite good and is a useful resource.

Unlike most other code we write in Python, REs are compiled into a series of bytecodes that are executed by a matching engine written in C, which makes them very fast. After we have compiled a regular expression, the resulting object has methods, such as match, that allow us to process strings with it.
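
As a minimal sketch of that workflow, we can compile a pattern once and then reuse its methods on several strings. The pattern and the sample strings here are made up for illustration and do not come from towels.txt.

```python
import re

# Compile once; the resulting pattern object can be reused many times
towel_regex = re.compile('towel')

# Hypothetical sample strings, for illustration only
samples = ['towel is useful', 'bring a towel', 'no match here']

# match() only succeeds if the pattern matches at the start of the string
starts_with_towel = [towel_regex.match(s) is not None for s in samples]

# search() scans the whole string for the first match
contains_towel = [towel_regex.search(s) is not None for s in samples]

print(starts_with_towel)
print(contains_towel)
```

Only the first sample starts with the pattern, while the first two contain it somewhere.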

Compiling REs

We will investigate the syntax of the re module, as usual, through example. In this case, we will use an RE to find instances of hitchhiker-like words in the text from Douglas Adams.

In [6]:
# Read file contents as a single string
with open('../data/towels.txt') as f:
    hh_string = f.read()
    
# Define the regex pattern
pattern = '.*[H|h]itch *[H|h]iker.*'
regex = re.compile(pattern)

# Get list: each item is string with line that has variant of hitchhiker in it
regex.findall(hh_string)
Out[6]:
["The Hitch Hiker's Guide to the Galaxy has a few things to",
 'an interstellar hitchhiker can have.  Partly it has great',
 'a strag (strag: non-hitch hiker) discovers that a hitch',
 'Douglas Adams:  "The Hitchhiker\'s Guide to the Galaxy" ']

We get a list of lines in the towels.txt file that have a hitchhiker-like word in them.

Notice that the syntax is a bit different than the regex we used with grep on the command line. First, let's discuss the pattern.

.*[H|h]itch *[H|h]iker.*

Importantly, we do not put the escape character, \, in front of the brackets or the asterisk. In Python's re syntax, the escape character is used instead when we want to match the following character literally, e.g., if we are actually looking for a bracket in the string.

In [7]:
# Define the regex pattern (a raw string keeps the backslash intact)
pattern = r'\['
regex = re.compile(pattern)

# Get list: each item is a literal bracket found in the string
regex.findall('Find the[ bracket.')
Out[7]:
['[']

Looking again at our .*[H|h]itch *[H|h]iker.* pattern.... The opening and closing .* mean that we do not care what comes before or after the hitchhiker-like expression in the line. The "*" in the middle of the expression means the same as in the command line case: arbitrarily many spaces (including zero) may be between hitch and hiker.

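We use [H|h] to match either upper or lowercase H. Strictly speaking, inside square brackets | is an ordinary character rather than a logical OR, so this set also matches a literal |; [Hh] is the tighter way to write it. This is in contrast to the regex we used with grep, which used parentheses. In Python's re module, parentheses serve to form groups. Let's see what happens if we use some parentheses.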

In [8]:
# Define the regex pattern
pattern = '(.*((H|h)itch *(H|h)iker).*)'
regex = re.compile(pattern)

# Get list: each item is a tuple with one entry per group
regex.findall(hh_string)
Out[8]:
[("The Hitch Hiker's Guide to the Galaxy has a few things to",
  'Hitch Hiker',
  'H',
  'H'),
 ('an interstellar hitchhiker can have.  Partly it has great',
  'hitchhiker',
  'h',
  'h'),
 ('a strag (strag: non-hitch hiker) discovers that a hitch',
  'hitch hiker',
  'h',
  'h'),
 ('Douglas Adams:  "The Hitchhiker\'s Guide to the Galaxy" ',
  'Hitchhiker',
  'H',
  'h')]

The parentheses form a hierarchy of groups. At the outermost level, we get the entire line that has a hitchhiker-like word. At the next level, we get the actual hitchhiker-like word. And at the innermost level, we get the individual H characters.
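
The same hierarchy can be read off a single match object. The sketch below uses a made-up sample line rather than the file contents; the second pattern, with non-capturing (?:...) groups, is an illustrative variation and not part of the original example.

```python
import re

# Made-up sample line, for illustration only
line = "The Hitch Hiker's Guide"

# Same nesting as the pattern with parentheses: line, word, then each H
nested = re.compile('(.*((H|h)itch *(H|h)iker).*)')
m = nested.search(line)

# Groups are numbered by the position of their opening parenthesis
whole_line = m.group(1)
word = m.group(2)
first_h, second_h = m.group(3), m.group(4)
print(whole_line, word, first_h, second_h, sep=' | ')

# Non-capturing groups (?:...) still group, but are left out of the result
lean = re.compile('.*((?:H|h)itch *(?:H|h)iker).*')
print(lean.findall(line))
```

With only one capturing group left, findall returns plain strings instead of tuples.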

Flags in REs

We can also compile REs with flags. These are given as a second argument for the re.compile function and specify variants on how the RE compilation is to be done. For example, we could have used a flag to make our RE even simpler.

In [9]:
# Define the regex pattern
pattern = '.*hitch *hiker.*'
regex = re.compile(pattern, re.IGNORECASE)

# Get list: each item is string with line that has variant of hitchhiker in it
regex.findall(hh_string)
Out[9]:
["The Hitch Hiker's Guide to the Galaxy has a few things to",
 'an interstellar hitchhiker can have.  Partly it has great',
 'a strag (strag: non-hitch hiker) discovers that a hitch',
 'Douglas Adams:  "The Hitchhiker\'s Guide to the Galaxy" ']

The re.IGNORECASE flag allowed us to avoid spelling out both cases of H in the RE. It just tells re.compile to treat lowercase and uppercase characters the same. Let's have a look at the available flags:

flag           Description
re.DEBUG       Displays debugging information about the compiled expression.
re.IGNORECASE  Case-insensitive matching.
re.MULTILINE   ^ and $ also match at the beginning and end of each line, respectively.
re.DOTALL      Allows . to match any character, including the newline character.
re.VERBOSE     Allows the usage of comments; everything from a # to the end of the line is ignored. This flag also ignores non-escaped whitespace (i.e., whitespace without a preceding \). This improves the readability of REs.

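To combine flags, separate them with a vertical bar (the bitwise OR operator, which we did not cover in our discussion of operators). Note that under re.VERBOSE an unescaped space in the pattern is ignored, so a literal space has to be written as [ ] or escaped. E.g.,

my_regex_query = re.compile(r"hitch[ ]*hiker", re.IGNORECASE | re.VERBOSE)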
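
A short sketch of what the flags do in practice, using a made-up two-line string. It assumes nothing beyond the standard re module; the variable names are chosen for illustration.

```python
import re

# Made-up two-line sample, for illustration only
text = "Hitch Hiker\nhitchhiker"

# re.VERBOSE lets us comment the pattern, but it also ignores unescaped
# whitespace, so the literal space is written as the set [ ]
verbose_regex = re.compile(r"""
    hitch    # first half of the word
    [ ]*     # zero or more literal spaces
    hiker    # second half of the word
    """, re.IGNORECASE | re.VERBOSE)
print(verbose_regex.findall(text))

# With re.MULTILINE, ^ matches at the start of every line
line_starts = re.findall('^h', text, re.IGNORECASE | re.MULTILINE)
print(line_starts)

# Without re.MULTILINE, ^ only matches at the start of the string
string_start = re.findall('^h', text, re.IGNORECASE)
print(string_start)
```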

Metacharacters in Python REs

In the command line example using grep, we got acquainted with a few metacharacters that help us improve our pattern matching. Let's have a look at all RE metacharacters that can be used in Python:

Metacharacter  Description
. (dot)        The ultimate wildcard. It matches any character other than the newline character (\n). If it is desirable to also match \n, the alternative mode (re.DOTALL) can be invoked.
^ (caret)      Matches the start of the string. In re.MULTILINE mode it also matches the position immediately after each newline character.
$              Similar to the caret, but for the end of the string (or just before a newline at the very end of the string). In re.MULTILINE mode it also matches just before each newline character.
*              The Kleene star * following an RE allows zero or more repetitions of that expression. ab*c will match ac, abc, abbc, abbbc, ...
+              Similar to the Kleene star, but it matches one or more occurrences of the preceding RE; thus ab+c matches abc, abbc, abbbc, but not ac.
?              Matches 0 or 1 repetitions of the preceding RE. ab? matches a and ab.
{m}            Matches exactly m repeats. a{3} is the same as aaa.
{m,n}          Matches m to n repeats; a{2,4} matches aa, aaa, and aaaa. The lower and upper bounds are optional: a{,4} is the same as a{0,4}. Omitting the upper bound, a{4,} matches four or more repetitions of a.
[]             Square brackets describe a set of characters, e.g., [atcg] matches a, t, c, or g, and [a-z] matches any lowercase ASCII letter. [0-2][0-9] matches all two-digit numbers from 00 to 29. It is important to note that most metacharacters lose their special function within sets. Thus, [(a*b+)] matches (, a, *, b, +, or ).
\              The escape character \ makes sure that the following metacharacter is interpreted literally. The Kleene star (*), for example, will be interpreted as a simple asterisk if prefaced by the escape character (\*).
|              Logical or: A|B matches either A or B.
(...)          Matches whatever regular expression is inside the parentheses. As we discussed, these serve to describe groupings.
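
To make the repetition metacharacters concrete, here is a small sketch that tests a few candidate strings against several patterns. The patterns ab?c and ab{2,3}c are variations chosen for illustration.

```python
import re

# Candidate strings, following the ab*c example in the table
candidates = ['ac', 'abc', 'abbc', 'abbbc']

# fullmatch() requires the whole string to match the pattern
star = [s for s in candidates if re.fullmatch('ab*c', s)]
plus = [s for s in candidates if re.fullmatch('ab+c', s)]
optional = [s for s in candidates if re.fullmatch('ab?c', s)]
counted = [s for s in candidates if re.fullmatch('ab{2,3}c', s)]

print(star)      # all four strings
print(plus)      # everything except 'ac'
print(optional)  # 'ac' and 'abc'
print(counted)   # 'abbc' and 'abbbc'

# Inside a set, | is an ordinary character, not a logical OR
pipe_in_set = re.findall('[H|h]', 'H|h')
plain_set = re.findall('[Hh]', 'H|h')
print(pipe_in_set)
print(plain_set)
```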

Searching, Matching, Splitting, and more

Once we are happy with our compiled RE we can deploy it in a number of ways. Compilation gives us a pattern object (shown as SRE_Pattern in older Python versions and as re.Pattern in current ones) that has methods for searching strings. In the table below, we will assume the compiled object is called regex.

action                            Description
regex.search(string)              Scans through string and returns a match object for the first match, or None if there is no match.
regex.match(string)               Returns a match object if zero or more characters at the beginning of string match the RE, otherwise None.
regex.fullmatch(string)           Returns a match object if the whole string matches the RE, otherwise None. New in Python 3.4.
regex.findall(string)             Returns a list of all non-overlapping matches in the string. If the RE contains groups, the entries hold the group matches instead: a string for a single group, a tuple with one item per group for several groups.
regex.finditer(string)            Same as regex.findall(), except that it returns an iterator that yields match objects instead of a list of strings.
regex.split(string, maxsplit=0)   Splits the string into a list at each occurrence of the pattern; see the example below.

Flags are fixed when the RE is compiled, so these methods do not take a flags argument; only the module-level functions such as re.search(pattern, string, flags=0) do.
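
The following sketch runs each of these methods on one made-up string so that the return types can be compared side by side. The sample text and the variable names are assumptions made for illustration.

```python
import re

# Made-up sample string, for illustration only
text = 'towel, towels and a hitchhiker'
regex = re.compile('towels?')

# search() returns a match object for the first match anywhere
first = regex.search(text)
print(first.group(), first.span())

# match() only looks at the beginning; fullmatch() needs the whole string
matches_at_start = regex.match(text) is not None
matches_everything = regex.fullmatch(text) is not None
print(matches_at_start, matches_everything)

# findall() returns strings; finditer() yields match objects
all_strings = regex.findall(text)
all_spans = [m.span() for m in regex.finditer(text)]
print(all_strings)
print(all_spans)

# split() cuts the string wherever the pattern matches
pieces = re.compile(', | and ').split(text)
print(pieces)
```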

Combinations of metacharacters in RE

Greedy vs. non-greedy

+, *, and ? match as much text as possible. This behavior is referred to as greedy. Adding a ? after these quantifiers renders them non-greedy, yielding the shortest possible match from the same starting point. For example, applying the RE (K.*F) to this amino acid sequence: MKKSLVFAFFAFFLSL yields: KKSLVFAFFAFF whereas (K.*?F) would yield: KKSLVF. In both cases the match starts at the leftmost K; only the end of the match changes.
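
A quick way to check this behavior is to run both variants on the sequence; a minimal sketch:

```python
import re

seq = 'MKKSLVFAFFAFFLSL'

# Greedy: .* runs as far as it can while still leaving an F to match
greedy = re.search('K.*F', seq).group()

# Non-greedy: .*? stops at the first F after the leftmost K
lazy = re.search('K.*?F', seq).group()

print(greedy)
print(lazy)
```

The search always begins at the leftmost position where a match is possible, so both results start with the first K in the sequence.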

More escaping

Let's have a look at how a word is defined before having a look at more escape options. A word is a sequence of Unicode alphanumeric or underscore characters. Examples are:

Hello_world
P4ssw0rd

A pattern made only of ordinary characters matches that exact text wherever it occurs, even inside a longer word. We can further specify what is matched, e.g., whole words only, with the escape sequences below.

Escape   Description
\number  Matches the contents of the group of the same number again. For example, applying (.+) \1 to the string Homo sapiens sapiens matches sapiens sapiens.
\A       Matches the start of a string.
\b       Matches the empty string at the beginning or end of a word. Thus using towel\b in the example above would only yield lines with the word towel and not towels.
\B       Opposite of \b. Thus towel\B would match the towel in towels.
\d       Matches any Unicode decimal digit.
\D       Opposite of \d (Are you seeing a pattern?)
\s       Matches whitespace characters
\S       any guess?
\w       Matches Unicode word characters
\W       your turn again:
\Z       Matches only the end of the string
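
A few of these escapes in action, as a sketch on made-up strings. Raw strings (the r prefix) are used so that Python passes the backslashes through to the re module unchanged.

```python
import re

# Made-up sample text, for illustration only
text = 'A towel, 2 towels, and 42 more'

# \b marks a word boundary, so towel\b skips "towels"
whole_word = re.findall(r'towel\b', text)

# \B is the opposite: towel must be followed by another word character
inside_word = re.findall(r'towel\B', text)

# \d matches digits, \w matches word characters
numbers = re.findall(r'\d+', text)
words = re.findall(r'\w+', 'Hello_world P4ssw0rd')

# \1 refers back to whatever group 1 matched
repeated = re.search(r'(.+) \1', 'Homo sapiens sapiens').group()

print(whole_word, inside_word)
print(numbers, words)
print(repeated)
```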

The very powerful question mark

With the help of ? we can expand the functionality of parentheses. The general syntax is (?...). We will not get into these here, but the table below gives a summary, and more detail can be found in the re module documentation.

(?...)                              Description
(?aiLmsux)                          One or more letters from the set a, i, L, m, s, u, x; sets the corresponding flags for the whole RE.
(?:...)                             Non-capturing version of the regular parentheses
(?P<name>...)                       The matched string is accessible by the symbolic group name name.
(?P=name)                           A backreference: matches the same text that was matched by the group (?P<name>...)
(?#...)                             A comment, contents are ignored
(?!...)                             A negative lookahead assertion: matches only if ... does not match at the current position. It is the opposite of the positive lookahead (?=...).
(?<=...)                            A positive lookbehind assertion; for example, applying the RE (?<=[HKRED])[A-Z] to an amino acid sequence yields residues with a preceding charged residue
(?<!...)                            Opposite of (?<=...) (another pattern?)
(?(id/name)yes-pattern|no-pattern)  Matches with yes-pattern if the group with the given id or name exists, and with no-pattern if it doesn't. The no-pattern part is optional.
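
As a brief sketch of three of these extensions, the example below uses Python's named-group syntax, a lookbehind, and a lookahead. The description string is a shortened, made-up line in the style of a FASTA header, and the group names genus and species are chosen for illustration.

```python
import re

# Named groups use the (?P<name>...) syntax in Python
organism = re.compile(r'\[(?P<genus>\w+) (?P<species>\w+)\]')
m = organism.search('substrate-binding protein [Campylobacter jejuni]')
genus, species = m.group('genus'), m.group('species')
print(genus, species)

# Positive lookbehind: residues that directly follow a charged residue
seq = 'MKKSLVFAFFAFFLSL'
after_charged = re.findall(r'(?<=[HKRED])[A-Z]', seq)
print(after_charged)

# Negative lookahead: every K that is not followed by another K
lone_k = re.findall(r'K(?!K)', seq)
print(lone_k)
```

Lookaround assertions check the neighboring text without consuming it, which is why the lookbehind can report both the second K and the S that follows it.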

A shortcut to avoid compiling

If you are not going to use your compiled regex over and over again (you often will), you may want to use some of the functions in the re module directly. For example, if we wanted to find all occurrences of a pattern in a string, we have learned to do this:

regex = re.compile(pattern, flags=0)
regex.findall(string)

We could equivalently do this:

re.findall(pattern, string, flags=0)

They give exactly the same results.

In [10]:
# Set up pattern and compile it
pattern = '.*hitch *hiker.*'
regex = re.compile(pattern, flags=re.IGNORECASE)

# Precompiled one
print('regex.findall(hh_string):')
print(regex.findall(hh_string))

# Non-precompiled version
print('\nre.findall(pattern, hh_string, flags=re.IGNORECASE):')
print(re.findall(pattern, hh_string, flags=re.IGNORECASE))
regex.findall(hh_string):
["The Hitch Hiker's Guide to the Galaxy has a few things to", 'an interstellar hitchhiker can have.  Partly it has great', 'a strag (strag: non-hitch hiker) discovers that a hitch', 'Douglas Adams:  "The Hitchhiker\'s Guide to the Galaxy" ']

re.findall(pattern, hh_string, flags=re.IGNORECASE):
["The Hitch Hiker's Guide to the Galaxy has a few things to", 'an interstellar hitchhiker can have.  Partly it has great', 'a strag (strag: non-hitch hiker) discovers that a hitch', 'Douglas Adams:  "The Hitchhiker\'s Guide to the Galaxy" ']

Example: using REs with sequence data

As an example use of regular expressions, we will parse a GenBank entry. The file genbank_seq.txt contains the sequence portion of a GenBank entry.

In [12]:
# On my machine it is stored at ../data/genbank_seq.txt.
!cat ../data/genbank_seq.txt
        1 mkkillsvlt afvavvlaac ggnsdsktln sldkikqngv vrigvfgdkp pfgyvdekgn
       61 nqgydialak riakelfgde nkvqfvlvea anrveflksn kvdiilanft qtpqraeqvd
      121 fcspymkval gvavpkdsni tsvedlkdkt lllnkgttad ayftqnypni ktlkydqnte
      181 tfaalmdkrg dalshdntll fawvkdhpdf kmgikelgnk dviapavkkg dkelkefidn
      241 liiklgqeqf fhkaydetlk ahfgddvkad dvvieggki

We have a sequence of amino acids, but it is broken down into groups of 10 residues and the beginning of each line is annotated. Note that the last group only has nine residues. We would like to get a string that contains only the sequence information. In this case, we just want the strings that match letters.

In [13]:
# Read in the string from the file
with open("../data/genbank_seq.txt", "r") as myfile:
    gb_input = myfile.read()

# Compile our regex search
aa = re.compile('([a-z]+)')

# Get a list of all the segments
list_of_seq_segments = aa.findall(gb_input)

print(list_of_seq_segments)
['mkkillsvlt', 'afvavvlaac', 'ggnsdsktln', 'sldkikqngv', 'vrigvfgdkp', 'pfgyvdekgn', 'nqgydialak', 'riakelfgde', 'nkvqfvlvea', 'anrveflksn', 'kvdiilanft', 'qtpqraeqvd', 'fcspymkval', 'gvavpkdsni', 'tsvedlkdkt', 'lllnkgttad', 'ayftqnypni', 'ktlkydqnte', 'tfaalmdkrg', 'dalshdntll', 'fawvkdhpdf', 'kmgikelgnk', 'dviapavkkg', 'dkelkefidn', 'liiklgqeqf', 'fhkaydetlk', 'ahfgddvkad', 'dvvieggki']

We can now join all of the items in this list into a single string to get the full amino acid sequence.

In [14]:
# combine the strings in the list to a single string
seq = ''.join(list_of_seq_segments)
seq
Out[14]:
'mkkillsvltafvavvlaacggnsdsktlnsldkikqngvvrigvfgdkppfgyvdekgnnqgydialakriakelfgdenkvqfvlveaanrveflksnkvdiilanftqtpqraeqvdfcspymkvalgvavpkdsnitsvedlkdktlllnkgttadayftqnypniktlkydqntetfaalmdkrgdalshdntllfawvkdhpdfkmgikelgnkdviapavkkgdkelkefidnliiklgqeqffhkaydetlkahfgddvkaddvvieggki'

Another strategy would be to split the string at whitespace.

In [15]:
# Split string at runs of whitespace (raw string keeps \s intact)
temp = re.split(r'\s+', gb_input)
print(temp)
['', '1', 'mkkillsvlt', 'afvavvlaac', 'ggnsdsktln', 'sldkikqngv', 'vrigvfgdkp', 'pfgyvdekgn', '61', 'nqgydialak', 'riakelfgde', 'nkvqfvlvea', 'anrveflksn', 'kvdiilanft', 'qtpqraeqvd', '121', 'fcspymkval', 'gvavpkdsni', 'tsvedlkdkt', 'lllnkgttad', 'ayftqnypni', 'ktlkydqnte', '181', 'tfaalmdkrg', 'dalshdntll', 'fawvkdhpdf', 'kmgikelgnk', 'dviapavkkg', 'dkelkefidn', '241', 'liiklgqeqf', 'fhkaydetlk', 'ahfgddvkad', 'dvvieggki', '']

Admittedly, this is not the most efficient way, but it illustrates the power of re.split(). Splitting at whitespace is commonly done when parsing other types of text files. By combining the relevant list elements we get the sequence. We can identify the segments that are sequence elements because they are strings consisting only of letters. We use the isalpha method of strings to do this.

In [16]:
# Initialize list of sequence segments
segment_list = []

# Loop through each segment
for s in temp:
    # Keep a string if it consists of letters.
    if s.isalpha():
        segment_list.append(s)
        
# Join the list to get the resultant string
''.join(segment_list)
Out[16]:
'mkkillsvltafvavvlaacggnsdsktlnsldkikqngvvrigvfgdkppfgyvdekgnnqgydialakriakelfgdenkvqfvlveaanrveflksnkvdiilanftqtpqraeqvdfcspymkvalgvavpkdsnitsvedlkdktlllnkgttadayftqnypniktlkydqntetfaalmdkrgdalshdntllfawvkdhpdfkmgikelgnkdviapavkkgdkelkefidnliiklgqeqffhkaydetlkahfgddvkaddvvieggki'

And, just to show off, we can do this in a single line. Here, I'm using a list comprehension, which we will not cover in the bootcamp.

In [17]:
''.join([s for s in temp if s.isalpha()])
Out[17]:
'mkkillsvltafvavvlaacggnsdsktlnsldkikqngvvrigvfgdkppfgyvdekgnnqgydialakriakelfgdenkvqfvlveaanrveflksnkvdiilanftqtpqraeqvdfcspymkvalgvavpkdsnitsvedlkdktlllnkgttadayftqnypniktlkydqntetfaalmdkrgdalshdntllfawvkdhpdfkmgikelgnkdviapavkkgdkelkefidnliiklgqeqffhkaydetlkahfgddvkaddvvieggki'

Using match() and search()

Let's have a look at the outputs of aa.match() and aa.search(). We'll start with aa.match().

In [18]:
match_out = aa.match(gb_input)
print(match_out)
None

Since the RE doesn't match the beginning of the string the result is None. Let's instead just try to match anything using the .+ pattern.

In [19]:
# The following RE will yield the first line. 
# Adding the flag re.DOTALL would return the entire string
aa_gaps = re.compile('.+')
match_out = aa_gaps.match(gb_input)
print(match_out)
<_sre.SRE_Match object; span=(0, 75), match='        1 mkkillsvlt afvavvlaac ggnsdsktln sldkik>

This is an SRE_Match object (called re.Match in current Python versions). How do we use it? This object has several methods, and we can see what they do by example.

In [20]:
# access the match
match_out.group()
Out[20]:
'        1 mkkillsvlt afvavvlaac ggnsdsktln sldkikqngv vrigvfgdkp pfgyvdekgn'
In [21]:
# Access the location of the substring
match_out.span()
Out[21]:
(0, 75)

Use of aa.search() is similar. Remember, aa.search() does not just look at the beginning of the string, but scans the entire string for a match.

In [22]:
# Search for the first block of characters
search_out = aa.search(gb_input)

# Show result
print(search_out.group())
print(search_out.span())
mkkillsvlt
(10, 20)

Find and replace with regexes

Now, we'll use a powerful tool: find and replace using a regex. For this example, we will find all positively charged residues (arginine (r), lysine (k), and histidine (h)) and replace them with a + sign. To do this, we use the sub() method of a compiled regex.

In [23]:
# Compile our search string
positive_residues = re.compile('[rkh]')

# Find and replace
positive_residues.sub('+', seq)
Out[23]:
'm++illsvltafvavvlaacggnsds+tlnsld+i+qngvv+igvfgd+ppfgyvde+gnnqgydiala++ia+elfgden+vqfvlveaan+vefl+sn+vdiilanftqtpq+aeqvdfcspym+valgvavp+dsnitsvedl+d+tllln+gttadayftqnypni+tl+ydqntetfaalmd++gdals+dntllfawv+d+pdf+mgi+elgn+dviapav++gd+el+efidnlii+lgqeqff++aydetl+a+fgddv+addvviegg+i'
In [24]:
# In addition to substituting, subn also counts the number of substitutions.
# It returns a tuple consisting of the modified string and the number of 
# substitutions
pos_seq, n_pos = positive_residues.subn('+', seq)

# Print the result
print("The following sequence contains", n_pos, 
      "positively charged residues:\n")
print(pos_seq)
The following sequence contains 42 positively charged residues:

m++illsvltafvavvlaacggnsds+tlnsld+i+qngvv+igvfgd+ppfgyvde+gnnqgydiala++ia+elfgden+vqfvlveaan+vefl+sn+vdiilanftqtpq+aeqvdfcspym+valgvavp+dsnitsvedl+d+tllln+gttadayftqnypni+tl+ydqntetfaalmd++gdals+dntllfawv+d+pdf+mgi+elgn+dviapav++gd+el+efidnlii+lgqeqff++aydetl+a+fgddv+addvviegg+i

The previous examples illustrate that REs are a powerful tool to parse data. Moreover, they can be of great use in analysing sequences.
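
As a small extension of the find-and-replace idea, sub() and subn() also accept a function in place of the replacement string. The sketch below marks positive residues with + and negative ones (aspartate d and glutamate e) with -. The fragment and the helper name charge_symbol are made up for illustration; marking negative residues is an addition that goes beyond the example above.

```python
import re

# Short made-up fragment, for illustration only
fragment = 'mkkillsvltdekgn'

def charge_symbol(match):
    """Return + for a positive residue and - for a negative one."""
    return '+' if match.group() in 'rkh' else '-'

# sub()/subn() call the function once per match and insert its return value
charged = re.compile('[rkhde]')
marked, n_charged = charged.subn(charge_symbol, fragment)

print(marked)
print(n_charged)
```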

Example 2: Using REs to parse sequence FASTA files

The file aligned.fasta contains seven aligned sequences in FASTA format. Let's have a look:

In [25]:
!cat ../data/aligned.fasta
>gi|488942278|ref|WP_002853353.1|:1-279 glutamine ABC transporter substrate-binding protein [Campylobacter jejuni]
MKKILLSVLTAFVAVVLAACGG-------NSDSKTLNSLDKIKQNGVVRIGVFGDKPPFG
YVDEKGNNQGYDIALAKRIAKELFGDENKVQFVLVEAANRVEFLKSNKVDIILANFTQTP
QRAEQVDFCSPYMKVALGVAVPKDSNITSVEDLKDKTLLLNKGTTADAYFTQNYPNIKTL
KYDQNTETFAALMDKRGDALSHDNTLLFAWVKDHPDFKMGIKELGNKDVIAPAVKKGDKE
LKEFIDNLIIKLGQEQFFHKAYDETLKAHFGDDVKADDVVIEGGKI
>gi|533111561|ref|WP_020974739.1|:1-279 amino-acid transporter periplasmic solute-binding protein [Campylobacter coli]
MKKMLLSIFTTFVAVFLAACGG-------NSDSNALNSLEKIKQEGVVRIGVFGDKPPFG
YVDEKGANQGYDIVLAKRIAKELLGDENKVQFVLVEAANRVEFLKSNKVDIILANFTQTP
ERAEQVDFCLPYMKVALGVAVPQDSNISSVEDLKDKTLLLNKGTTADAYFTKEYPDIKTL
KYDQNTETFAALMDQRGDALSHDNTLLFAWVKDHPEFKMAIKELGNKDVIAPAVKKGNKE
LKEFIDNLIVKLGEEQFFHKAYEETLKTHFGDDVKADDVVIEGGKI
>gi|736962278|ref|WP_034958763.1|:1-277 ABC transporter substrate-binding protein [Campylobacter upsaliensis]
MKKILLSIFTAFVAVFLAAC---------DSSESGVNSIERIKNAGVVKIGVFGDKPPFG
YVDEKGANQGYDIIFAKRIAKELLGDENKVEFVLVEAANRVEFLKSNKVDIILANFTQTP
ERAEQVDFALPYMKVALGVVVPEDSEIKSVEDLKDKTLILNKGTTADAYFTKNYADIKTL
KFDQNTETFAALMDKRGDALAHDNTLLFAWVKERPDYKVVIKELGNQDVIAPAVKKGDKE
LKEFIDNLIISLAAEQFFHKAYDESLKAHFGADIKADDVVIEGGKL
>gi|746591941|ref|WP_039618158.1|:1-275 ABC transporter substrate-binding protein [Campylobacter lari]
MKKIFL--LSFLMALFFSAC---------SNSSSNENSIEKIKQQGVIRIGVFGDKPPFG
YLDAQGKNQGYDVYFAKRIAKELLGDESKVQFVLVEAANRVEFLESNKVDLILANFTKTP
EREAVVDFAFPYMKVALGVVAPKGSDIKTIDDLKSKTLILNKGTTADAYFTKNMPEIKTI
KFDQNTETFAALIGKRGDALSHDNALLFAWAKENPNFEVVIKELGNHDVIAPAVKKGDEA
MLKFINDLIVKLQNEQFFHKAYDETLKPFFSDDIKADDVVIEGGKI
>gi|757802533|ref|WP_043019716.1|:1-275 ABC transporter substrate-binding protein [Campylobacter subantarcticus]
MKKIFL--LSFLMTLFFSAC---------SNSSSNENSIEKIKQQGVIRIGVFGDKPPFG
YLDAQGKNQGYDVYFAKRITKELLGDESKVQFVLVEAANRVEFLESNKVDLILANFTKTP
EREAVVDFALPYMKVALGVVAPKNSDIKTVDDLKNKTLIINKGTTADAYFTKNMPEIKTI
KFDQNTETFAALIGKRGDALSHDNALLFAWAKENPNFEVVIKELGNHDVIAPAVKKGDEA
MLKFINDLILKLQNEQFFHKAYDETLKPFFSDDIKADDVVIEGGKI
>gi|763018407|ref|WP_043902698.1|:13-275 glutamine ABC transporter substrate-binding protein [Helicobacter cetorum]
--------LLVLVTLIFNAC----------SDKPKLDALDSIKQKGVVRIGVFSDKPPFG
FVDSKGAYQGFDVYIAKRMAKDLLGDENKIEFVPVEASARVEFLKANKVDIIMANFTQTN
ERKEVVDFAKPYMKVALGVVS-KNGMIKDIEELKDKTLIVNKGTTADFYFTKNYPNIKLL
KFEQNTETFLALLNNRGEALAHDNTLLFAWAKQHPEFKVAITSLGDKDVIAPAIKKGNPK
LLEWLNNEVQQLINEGFLKEAYKETLEPVYGSDIKSEEIVFE----
>gi|654467138|ref|WP_027937534.1|:5-289 hypothetical protein [Anaeroarcus burkinensis]
-KGIVLIFSVLFIVGMLAGCGSSAKQGDQKDAAAAKSSIEEIKQRGVLRVGVFSDKPPFG
FVDKSGKNQGFDVVIAKRFAKDLLGDETKIEFVLVEAANRVEVLQSNKVDITMANFTVTD
ERKQKVDFANPYMKVYLGVVSPSGTPITSVEQLKGKKLIVNKGTTAETYFTKNHPDIELL
KYDQNTEAFEALKDNRGAALAHDNTLLFAWAKENTGYQVGIPTLGGQDTIAPAVKKGNKE
LLDWVNTELETLGKEKFIHKAYDETLKPAYGDSINPEDIVVEGGKL

Each entry starts with a description line, marked by a > sign, that describes the source of the sequence; the aligned sequence of amino acids follows. As we see, the first sequence is described by

>gi|488942278|ref|WP_002853353.1|:1-279 glutamine ABC transporter substrate-binding protein [Campylobacter jejuni]

It is often convenient to just have a short species identifier instead of the full description. So, we would replace the above description with

>C. jejuni

To do this, we need to do the following steps.

  1. Read the FASTA file into a string
  2. Split the string into entries
  3. For each entry, find the organism name.
  4. Construct an abbreviation for the organism name.
  5. Replace the description line with the abbreviation.
In [26]:
# read the aligned fasta file
with open("../data/aligned.fasta", "r") as myfile:
    aln = myfile.read()

# add an extra delimiter to enable splitting
aln = re.sub('>', 'delimiter>', aln)

# Use re.split to create a list of fasta entries.
aln = re.split('delimiter', aln)

# remove the first empty string from the list, since it got split
aln = aln[1:]
    
# RE to find the organism name: look for text in brackets and make
# convenient groups for parsing genus and species
get_organism = re.compile(r'\[(\w+) (\w+)\]')

# RE that defines the line to be edited
def_line = re.compile('(>.*)')

# A list that we'll concatenate into the file
edited_list = []
for fasta in aln:
    # Get the organism name
    org = get_organism.search(fasta)

    # Abbreviate the organism's name in the desired form
    abbr = '>' + org.group(1)[0] + '. ' + org.group(2)
    
    # Replace the original sequence description with abbreviated organism name.
    out = def_line.sub(abbr, fasta)

    # Add our new string to the list
    edited_list.append(out)

# Write out the result
with open('edited_fasta.fasta', 'w') as outfile:
    outfile.write(''.join(edited_list))

Let's see how we did!

In [27]:
!cat edited_fasta.fasta
>C. jejuni
MKKILLSVLTAFVAVVLAACGG-------NSDSKTLNSLDKIKQNGVVRIGVFGDKPPFG
YVDEKGNNQGYDIALAKRIAKELFGDENKVQFVLVEAANRVEFLKSNKVDIILANFTQTP
QRAEQVDFCSPYMKVALGVAVPKDSNITSVEDLKDKTLLLNKGTTADAYFTQNYPNIKTL
KYDQNTETFAALMDKRGDALSHDNTLLFAWVKDHPDFKMGIKELGNKDVIAPAVKKGDKE
LKEFIDNLIIKLGQEQFFHKAYDETLKAHFGDDVKADDVVIEGGKI
>C. coli
MKKMLLSIFTTFVAVFLAACGG-------NSDSNALNSLEKIKQEGVVRIGVFGDKPPFG
YVDEKGANQGYDIVLAKRIAKELLGDENKVQFVLVEAANRVEFLKSNKVDIILANFTQTP
ERAEQVDFCLPYMKVALGVAVPQDSNISSVEDLKDKTLLLNKGTTADAYFTKEYPDIKTL
KYDQNTETFAALMDQRGDALSHDNTLLFAWVKDHPEFKMAIKELGNKDVIAPAVKKGNKE
LKEFIDNLIVKLGEEQFFHKAYEETLKTHFGDDVKADDVVIEGGKI
>C. upsaliensis
MKKILLSIFTAFVAVFLAAC---------DSSESGVNSIERIKNAGVVKIGVFGDKPPFG
YVDEKGANQGYDIIFAKRIAKELLGDENKVEFVLVEAANRVEFLKSNKVDIILANFTQTP
ERAEQVDFALPYMKVALGVVVPEDSEIKSVEDLKDKTLILNKGTTADAYFTKNYADIKTL
KFDQNTETFAALMDKRGDALAHDNTLLFAWVKERPDYKVVIKELGNQDVIAPAVKKGDKE
LKEFIDNLIISLAAEQFFHKAYDESLKAHFGADIKADDVVIEGGKL
>C. lari
MKKIFL--LSFLMALFFSAC---------SNSSSNENSIEKIKQQGVIRIGVFGDKPPFG
YLDAQGKNQGYDVYFAKRIAKELLGDESKVQFVLVEAANRVEFLESNKVDLILANFTKTP
EREAVVDFAFPYMKVALGVVAPKGSDIKTIDDLKSKTLILNKGTTADAYFTKNMPEIKTI
KFDQNTETFAALIGKRGDALSHDNALLFAWAKENPNFEVVIKELGNHDVIAPAVKKGDEA
MLKFINDLIVKLQNEQFFHKAYDETLKPFFSDDIKADDVVIEGGKI
>C. subantarcticus
MKKIFL--LSFLMTLFFSAC---------SNSSSNENSIEKIKQQGVIRIGVFGDKPPFG
YLDAQGKNQGYDVYFAKRITKELLGDESKVQFVLVEAANRVEFLESNKVDLILANFTKTP
EREAVVDFALPYMKVALGVVAPKNSDIKTVDDLKNKTLIINKGTTADAYFTKNMPEIKTI
KFDQNTETFAALIGKRGDALSHDNALLFAWAKENPNFEVVIKELGNHDVIAPAVKKGDEA
MLKFINDLILKLQNEQFFHKAYDETLKPFFSDDIKADDVVIEGGKI
>H. cetorum
--------LLVLVTLIFNAC----------SDKPKLDALDSIKQKGVVRIGVFSDKPPFG
FVDSKGAYQGFDVYIAKRMAKDLLGDENKIEFVPVEASARVEFLKANKVDIIMANFTQTN
ERKEVVDFAKPYMKVALGVVS-KNGMIKDIEELKDKTLIVNKGTTADFYFTKNYPNIKLL
KFEQNTETFLALLNNRGEALAHDNTLLFAWAKQHPEFKVAITSLGDKDVIAPAIKKGNPK
LLEWLNNEVQQLINEGFLKEAYKETLEPVYGSDIKSEEIVFE----
>A. burkinensis
-KGIVLIFSVLFIVGMLAGCGSSAKQGDQKDAAAAKSSIEEIKQRGVLRVGVFSDKPPFG
FVDKSGKNQGFDVVIAKRFAKDLLGDETKIEFVLVEAANRVEVLQSNKVDITMANFTVTD
ERKQKVDFANPYMKVYLGVVSPSGTPITSVEQLKGKKLIVNKGTTAETYFTKNHPDIELL
KYDQNTEAFEALKDNRGAALAHDNTLLFAWAKENTGYQVGIPTLGGQDTIAPAVKKGNKE
LLDWVNTELETLGKEKFIHKAYDETLKPAYGDSINPEDIVVEGGKL

Success!
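
For comparison, the same relabeling can be sketched as a single substitution with group references. This is an alternative under stated assumptions, not the approach used above: it assumes every description line ends with the organism name in brackets with no trailing whitespace, and it uses two shortened, made-up records instead of reading aligned.fasta.

```python
import re

# Two shortened, made-up records in the same layout as aligned.fasta
fasta_text = (
    '>gi|1|ref|X.1|:1-9 some protein [Campylobacter jejuni]\n'
    'MKKILLSVL\n'
    '>gi|2|ref|Y.1|:1-9 other protein [Helicobacter cetorum]\n'
    '--------L\n'
)

# With re.MULTILINE, ^ and $ anchor to each line, so only description
# lines (those starting with >) are rewritten
description = re.compile(r'^>.*\[(\w)\w* (\w+)\]$', re.MULTILINE)

# \1 is the first letter of the genus, \2 the species name
short_text = description.sub(r'>\1. \2', fasta_text)
print(short_text)
```

Because . does not match a newline, the .* in the pattern cannot run past the end of a description line, and the sequence lines are left untouched.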

Just scratching the surface

Even though we've just scratched the surface, I hope this tutorial has shown how powerful you can be if you have a command of regular expressions. The learning curve for these things is steep, but once you've used them for a while, they become second nature and parsing text is not a daunting task!