This tutorial was generated from a Jupyter notebook. You can download the notebook here.
# to use regular expressions in Python, we need to import the re module
import re
In this tutorial, we will learn about very useful and pervasive text processing tools called regular expressions, a.k.a. RE, re, regex, regexp, or regex patterns. They are tools for specifying text matching patterns. The desired patterns can then be quickly and automatically extracted from possibly large amounts of text. This is particularly useful for parsing files in bioinformatics.
The underlying theory of regexes dates back to 1956. It was developed by Stephen Cole Kleene as part of theoretical computer science's effort to describe regular languages. In the late 1960s and early 1970s regexes got broader attention because they could be used in the lexical analysis stage of compilers and for pattern matching in text editors such as `ed`. The latter use is equivalent to pressing Ctrl+F or Command+F today to search a document for text.
Regular expressions are in themselves a very specialized and compact programming language that can be employed straight away in a number of command line tools such as `awk`, `sed`, and `grep`. Higher level programming languages such as Perl and Python employ a slightly modified version. To use them in Python we need to import the `re` module.
Before we move on to see how regexes are used in Python let's first have a look at a simple example of regexes at work:
The file `towels.txt` contains some relevant information regarding towels. We will use `grep` on the command line to print lines that match a regular expression. The syntax for grep is:
grep "some search string" some_file.txt
(The name grep is inspired by the `ed` command `g/re/p`: globally search for a regular expression and print the matching lines.)
First of all, a quick look at the text file. For this we use `cat`. On my system the file is stored at `../data/towels.txt`.
!cat '../data/towels.txt'
Since the text concerns towels we might be interested in getting every line that contains the term "towel".
!grep "towel" ../data/towels.txt
Please note that this finds lines that contain the string "towel" as well as the string "towels".
A more interesting problem arises when we want to look up lines that say something about hitchhikers. A quick look at the text above shows us that Douglas Adams enjoyed his poetic liberty as far as the spelling of the word "hitchhiker" was concerned. We find the following versions: "Hitch Hiker", "hitchhiker", "hitch hiker", and "Hitchhiker".
Can regexes help us to find all the different spellings?
!grep "\(H\|h\)itch *\(H\|h\)iker" ../data/towels.txt
The escape character `\` tells grep to treat the following character differently. The pipe character `|` is a logical OR. Thus, the expression `\(H\|h\)` is interpreted as "`H` or `h`". The Kleene star, `*`, means that the preceding character occurs zero, one, or more times. In our example this means that the strings hitch and hiker are either separated by one or more spaces, or not separated at all.
Often it is possible to employ a number of different REs to the same end. In more advanced cases, choosing the appropriate RE can have a significant impact on performance. In many other cases, however, the time gained by executing the perfect RE won't make up for the time invested in crafting it.
Python's `re` module enables use of REs within Python programs. This is useful for parsing strings. The standard library documentation is quite good and is a useful resource.
Unlike most other code we write in Python, REs are compiled into a series of bytecodes that are executed by a matching engine written in C, which makes them very fast. After we have compiled a regular expression, the resulting object has methods, such as `match`, that allow us to process strings with it.
We will investigate the syntax of the `re` module, as usual, through example. In this case, we will use an RE to find instances of hitchhiker-like words in the text from Douglas Adams.
# Read file contents as a single string
with open('../data/towels.txt') as f:
    hh_string = f.read()
# Define the regex pattern
pattern = '.*[H|h]itch *[H|h]iker.*'
regex = re.compile(pattern)
# Get list: each item is string with line that has variant of hitchhiker in it
regex.findall(hh_string)
We get a list of lines in the `towels.txt` file that have a hitchhiker-like word in them.
Notice that the syntax is a bit different from the regex we used with `grep` on the command line. First, let's discuss the pattern.
.*[H|h]itch *[H|h]iker.*
Importantly, we do not put the escape character, `\`, in front of the brackets. In Python's `re` module, a backslash instead makes the following metacharacter match literally, e.g., if we are actually looking for a bracket in the string.
# Define the regex pattern
pattern = r'\['
regex = re.compile(pattern)
# Get list of every literal '[' found in the string
regex.findall('Find the[ bracket.')
Looking again at our `.*[H|h]itch *[H|h]iker.*` pattern: the opening and closing `.*` mean that we do not care what comes before or after the hitchhiker-like expression in the line. The ` *` in the middle of the expression means the same as in the command line case: arbitrarily many spaces (including zero) may be between `hitch` and `hiker`.
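To see the pieces of this pattern at work without needing the file, here is a minimal sketch on a few made-up lines. The sample text is an assumption for illustration only; it is not the contents of `towels.txt`.

```python
import re

# Hypothetical sample text; NOT the contents of towels.txt
sample = (
    "The Hitch Hiker's Guide has a few things to say.\n"
    "A towel is massively useful.\n"
    "Any hitchhiker knows where his towel is.\n"
    "Ask a hitch  hiker about it.\n"
)

# Same pattern as in the tutorial. Because '.' does not match a newline,
# each match is confined to a single line.
regex = re.compile('.*[H|h]itch *[H|h]iker.*')
matching_lines = regex.findall(sample)

for line in matching_lines:
    print(line)
```

The line about the towel alone is skipped, while the line with two spaces between "hitch" and "hiker" is still found.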
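We use `[H|h]` to match either an uppercase or a lowercase `H`. Strictly speaking, `|` has no special meaning inside square brackets, so this set also matches a literal pipe character; `[Hh]` is the tighter way to write it. This is in contrast to the regex we used with `grep`, which used parentheses. In Python's `re` module, parentheses serve to form groups. Let's see what happens if we use some parentheses.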
# Define the regex pattern
pattern = '(.*((H|h)itch *(H|h)iker).*)'
regex = re.compile(pattern)
# Get list: each item is string with line that has variant of hitchhiker in it
regex.findall(hh_string)
The parentheses form a hierarchy of groups. At the outermost level, we get the entire line that has a hitchhiker-like word. At the next level, we get the actual hitchhiker-like word. And at the innermost level, we get the individual `H` characters.
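As a small sketch of how the group hierarchy can be inspected, the same pattern can be applied to a single made-up line (the sample string is an assumption, not taken from the file). Groups are numbered by the position of their opening parenthesis.

```python
import re

regex = re.compile('(.*((H|h)itch *(H|h)iker).*)')

# Hypothetical one-line sample
line = "Ask any Hitch hiker about towels."
m = regex.search(line)

print(m.group(1))              # outermost group: the whole line
print(m.group(2))              # next level: the hitchhiker-like word
print(m.group(3), m.group(4))  # innermost groups: the individual H characters

# findall returns one tuple per match, with one entry per group
all_groups = regex.findall(line)
print(all_groups)
```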
We can also compile REs with flags. These are given as a second argument to the `re.compile` function and specify variants on how the RE compilation is to be done. For example, we could have used a flag to make our RE even simpler.
# Define the regex pattern
pattern = '.*hitch *hiker.*'
regex = re.compile(pattern, re.IGNORECASE)
# Get list: each item is string with line that has variant of hitchhiker in it
regex.findall(hh_string)
The `re.IGNORECASE` flag allowed us to avoid having to have `[H|h]` in the RE. It just tells `re.compile` to treat lowercase and uppercase characters the same. Let's have a look at the available flags:
flag | Description |
---|---|
`re.DEBUG` | Displays debugging information about the compiled expression. |
`re.IGNORECASE` | Case-insensitive matching. |
`re.MULTILINE` | `^` and `$` also match at the beginning and end of each line, respectively. |
`re.DOTALL` | Allows `.` to match any character, including a newline. |
`re.VERBOSE` | Allows the usage of comments; everything from a `#` to the end of the line is ignored. This flag also ignores non-escaped whitespace (i.e., whitespace without a preceding `\`). This improves the readability of REs. |
To combine flags, separate them with a vertical bar (the bitwise OR operator, which we did not cover in our discussion of operators). Note that with `re.VERBOSE`, a space in the pattern must be escaped if it is meant to be matched. E.g.,
my_regex_query = re.compile(r"hitch\ *hiker", re.IGNORECASE | re.VERBOSE)
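The sketch below illustrates what combining `re.IGNORECASE` and `re.VERBOSE` might look like in practice. The test string is made up for illustration. Because `re.VERBOSE` ignores unescaped whitespace, the space between the two words is written as an escaped space.

```python
import re

# With re.VERBOSE, unescaped whitespace and '#' comments inside the pattern
# are ignored, so the pattern can be spread over several annotated lines.
regex = re.compile(
    r"""
    hitch   # first half of the word
    \ *     # zero or more literal spaces (note the escaped space)
    hiker   # second half of the word
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Hypothetical test string
hits = regex.findall("Hitch Hiker, hitchhiker, HITCH  HIKER, hitch-hiker")
print(hits)
```

The hyphenated spelling is not matched, since the pattern only allows spaces between the two halves.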
In the command line example using `grep`, we got acquainted with a few metacharacters that help us improve our pattern matching. Let's have a look at all RE metacharacters that can be used in Python:
Metacharacter | Description |
---|---|
`.` | (dot) The ultimate wildcard. It matches any character other than the newline character (`\n`). If it is desirable to also match `\n`, the alternative mode (`re.DOTALL`) can be invoked. |
`^` | (caret) Matches the start of the string. With `re.MULTILINE`, it also matches the position immediately after each newline character. |
`$` | Similar to the caret, but for the end of the string (or just before a newline at the end of the string). With `re.MULTILINE`, it also matches the position just before each newline character. |
`*` | The Kleene star `*` following an RE allows zero or more repetitions of that expression. `ab*c` will match `ac`, `abc`, `abbc`, `abbbc`, ... |
`+` | Similar to the Kleene star, but it matches one or more occurrences of the preceding RE; thus `ab+c` matches `abc`, `abbc`, `abbbc`, but not `ac`. |
`?` | Matches zero or one repetition of the preceding RE. `ab?` matches `a` and `ab`. |
`{m}` | Matches exactly m repeats. `a{3}` matches `aaa`. |
`{m,n}` | Matches m to n repeats; `a{2,4}` matches `aa`, `aaa`, and `aaaa`. The lower and upper bounds are optional: `a{,4}` is the same as `a{0,4}`. Omitting the upper bound, `a{4,}` matches four or more repetitions of `a`. |
`[]` | Square brackets are used to describe a set of characters, e.g., `[atcg]` matches `a`, `t`, `c`, or `g`, and `[a-z]` matches any lowercase ASCII letter. `[0-2][0-9]` matches all numbers from 00 to 29. It is important to note that most metacharacters lose their special function within sets. Thus, `[(a*b+)]` matches `(`, `a`, `*`, `b`, `+`, and `)`. |
`\` | The escape character `\` makes sure that the following metacharacter is interpreted literally. The Kleene star (`*`), for example, will be interpreted as a simple asterisk if prefaced by the escape character (`\*`). |
`\|` | Logical OR. |
`(...)` | Matches whatever regular expression is inside the parentheses. As we discussed, these serve to describe groupings. |
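A few of the metacharacters from the table can be tried out on short strings. This is only a sketch; all test strings are made up for illustration.

```python
import re

# Quantifiers: test which candidates are matched in full by each pattern
candidates = ["ac", "abc", "abbc"]
star = [s for s in candidates if re.fullmatch("ab*c", s)]      # zero or more b
plus = [s for s in candidates if re.fullmatch("ab+c", s)]      # one or more b
optional = [s for s in candidates if re.fullmatch("ab?c", s)]  # zero or one b
print(star, plus, optional)

# Character sets and ranges
dna_runs = re.findall("[atcg]+", "xxatcgxxgattacaxx")
low_numbers = re.findall("[0-2][0-9]", "07 18 29 35")
print(dna_runs, low_numbers)

# '.' does not match a newline unless re.DOTALL is given
without_dotall = re.findall("a.b", "a\nb a-b")
with_dotall = re.findall("a.b", "a\nb a-b", re.DOTALL)
print(without_dotall, with_dotall)
```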
Once we are happy with our compiled RE, we can deploy it in a number of ways. Upon compilation, we have created a compiled pattern object (`re.Pattern`; displayed as `SRE_Pattern` in older versions of Python) that has methods for searching strings. In the table below, we will assume the compiled object is called `regex`.
action | Description |
---|---|
`regex.search(string)` | Scans through the string and returns a match object for the first match, or `None` if there is no match. |
`regex.match(string)` | Returns a match object if zero or more characters at the beginning of the string match the RE; otherwise `None` is returned. |
`regex.fullmatch(string)` | Returns a match object if the whole string matches the RE; otherwise `None` is returned. |
`regex.findall(string)` | Returns a list of all non-overlapping matches in the string. If the RE contains groups, the list holds the group contents instead: strings for a single group, tuples for several groups. |
`regex.finditer(string)` | Same as `regex.findall()`, except that it returns an iterator that yields match objects instead of a list. |
`regex.split(string, maxsplit=0)` | Splits the string into a list at the occurrences of the pattern; see the example below. |
Flags are set when the RE is compiled, so these methods take no `flags` argument; only the module-level functions such as `re.search(pattern, string, flags=0)` accept one.
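The following sketch runs each of these methods on one short, made-up string so that the differences between them are easier to see.

```python
import re

regex = re.compile("[a-z]+")
text = "12 abc 345 de"   # hypothetical sample string

# match() only looks at the beginning of the string
at_start = regex.match(text)
print(at_start)

# search() scans the whole string for the first match
first = regex.search(text)
print(first.group(), first.span())

# fullmatch() requires the entire string to match
whole = regex.fullmatch("abc")
print(whole is not None)

# findall() returns strings, finditer() yields match objects
words = regex.findall(text)
spans = [m.span() for m in regex.finditer(text)]
print(words, spans)

# split() cuts the string at every match of the pattern
pieces = re.compile(r"\s+").split(text)
print(pieces)
```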
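The quantifiers `+`, `*`, and `?` match as much text as possible. This behavior is referred to as greedy. Adding a `?` after these quantifiers renders them non-greedy, so that they match as little text as possible. For example, applying the RE `(K.*F)` to this amino acid sequence:
MKKSLVFAFFAFFLSL
yields:
KKSLVFAFFAFF
whereas `(K.*?F)` would yield:
KKSLVF
In both cases the match begins at the first K; only the end of the match moves, from the last possible F to the first possible F.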
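The greedy and non-greedy behavior can be checked directly with `re.search()` on the sequence above. Note that a regex search always reports the leftmost match, so the non-greedy version still starts at the first K in the sequence.

```python
import re

seq = "MKKSLVFAFFAFFLSL"

# Greedy: '.*' extends to the last F that still allows a match
greedy = re.search("(K.*F)", seq).group(1)

# Non-greedy: '.*?' stops at the first F after the starting K
lazy = re.search("(K.*?F)", seq).group(1)

print(greedy)
print(lazy)
```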
Let's have a look at how a word is defined before having a look at more escape options. A word is a sequence of Unicode alphanumeric or underscore characters. Examples are:
Hello_world
P4ssw0rd
A regex returns exactly the text that its pattern matches, so to work with whole words we need a way to refer to word characters and word boundaries. The escape sequences below provide this, along with a few other shorthands.
Escape sequence | Description |
---|---|
`\number` | Matches the contents of the group with that number again. For example, applying `(.+) \1` to the string Homo sapiens sapiens returns sapiens sapiens. |
`\A` | Matches only at the start of the string. |
`\b` | Matches the empty string at the beginning or end of a word. Thus using `towel\b` in the example above would only yield lines with the word towel and not towels. |
`\B` | Opposite of `\b`. Thus `towel\B` would yield towels. |
`\d` | Matches any Unicode decimal digit. |
`\D` | Opposite of `\d` (Are you seeing a pattern?) |
`\s` | Matches whitespace characters. |
`\S` | Any guess? |
`\w` | Matches Unicode word characters. |
`\W` | Your turn again: |
`\Z` | Matches only at the end of the string. |
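Here is a brief sketch of some of these escape sequences in action. The sentence about towels is made up for illustration; the other strings come from the examples above.

```python
import re

# Hypothetical sample sentence
text = "A towel, two towels, and 42 more towels."

# \b: 'towel' must end at a word boundary, so 'towels' is excluded
standalone = re.findall(r"towel\b", text)

# \B: 'towel' must be followed by another word character, as in 'towels'
inside_word = re.findall(r"towel\B", text)

# \d and \w
digits = re.findall(r"\d+", text)
words = re.findall(r"\w+", "Hello_world P4ssw0rd!")

# \1 matches the same text that group 1 matched
repeated = re.search(r"(.+) \1", "Homo sapiens sapiens").group()

print(standalone, inside_word, digits, words, repeated)
```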
With the help of `?` we can expand the functionality of parentheses. The general syntax is `(?...)`. We will not get into these here, but the table below gives a summary, and more detail can be found in the `re` module documentation.
`(?...)` | Description |
---|---|
`(?aiLmsux)` | One or more letters from the set a, i, L, m, s, u, x; each letter sets the corresponding flag for the RE. |
`(?:...)` | Non-capturing version of the regular parentheses. |
`(?P<name>...)` | The matched string is accessible by the symbolic group name name. |
`(?P=name)` | Matches the same text that was matched by the group defined with `(?P<name>...)`. |
`(?#...)` | A comment; contents are ignored. |
`(?=...)` | A positive lookahead assertion: matches if `...` matches next, without consuming any of the string. |
`(?!...)` | Opposite of `(?=...)`: a negative lookahead assertion. |
`(?<=...)` | A positive lookbehind assertion. For example, applying the RE `(?<=[HKRED])[A-Z]` to an amino acid sequence yields the residues that are preceded by a charged residue. |
`(?<!...)` | Opposite of `(?<=...)` (another pattern?) |
`(?(id/name)yes-pattern\|no-pattern)` | Matches with yes-pattern if the group with the given id or name exists, and with no-pattern if it doesn't. The latter is optional. |
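To give a flavor of these extensions, here is a minimal sketch using a named group, a non-capturing group, a lookbehind, and a negative lookahead. The group names `genus` and `species` and the short test strings are chosen for illustration.

```python
import re

# Named groups: (?P<name>...) lets us retrieve a group by name
m = re.search(r"\[(?P<genus>\w+) (?P<species>\w+)\]",
              "protein [Campylobacter jejuni]")
print(m.group("genus"), m.group("species"))

# Non-capturing group: findall returns whole matches, not the H/h alone
whole_words = re.findall(r"(?:H|h)itch", "Hitch hitch")
print(whole_words)

seq = "MKKSLVFAFFAFFLSL"

# Positive lookbehind: residues that directly follow a charged residue
after_charged = re.findall(r"(?<=[HKRED])[A-Z]", seq)
print(after_charged)

# Negative lookahead: positions of a K that is not followed by another K
lone_k = [hit.start() for hit in re.finditer(r"K(?!K)", seq)]
print(lone_k)
```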
If you are not going to use your compiled regex over and over again (you often will), you may want to use some of the functions in the `re` module. For example, if we wanted to find all occurrences of a pattern in a string, we have learned to do this:
regex = re.compile(pattern, flags=0)
regex.findall(string)
We could equivalently do this:
re.findall(pattern, string, flags=0)
They give exactly the same results.
# Set up pattern and compile it
pattern = '.*hitch *hiker.*'
regex = re.compile(pattern, flags=re.IGNORECASE)
# Precompiled one
print('regex.findall(hh_string):')
print(regex.findall(hh_string))
# Non-precompiled performance
print('\nre.findall(pattern, hh_string, flags=re.IGNORECASE):')
print(re.findall(pattern, hh_string, flags=re.IGNORECASE))
As an example use of regular expressions, we will parse a GenBank entry. The file `genbank_seq.txt` contains the sequence portion of a GenBank entry.
# On my machine it is stored at ../data/genbank_seq.txt.
!cat ../data/genbank_seq.txt
We have a sequence of amino acids, but it is broken down into groups of 10 residues and the beginning of each line is annotated. Note that the last group only has nine residues. We would like to get a string that contains only the sequence information. In this case, we just want the strings that match letters.
# Read in the string from the file
with open("../data/genbank_seq.txt", "r") as myfile:
    gb_input = myfile.read()
# Compile our regex search
aa = re.compile('([a-z]+)')
# Get a list of all the segments
list_of_seq_segments = aa.findall(gb_input)
print(list_of_seq_segments)
We can now join all of the items in this list into a single string to get the full amino acid sequence.
# combine the strings in the list to a single string
seq = ''.join(list_of_seq_segments)
seq
Another strategy would be to split the string at whitespace.
# Split string at whitespaces
temp = re.split(r'\s+', gb_input)
print(temp)
Admittedly, this is not the most efficient way, but it illustrates the power of `re.split()`. Splitting at whitespace is commonly done when parsing other types of text files. By combining the relevant list elements we get the sequence. We can identify the segments that are sequence elements because they are strings with only letters. We use the `isalpha` method of strings to do this.
# Initialize list of sequence segments
segment_list = []
# Loop through each segment
for s in temp:
    # Keep a string if it consists of letters.
    if s.isalpha():
        segment_list.append(s)
# Join the list to get the resultant string
''.join(segment_list)
And, just to show off, we can do this in a single line. Here, I'm using a list comprehension, which we will not cover in the bootcamp.
''.join([s for s in temp if s.isalpha()])
`match()` and `search()`

Let's have a look at the outputs of `aa.match()` and `aa.search()`. We'll start with `aa.match()`.
match_out = aa.match(gb_input)
print(match_out)
Since the RE doesn't match the beginning of the string, the result is `None`. Let's instead just try to match anything using the `.+` pattern.
# The following RE will yield the first line.
# Adding the flag re.DOTALL would return the entire string
aa_gaps= re.compile('.+')
match_out = aa_gaps.match(gb_input)
print(match_out)
This is a match object (`re.Match`; displayed as `SRE_Match` in older versions of Python). How do we use it? This object has several methods, and we can see what they do by example.
# access the match
match_out.group()
# Access the location of the substring
match_out.span()
Use of `aa.search()` is similar. Remember, `aa.search()` does not just look at the beginning of the string, but scans the entire string for a match.
# Search for the first block of characters
search_out = aa.search(gb_input)
# Show result
print(search_out.group())
print(search_out.span())
Now, we'll use a powerful tool. We'll find and replace using a regex. For this example, we will replace all positively charged residues (arginine (`r`), lysine (`k`), and histidine (`h`)) with a `+` sign. To do this, we use the `sub()` method of a compiled regex.
# Compile our search string
positive_residues = re.compile('[rkh]')
# Find and replace
positive_residues.sub('+', seq)
# In addition to substituting subn also counts the number of substitution.
# It returns a tuple consisting of the modified string and the number of
# substitutions
pos_seq, n_pos = positive_residues.subn('+', seq)
# Print the result
print("The following sequence contains", n_pos,
"positively charged residues:\n")
print(pos_seq)
The previous examples illustrate that REs are a powerful tool for parsing data. Moreover, they can be of great use in analysing sequences.
The file `aligned.fasta` contains seven aligned sequences in FASTA format. Let's have a look:
!cat ../data/aligned.fasta
Each entry starts with a line beginning with a `>` sign that describes the source of the sequence, and then the aligned sequence of amino acids follows. As we see, the first sequence is described by
>gi|488942278|ref|WP_002853353.1|:1-279 glutamine ABC transporter substrate-binding protein [Campylobacter jejuni]
It is often convenient to just have a short species identifier instead of the full description. So, we would replace the above description with
>C. jejuni
To do this, we need to do the following steps.
# read the aligned fasta file
with open("../data/aligned.fasta", "r") as myfile:
    aln = myfile.read()
# add an extra delimiter to enable splitting
aln = re.sub('>', 'delimiter>', aln)
# Use re.split to create a list of fasta entries.
aln = re.split('delimiter', aln)
# remove the first empty string from the list, since it got split
aln = aln[1:]
# RE to find the organism name: look for text in brackets and make
# convenient groups for parsing genus and species
get_organism = re.compile(r'\[(\w+) (\w+)\]')
# RE that defines the line to be edited
def_line = re.compile('(>.*)')
# A list that we'll concatenate into the file
edited_list = []
for fasta in aln:
    # Get the organism name
    org = get_organism.search(fasta)
    # Abbreviate the organism's name in the desired form
    abbr = '>' + org.group(1)[0] + '. ' + org.group(2)
    # Replace the original sequence description with abbreviated organism name.
    out = def_line.sub(abbr, fasta)
    # Add our new string to the list
    edited_list.append(out)
# Write out the result
with open('edited_fasta.fasta', 'w') as outfile:
    outfile.write(''.join(edited_list))
Let's see how we did!
!cat edited_fasta.fasta
Success!
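As an aside, the same relabeling could probably be done in a single substitution by passing a function to `sub()` and using `re.MULTILINE` so that `^` and `$` match at line boundaries. The sketch below is one possible variant, not the approach used above. It works on a small in-memory string in which the second header and both sequences are made up, and the helper name `abbreviate` is hypothetical.

```python
import re

# Hypothetical two-entry alignment: the first header is the one shown above,
# the second header and both sequences are invented for illustration.
fasta_text = (
    ">gi|488942278|ref|WP_002853353.1|:1-279 glutamine ABC transporter "
    "substrate-binding protein [Campylobacter jejuni]\n"
    "MKKSLVFAFF\n"
    ">hypothetical_id some protein [Escherichia coli]\n"
    "MKSVLKVSLA\n"
)

# Match a whole definition line and capture genus and species in the brackets.
# With re.MULTILINE, ^ and $ match at the start and end of each line.
def_line = re.compile(r"^>.*\[(\w+) (\w+)\].*$", re.MULTILINE)


def abbreviate(match):
    """Build '>G. species' from the captured genus and species."""
    return ">" + match.group(1)[0] + ". " + match.group(2)


# sub() calls abbreviate() once per matching definition line
edited = def_line.sub(abbreviate, fasta_text)
print(edited)
```

Using a function as the replacement avoids the temporary delimiter and the explicit loop, at the cost of a slightly denser pattern.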
Even though we've just scratched the surface, I hope this tutorial has shown how powerful you can be if you have a command of regular expressions. The learning curve for these things is steep, but once you've used them for a while, they become second nature and parsing text is not a daunting task!