(c) 2016 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This tutorial was generated from a Jupyter notebook. You can download the notebook here.
In the last tutorial, we wrote some functions to parse strings and compute things like reverse complements. This helped us practice using functions and the iteration skills we learned.
You might think, "Hey, replacing characters in strings sounds like it may be pretty common." You'd be right. You might also think, "I bet someone, possibly someone who is a really good programmer, already has written code to do this." You would also be right.
For common tasks, there are often already methods written by someone smart, and working with strings is no different. In this lesson, we will explore some of the string processing tools that come with Python's standard library.
Before getting into string methods, we pause to note that indexing and slicing of strings work just as they do for lists and tuples.
my_str = 'The Dude abides.'
print(my_str[5])
print(my_str[:6])
print(my_str[::2])
print(my_str[::-1])
We'll start by revisiting some of the examples we've seen so far.
If you remember from the iteration lesson, we started by computing the GC content of a nucleic acid sequence. We counted the occurrences of 'G' and 'C' in the string using a for loop. We can use the count() string method to do this.
# Define sequence
seq = 'GACAGACUCCAUGCACGUGGGUAUCAUGUC'
# Count G's and C's
seq.count('G') + seq.count('C')
The seq.count() method enabled us to count the number of times G and C occurred in the string seq. This notation is new. We have a variable, followed by a dot (.), and then a function. These functions are called methods in the language of object-oriented programming (OOP). If you have a string my_str, and you want to execute one of Python's built-in string methods on it, the syntax is
my_str.string_method_of_choice(*args)
In general, the count() method gives the number of times a substring appears in a string. We can learn more about its behavior by playing with it.
# Substrings of more than one character
seq.count('GAC')
# Substrings cannot overlap
'AAAAAAA'.count('AA')
# Something that's not there.
seq.count('nonsense')
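As a minimal sketch, the GC content calculation from the iteration lesson could be wrapped in a small function built on count(). The name gc_content is a hypothetical helper chosen for illustration, and the function assumes an uppercase, nonempty sequence.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C.

    Assumes seq is uppercase and nonempty (an illustrative simplification).
    """
    return (seq.count('G') + seq.count('C')) / len(seq)


seq = 'GACAGACUCCAUGCACGUGGGUAUCAUGUC'
print(gc_content(seq))
```

Because count() does the looping internally, the function body is a single expression.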
A later task in the iteration lesson was to find the index of the start codon in an RNA sequence. Let's do it with another string method.
seq.find('AUG')
Wow, that was easy. The find() method gives the index where the substring argument first appears. But what if a substring is not in the string?
seq.find('nonsense')
In this case, find() returns -1. This is not to be interpreted as index -1! When the substring is present, find() always returns a nonnegative index (possibly 0), so -1 unambiguously signals that the substring was not found. Note that you should not use find() to test if a substring is present. Use the in operator we already learned about.
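To see why in is the safer test, here is a short sketch comparing the two approaches. The variable names are chosen only for illustration. The pitfall is that a match at index 0 is falsy, while the "not found" value -1 is truthy.

```python
seq = 'GACAGACUCCAUGCACGUGGGUAUCAUGUC'

# Preferred: the in operator gives a Boolean directly
has_start_codon = 'AUG' in seq
has_nonsense = 'nonsense' in seq

# Pitfall: the return value of find() is misleading as a truth value
found_at_start = seq.find('GAC')    # substring sits at the very beginning
not_found = seq.find('nonsense')    # substring is absent

print(has_start_codon, has_nonsense)
print(bool(found_at_start), bool(not_found))
```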
Let's say we wanted to find the last instance of the start codon. We basically want to search from the right.
seq.rfind('AUG')
This is exactly what the rfind() method does.
In our tutorial on functions, we wrote a function to compute a complementary base comparing against both the capital and lowercase letter. Here is that function implemented with some handy string methods.
def complement_base(base):
    """Returns the Watson-Crick complement of a base."""
    # Convert to lowercase
    base = base.lower()

    if base == 'a':
        return 'T'
    elif base == 't':
        return 'A'
    elif base == 'g':
        return 'C'
    else:
        return 'G'
We were able to avoid all the or operations by just converting the base to lowercase using the lower() method. In general, the lower() method takes a string and converts any capital letters to lowercase. The upper() method works analogously.
'LeBron James'.lower()
'Make me aLl caPS.'.upper()
We also updated the complementary base function to account for RNA or DNA. Perhaps an easier way is just to replace all Us in an RNA sequence with Ts to get a DNA sequence. The replace() method makes this easy.
seq.replace('U', 'T')
Note that seq did not change. Remember, strings are immutable, so the replace() method returns a new string, as do lower(), upper(), and any other string method that returns a string.
seq
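To keep the result, we have to assign the returned string to a variable. Here is a small sketch, where dna_seq is a name chosen only for illustration.

```python
seq = 'GACAGACUCCAUGCACGUGGGUAUCAUGUC'

# replace() returns a new string; store it under a new name
dna_seq = seq.replace('U', 'T')

print(seq)      # the original RNA sequence, unchanged
print(dna_seq)  # the new DNA sequence
```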
The join() method
One of the most useful string methods is the join() method. Say we have a tuple of words that we want to craft into a sentence.
word_tuple = ('The', 'Dude', 'abides.')
Now, we would like to concatenate them into a single string. (This is sort of like the opposite of taking a string and making a list of its characters by doing a list() type conversion.) We need to know what we want to put between each word. In this case, we want a space. Here's the nifty syntax to do that.
' '.join(word_tuple)
We now have a single string with the elements of the tuple, separated by spaces. The string before the dot (.) specifies what goes between the strings in the list or tuple (or other iterable). If we wanted " * " between each word, we could do that, too.
' * '.join(word_tuple)
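Joining on an empty string is also handy for assembling a sequence from individual characters. As a sketch, a reverse complement could be built this way using the complement_base() function from earlier in this lesson (repeated here so the example is self-contained). The function reverse_complement is written for illustration and is one of several reasonable ways to do this.

```python
def complement_base(base):
    """Returns the Watson-Crick complement of a base."""
    base = base.lower()

    if base == 'a':
        return 'T'
    elif base == 't':
        return 'A'
    elif base == 'g':
        return 'C'
    else:
        return 'G'


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (illustrative sketch)."""
    # Walk the sequence backwards, complement each base,
    # and join the results with nothing in between
    return ''.join(complement_base(base) for base in seq[::-1])


print(reverse_complement('ATGC'))
```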
The format() method
The format() method is very powerful. We will not go over all use cases here, but I'll show you what I think is most intuitive and commonly used. Again, this is best learned by example.
my_str = """
Let's do a Mad Lib!
During this bootcamp, I feel {adjective}.
The instructors give us {plural_noun}.
""".format(adjective='truculent', plural_noun='haircuts')
print(my_str)
See the pattern? Given a string, the format() method takes kwargs that are themselves strings. Within the string, the names of the kwargs are given in braces. Then, the format() method inserts the values of its arguments at the places delimited by braces.
Now, what if we want to insert a number into a string? We could convert it to a string, but we should instead use string conversions. These are short directives that specify how the number should be represented in a string. A complete list is here. The table below shows some that are commonly used.
conversion | description |
---|---|
d | integer |
04d | integer with four digits, possibly with leading zeros |
f | float, default to six digits after decimal |
.8f | float with 8 digits after the decimal |
e | scientific notation, default to six digits after decimal |
.16e | scientific notation with 16 digits after the decimal |
s | display as a string |
Below are examples of all of these.
print('There are {n:d} states.'.format(n=50))
print('Your file number is {n:04d}.'.format(n=23))
print('π is approximately {pi:f}.'.format(pi=3.14))
print('e is approximately {e:.8f}.'.format(e=2.7182818284590451))
print("Avogadro's number is approximately {N_A:e}.".format(N_A=6.022e23))
print('ε_0 is approximately {eps_0:.16e} F/m.'.format(eps_0=8.854187817e-12))
print('That {thing:s} really tied the room together.'.format(thing='rug'))
Note the syntax. In the braces, we specify the name of the kwarg, and then we put a colon followed by the string conversion.
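As a quick check of how a few of these conversions render, here is a small sketch that stores the formatted strings in variables. The file name pattern and the variable names are hypothetical, chosen only for illustration.

```python
# Zero-padded integer, e.g., for numbering data files
file_name = 'data_{n:04d}.txt'.format(n=23)

# Float with the default six digits after the decimal
pi_str = '{pi:f}'.format(pi=3.14)

# Scientific notation with the default six digits after the decimal
avogadro_str = '{N_A:e}'.format(N_A=6.022e23)

print(file_name)
print(pi_str)
print(avogadro_str)
```

Zero padding like this keeps numbered file names in the right order when they are sorted alphabetically.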
You can find a complete list of string methods from the Python doc pages. Various methods will come in handy when parsing strings going forward.