Lesson 8: String methods

(c) 2019 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This lesson was generated from a Jupyter notebook. You can download the notebook here.



In the last tutorial, we wrote some functions to parse strings and compute things like reverse complements. This helped us practice using functions and the iteration skills we learned.

You might think, "Hey, replacing characters in strings sounds like it may be pretty common." You'd be right. You might also think, "I bet someone, possibly someone who is a really good programmer, already has written code to do this." You would also be right.

For common tasks, there are often already methods written by someone smart, and working with strings is no different. In this lesson, we will explore some of the string processing tools that come with Python's standard library.

Indexing and slicing of strings

Before getting into string methods, we pause to note that indexing and slicing of strings work just as they do for lists and tuples.

In [1]:
my_str = 'The Dude abides.'

print(my_str[5])
print(my_str[:6])
print(my_str[::2])
print(my_str[::-1])
u
The Du
TeDd bds
.sediba eduD ehT
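
As a small extension of the cell above, here is a sketch of negative indices and slices with explicit bounds on the same string. The variable names last_char, middle_word, and last_word are only illustrative. The last few lines show that assigning to an index fails, because strings are immutable.

```python
my_str = 'The Dude abides.'

# Negative indices count from the end of the string
last_char = my_str[-1]

# Slices with explicit start and stop (the stop index is exclusive)
middle_word = my_str[4:8]
last_word = my_str[-7:]

print(last_char)
print(middle_word)
print(last_word)

# Strings are immutable, so item assignment raises a TypeError
try:
    my_str[0] = 't'
except TypeError as err:
    print('Cannot modify a string in place:', err)
```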

Revisiting previous examples using string methods

We'll start by revisiting some of the examples we've seen so far.

Computing GC content

If you remember from the iteration lesson, we started by computing the GC content of a nucleic acid sequence. We counted the occurrences of 'G' and 'C' in the string using a for loop. We can use the count() string method to do this.

In [2]:
# Define sequence
seq = 'GACAGACUCCAUGCACGUGGGUAUCAUGUC'

# Count G's and C's
seq.count('G') + seq.count('C')
Out[2]:
16

The seq.count() method enabled us to count the number of times G and C occurred in the string seq. This notation is new. We have a variable, followed by a dot (.), and then a function. These functions are called methods in the language of object-oriented programming (OOP). If you have a string my_str, and you want to execute one of Python's built-in string methods on it, the syntax is

my_str.string_method_of_choice(*args)
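
As one instance of this syntax, the following sketch wraps the two count() calls in a function that returns the GC content as a fraction of the sequence length. The function name gc_content and the choice to return 0.0 for an empty sequence are assumptions made for illustration, not part of the earlier lessons.

```python
def gc_content(seq):
    """Return the fraction of characters in seq that are 'G' or 'C'."""
    # Guard against division by zero for an empty sequence (an assumed convention)
    if len(seq) == 0:
        return 0.0

    # count() is called on the string with the variable.method(argument) syntax
    return (seq.count('G') + seq.count('C')) / len(seq)


print(gc_content('GACAGACUCCAUGCACGUGGGUAUCAUGUC'))
```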

In general, the count method gives the number of times a substring appears in a string. We can learn more about its behavior by playing with it.

In [3]:
# Substrings of more than one character
seq.count('GAC')
Out[3]:
2
In [4]:
# Substrings cannot overlap
'AAAAAAA'.count('AA')
Out[4]:
3
In [5]:
# Something that's not there.
seq.count('nonsense')
Out[5]:
0
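
Because count() only counts non-overlapping matches, you may sometimes want the overlapping count instead. The sketch below is one way to get it, using the startswith() string method with a start index. The helper name count_overlapping is hypothetical.

```python
def count_overlapping(seq, sub):
    """Count occurrences of sub in seq, allowing matches to overlap."""
    n = 0
    for i in range(len(seq) - len(sub) + 1):
        # startswith(sub, i) checks whether sub occurs starting at index i
        if seq.startswith(sub, i):
            n += 1
    return n


# Non-overlapping count from the built-in method
print('AAAAAAA'.count('AA'))

# Overlapping count from the helper above
print(count_overlapping('AAAAAAA', 'AA'))
```

For 'AAAAAAA', the built-in method reports 3 matches while the overlapping count is 6, since a match can start at each of the first six positions.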

Finding the index of a start codon

A later task in the iteration lesson was to find the index of the start codon in an RNA sequence. Let's do it with another string method.

In [6]:
seq.find('AUG')
Out[6]:
10

Wow, that was easy. The find() method gives the index where the substring argument first appears. But, what if a substring is not in the string?

In [7]:
seq.find('nonsense')
Out[7]:
-1

In this case, find() returns -1. This is not to be interpreted as index -1! find() always returns a non-negative index if it finds the substring (0 if the substring is at the very start of the string). Note that you should not use find() to test whether a substring is present. Use the in operator we already learned about.

In [8]:
'AUG' in seq
Out[8]:
True
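
To make the -1 pitfall concrete, here is a minimal sketch. The helper start_codon_index is a hypothetical name, and returning None when no start codon is present is just one possible convention.

```python
seq = 'GACAGACUCCAUGCACGUGGGUAUCAUGUC'


def start_codon_index(seq):
    """Return the index of the first 'AUG' in seq, or None if there is none."""
    ind = seq.find('AUG')

    # -1 is a "not found" flag, not a usable index
    if ind == -1:
        return None
    return ind


print(start_codon_index(seq))
print(start_codon_index('GGGCCC'))

# Pitfall: using the -1 flag as an index silently gives the last character
print(seq[seq.find('nonsense')])
```

The related index() string method behaves like find(), except that it raises a ValueError when the substring is absent, which can be a safer choice if a missing substring indicates a mistake.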

Finding the last index of a substring

Let's say we wanted to find the last instance of the start codon. We basically want to search from the right. This is exactly what the rfind() method does.

In [9]:
seq.rfind('AUG')
Out[9]:
25
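
If we want every occurrence and not just the first or last, one possible approach is to call find() repeatedly, using its optional second argument to say where the search should start. The helper name find_all below is hypothetical. Note that it also reports overlapping matches.

```python
def find_all(seq, sub):
    """Return a list of all indices at which sub starts in seq."""
    indices = []
    ind = seq.find(sub)
    while ind != -1:
        indices.append(ind)
        # Resume the search one position past the previous hit
        ind = seq.find(sub, ind + 1)
    return indices


seq = 'GACAGACUCCAUGCACGUGGGUAUCAUGUC'
print(find_all(seq, 'AUG'))
```

The first and last entries of the result agree with what find() and rfind() gave above.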

Finding the complementary base

In our tutorial on functions, we wrote a function to compute a complementary base comparing against both the capital and lowercase letter. Here is that function implemented with some handy string methods.

In [10]:
def complement_base(base):
    """Returns the Watson-Crick complement of a base."""
    # Convert to lowercase
    base = base.lower()
    
    if base == 'a':
        return 'T'
    elif base == 't':
        return 'A'
    elif base == 'g':
        return 'C'
    else:
        return 'G'

We were able to avoid all the "base in 'Tt'"-style operations by just converting the base to lowercase using the lower() method. In general, the lower() method takes a string and converts any capital letters to lowercase. The upper() method works analogously.

In [11]:
'LeBron James'.lower()
Out[11]:
'lebron james'
In [12]:
'Make me aLl caPS.'.upper()
Out[12]:
'MAKE ME ALL CAPS.'
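
One thing to be aware of in complement_base() above is that its else branch returns 'G' for any input that is not a, t, or g, including characters that are not bases at all. The following is a sketch of a stricter variant. The name complement_base_checked and the decision to raise a ValueError are assumptions for illustration.

```python
def complement_base_checked(base):
    """Return the Watson-Crick complement of a base, rejecting other input."""
    # Convert to lowercase so 'A' and 'a' are treated the same
    base = base.lower()

    if base == 'a':
        return 'T'
    elif base == 't':
        return 'A'
    elif base == 'g':
        return 'C'
    elif base == 'c':
        return 'G'
    else:
        raise ValueError('Unrecognized base: ' + base)


print(complement_base_checked('g'))
print(complement_base_checked('C'))
```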

Converting RNA to DNA

We also updated the complementary base function to account for RNA or DNA. Perhaps an easier way is just to replace all Us in an RNA sequence with Ts to get a DNA sequence. The replace() method makes this easy.

In [13]:
seq.replace('U', 'T')
Out[13]:
'GACAGACTCCATGCACGTGGGTATCATGTC'

Note that seq did not change. Remember, strings are immutable, so the replace() method returns a new string, as do lower(), upper(), and any other string method that returns a string. So, the characters stored in the variable seq are unchanged.

In [14]:
seq
Out[14]:
'GACAGACUCCAUGCACGUGGGUAUCAUGUC'
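
To keep the result of a string method, assign it to a variable. Since each of these methods returns a new string, the calls can also be chained. The variable names dna and dna_lower below are only illustrative.

```python
seq = 'GACAGACUCCAUGCACGUGGGUAUCAUGUC'

# Store the new string returned by replace(); seq itself is untouched
dna = seq.replace('U', 'T')

# Chain methods: replace() returns a string, on which lower() is then called
dna_lower = seq.replace('U', 'T').lower()

print(seq)
print(dna)
print(dna_lower)
```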

The join() method

One of the most useful string methods is the join() method. Say we have a tuple of words that we want to craft into a sentence.

In [15]:
word_tuple = ('The', 'Dude', 'abides.')

Now, we would like to concatenate them into a single string. (This is sort of like the opposite of taking a string and making a list of its characters by doing a list() type conversion.) We need to know what we want to put between each word. In this case, we want a space. Here's the nifty syntax to do that.

In [16]:
' '.join(word_tuple)
Out[16]:
'The Dude abides.'

We now have a single string with the elements of the tuple, separated by spaces. The string before the dot (.) specifies what goes between the strings in the list or tuple (or other iterable). If we wanted "*" between each word, we could do that, too.

In [17]:
' * '.join(word_tuple)
Out[17]:
'The * Dude * abides.'
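
Two details are worth a quick sketch. First, join() only accepts strings, so other types such as integers need to be converted with str() first. Second, the split() string method goes roughly the other way, breaking a string into a list at a given separator. The variable names here are illustrative.

```python
# join() requires strings, so convert the integers first
codon_indices = [10, 25]
index_str = ', '.join([str(ind) for ind in codon_indices])
print(index_str)

# split() undoes a join() when the separator does not appear in the words
sentence = ' '.join(('The', 'Dude', 'abides.'))
words = sentence.split(' ')
print(words)

# A string is itself an iterable of characters, so it can be joined too
print('-'.join('AUG'))
```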

The format() method

The format() method is very powerful. We will not go over all use cases here, but I'll show you what I think is most intuitive and commonly used. Again, this is best learned by example.

In [18]:
my_str = """
Let's do a Mad Lib!
During this bootcamp, I feel {adjective}.
The instructors give us {plural_noun}.
""".format(adjective='truculent', plural_noun='haircuts')

print(my_str)
Let's do a Mad Lib!
During this bootcamp, I feel truculent.
The instructors give us haircuts.

See the pattern? Given a string, the format() method takes kwargs, which here are themselves strings. Within the string, the names of the kwargs are given in braces. The format() method then inserts the value of each kwarg at the place delimited by the matching braces.

Now, what if we want to insert a number into a string? We could convert it to a string, but we should instead use string conversions. These are short directives that specify how the number should be represented in a string. A complete list is here. The table below shows some that are commonly used.

conversion   description
d            integer
04d          integer with four digits, possibly with leading zeros
f            float, default to six digits after decimal
.8f          float with 8 digits after the decimal
e            scientific notation, default to six digits after decimal
.16e         scientific notation with 16 digits after the decimal
s            display as a string

Below are examples of all of these.

In [19]:
print('There are {n:d} states in the US.'.format(n=50))
print('Your file number is {n:04d}.'.format(n=23))
print('π is approximately {pi:f}.'.format(pi=3.14))
print('e is approximately {e:.8f}.'.format(e=2.7182818284590451))
print("Avogadro's number is approximately {N_A:e}.".format(N_A=6.022e23))
print('ε₀ is approximately {eps_0:.16e} F/m.'.format(eps_0=8.854187817e-12))
print('That {thing:s} really tied the room together.'.format(thing='rug'))
There are 50 states in the US.
Your file number is 0023.
π is approximately 3.140000.
e is approximately 2.71828183.
Avogadro's number is approximately 6.022000e+23.
ε₀ is approximately 8.8541878170000005e-12 F/m.
That rug really tied the room together.

Note the syntax. In the braces, we specify the name of the kwarg, and then we put a colon followed by the string conversion. Note also that I used double quotes on the outside of the string containing Avogadro's number so that I could include an apostrophe in the string. Finally, note that we got a subscript zero using the Unicode character, ₀.
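
The format() method also accepts positional arguments, which is handy for short strings. In the sketch below, empty braces are filled in order, and numbered braces refer to arguments by position, so an argument can be reused. The variable names and the message text are made up for illustration.

```python
n_gc = 16
n_total = 30

# Empty braces are filled left to right; a conversion can still follow a colon
msg = '{} of {} bases are G or C ({:.1f}%).'.format(
    n_gc, n_total, 100 * n_gc / n_total
)
print(msg)

# Numbered braces refer to positional arguments and may be repeated
pattern = '{0:s}-{1:s}-{0:s}'.format('A', 'U')
print(pattern)
```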

f-strings

f-strings are strings that are prefixed with an f or F that allow convenient insertion of entries into strings. Here are some examples.

In [20]:
n_states = 50
file_number = 23
pi = 3.14
e = 2.7182818284590451
N_A = 6.022e23
eps_0 = 8.854187817e-12
thing = 'rug'

print(f'There are {n_states} states in the US.')
print(f'Your file number is {file_number}.')
print(f'π is approximately {pi}.')
print(f'e is approximately {e:.8f}.')
print(f"Avogadro's number is approximately {N_A}.")
print(f'ε₀ is approximately {eps_0} F/m.')
print(f'That {thing} really tied the room together.')
There are 50 states in the US.
Your file number is 23.
π is approximately 3.14.
e is approximately 2.71828183.
Avogadro's number is approximately 6.022e+23.
ε₀ is approximately 8.854187817e-12 F/m.
That rug really tied the room together.
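
The braces in an f-string can hold any Python expression, not just a variable name, and the same conversions used with format() can follow a colon. f-strings require Python 3.6 or newer. The sketch below reuses some of the variables from above; the strings s1, s2, and s3 are only illustrative.

```python
seq = 'GACAGACUCCAUGCACGUGGGUAUCAUGUC'
file_number = 23
N_A = 6.022e23

# A conversion after the colon: pad the integer to four digits with zeros
s1 = f'Your file number is {file_number:04d}.'

# Scientific notation with two digits after the decimal
s2 = f"Avogadro's number is approximately {N_A:.2e}."

# Expressions, including function and method calls, are evaluated in place
s3 = f'The sequence has {len(seq)} bases, {seq.count("G") + seq.count("C")} of them G or C.'

print(s1)
print(s2)
print(s3)
```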

There are many more string methods

You can find a complete list of string methods in the Python documentation. Various methods will come in handy when parsing strings going forward.

Computing environment

In [21]:
%load_ext watermark
%watermark -v -p jupyterlab
CPython 3.7.3
IPython 7.1.1

jupyterlab 0.35.5