{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 6: Iteration\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We often want a computer program to do a similar operation many times. For example, we may want to analyze many codons, one after another, and find the start and stop codons in order to determine the length of the open reading frame. Or, as a simpler example, we may wish to find the GC content of a specific sequence. We check each base to see if it is `G` or `C` and keep a count. Doing these repeated operations in a program is called **iteration**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introducing the for loop\n", "\n", "The first method of iteration we will discuss is the **for loop**. As an example of its use, we compute the GC content of an RNA sequence." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5517241379310345" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The sequence we want to analyze\n", "seq = 'GACAGACUCCAUGCACGUGGGUAUCUGUC'\n", "\n", "# Initialize GC counter\n", "n_gc = 0\n", "\n", "# Initialize sequence length\n", "len_seq = 0\n", "\n", "# Loop through sequence and count G's and C's\n", "for base in seq:\n", " len_seq += 1\n", " if base in 'GCgc':\n", " n_gc += 1\n", " \n", "# Divide to get GC content\n", "n_gc / len_seq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look carefully at what we did here. We took a string containing a sequence of nucleotides and then we did something for each character (base) in that string (nucleic acid sequence). A string is a ***sequence*** in the sense of the programming language as well; just like a list or tuple, the string is an ordered collection of characters. (So as not to confuse between biological sequences and *sequences* as a part of the Python language, we will always write the latter in italics.)\n", "\n", "Now, let's translate the new syntax in the above code to English.\n", "\n", "**Python**: `for base in seq`:\n", "\n", "**English**: for each character in the string whose variable name is `seq`, do the following, with that character taking the name `base`\n", "\n", "This exposes a general way of doing things repeatedly in Python. For every item in a _sequence_, we do something. That something follows the `for` clause and is contained in an indentation block. When we do this, we say are \"looping over a *sequence*.\" In the context of a `for` clause, the membership operator, `in`, means that we consider in order each item in the *sequence* or iterator (we'll talk about iterators in a moment).\n", "\n", "Now, looking within the loop, the first thing we do is increment the length of the sequence. For each base we encounter in the loop, we add one to the sequence length. Makes sense!\n", "\n", "Next, we have an `if` statement. We use the membership operator again. We ask if the current base is a `G` or a `C`. To be safe, we also included lower case characters in case the sequence was entered that way. If the base is a `G` or a `C`, we increment the counter of GC bases by one.\n", "\n", "Finally, we get the fractional GC content by dividing the number of GC's by the total length of the sequence." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Aside: the len() function\n", "\n", "Note in the last example that we determined the length of the RNA sequence by iterating the `len_seq` counter within the `for` loop. This works, but Python has a built-in function (like `print()` is a built-in function) to compute the length of a string (or list, tuple, or other *sequence*). To find out the length of a *sequence*, simply use it as an argument to the `len()` function." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "29" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(seq)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example and watchout: modifying a list\n", "\n", "Let's look at another example of iterating through a list. Say we have a list of integers, and we want to change it by doubling each one. Let's just do it." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 2, 3, 4, 5]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We'll do one through 5\n", "my_integers = [1, 2, 3, 4, 5]\n", "\n", "# Double each one\n", "for n in my_integers:\n", " n *= 2\n", " \n", "# Check out the result\n", "my_integers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whoa! It didn't seem to double any of the integers! This is because `my_integers` was converted to an **iterator** in the `for` clause, and the iterator returns a *copy* of the item in a list, not a reference to it. Therefore, the `n` inside the `for` block is not a view into the original list, and doubling it does nothing meaningful.\n", "\n", "We've seen how to change individual list elements with indexing:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[2, 4, 6, 8, 10]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Don't do things this way\n", "my_integers[0] *= 2\n", "my_integers[1] *= 2\n", "my_integers[2] *= 2\n", "my_integers[3] *= 2\n", "my_integers[4] *= 2\n", "\n", "my_integers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But we'd obviously like a better way to do this, with less typing and without knowing ahead of time the length of the list. Let's look at a new concept that will help with this example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Iterators\n", "\n", "In the previous example, we iterated over a *sequence*. A sequence is one of many iterable objects, called **iterables**. Under the hood, the Python interpreter actually converts an iterable to an **iterator**. An iterator is a special object that gives values in succession. A major difference between a *sequence* and an iterator is that you cannot index an iterator. This seems like a trivial difference, but iterators make for more efficient computing than directly using a *sequence* with indexing.\n", "\n", "We can explicitly convert a *sequence* to an iterator using the built-in function `iter()`, but we will not bother with that here because the Python interpreter automatically does this for you when you use a *sequence* in a loop. (This is incidentally why the previous example didn't work; when the list is converted to an iterator, a copy is made of each element so the original list is unchanged.)\n", "\n", "Instead, we will now explore how we can create useful iterators using the `range()`, `enumerate()`, and `zip()` functions. I know we have not yet covered functions, but the syntax should not be so complicated that you cannot understand what these functions are doing, just like with the `print()` and `len()` functions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The range() function\n", "\n", "The `range()` function gives an iterable that enables counting. Let's look at an example." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1 2 3 4 5 6 7 8 9 " ] } ], "source": [ "for i in range(10):\n", " print(i, end=' ')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that `range(10)` gives us ten numbers, from `0` to `9`. As with indexing, `range()` inclusively starts at zero by default, and the ending is exclusive.\n", "\n", "It turns out that the arguments of the `range()` function work much like indexing. If you have a single argument, you get that many integers, starting at 0 and incrementing by one. If you give two arguments, you start inclusively at the first and increment by one ending exclusively at the second argument. Finally, you can specify a stride with the third argument." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2 3 4 5 6 7 8 9 \n", "2 4 6 8 " ] } ], "source": [ "# Print numbers 2 through 9\n", "for i in range(2, 10):\n", " print(i, end=' ')\n", "\n", "# Print a newline\n", "print()\n", " \n", "# Print even numbers, 2 through 9\n", "for i in range(2, 10, 2):\n", " print(i, end=' ')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is often useful to make a list or tuple that has the same entries that a corresponding `range` object would have. We can do this with type conversion." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(range(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the `range()` function along with the `len()` function on lists to double the elements of our list from a bit ago:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[2, 4, 6, 8, 10]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_integers = [1, 2, 3, 4, 5]\n", "\n", "# Since len(my_integers) = 5, this takes i from 0 to 4, \n", "# exactly the indices of my_integers\n", "for i in range(len(my_integers)):\n", " my_integers[i] *= 2\n", " \n", "my_integers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This works, but there is an even better way to do this, with the next function." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The enumerate() function\n", "\n", "Let's say we want to print the indices of all `G` bases in a DNA sequence. We could do this by modifying our previous program." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 4 12 16 18 19 20 26 " ] } ], "source": [ "# Initialize GC counter\n", "n_gc = 0\n", "\n", "# Initialized sequence length\n", "len_seq = 0\n", "\n", "# Loop through sequence and print index of G's\n", "for base in seq:\n", " if base in 'Gg':\n", " print(len_seq, end=' ')\n", " len_seq += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is not so bad, but there is an easier way to do this. The `enumerate()` function gives an iterator that provides *both* the index and the item of a *sequence*. Again, this is best demonstrated in practice." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 4 12 16 18 19 20 26 " ] } ], "source": [ "# Loop through sequence and print index of G's\n", "for i, base in enumerate(seq):\n", " if base in 'Gg':\n", " print(i, end=' ')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `enumerate()` function allowed us to use an index and a base at the same time. To make it more clear, we can print the index and base type for each base in the sequence." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 G\n", "1 A\n", "2 C\n", "3 A\n", "4 G\n", "5 A\n", "6 C\n", "7 U\n", "8 C\n", "9 C\n", "10 A\n", "11 U\n", "12 G\n", "13 C\n", "14 A\n", "15 C\n", "16 G\n", "17 U\n", "18 G\n", "19 G\n", "20 G\n", "21 U\n", "22 A\n", "23 U\n", "24 C\n", "25 U\n", "26 G\n", "27 U\n", "28 C\n" ] } ], "source": [ "# Print index and identity of bases\n", "for i, base in enumerate(seq):\n", " print(i, base)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `enumerate()` function is really useful and should be used in favor of just doing indexing. For example, many programmers, especially those first trained in lower-level languages, would write the above code similar to how we did the list doubling, with the `range()` and `len()` functions, but this is not good practice in Python.\n", "\n", "Using `enumerate()`, the list doubling code looks like this:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[2, 4, 6, 8, 10]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_integers = [1, 2, 3, 4, 5]\n", "\n", "# Double each one\n", "for i, _ in enumerate(my_integers):\n", " my_integers[i] *= 2\n", " \n", "# Check out the result\n", "my_integers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`enumerate()` is more generic and the overhead for returning a reference to an object isn't an issue. The `range(len())` construct will break on an object without support for `len()`. In addition, you are more likely to introduce bugs by imposing indexing on objects that are iterable but not unambiguously indexable. It is better to use the `enumerate()` function.\n", "\n", "Note that we used the underscore, `_`, as a throwaway variable that we do not use. There is no rule for this, but this is generally accepted Python syntax and helps signal that you are not going to use the variable.\n", "\n", "One last gotcha: if we tried to do a similar technique with a string, we get a `TypeError` because a string is immutable. We'll revisit examples like this in [lesson 8](l08_string_methods.ipynb)." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "ename": "TypeError", "evalue": "'str' object does not support item assignment", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mi\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbase\u001b[0m \u001b[0;32min\u001b[0m \u001b[0menumerate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mseq\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mbase\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'G'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m \u001b[0mseq\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mi\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'g'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mTypeError\u001b[0m: 'str' object does not support item assignment" ] } ], "source": [ "# Try to convert capital G to lower g\n", "for i, base in enumerate(seq):\n", " if base == 'G':\n", " seq[i] = 'g'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The zip() function\n", "\n", "The `zip()` function enables us to iterate over several iterables at once. In the example below we iterate over the jersey numbers, names, and positions of the players on the US women's national soccer team who were named to the best eleven of the 2019 World Cup in France." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "19 Dunn D\n", "8 Ertz MF\n", "16 Lavelle MF\n", "15 Rapinoe F\n" ] } ], "source": [ "names = ('Dunn', 'Ertz', 'Lavelle', 'Rapinoe')\n", "positions = ('D', 'MF', 'MF', 'F')\n", "numbers = (19, 8, 16, 15)\n", "\n", "for num, pos, name in zip(numbers, positions, names):\n", " print(num, name, pos)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The reversed() function\n", "\n", "This function is useful for giving an iterator that goes in the reverse direction. We'll see that this can be convenient in the next lesson. For now, let's pretend we're NASA and need to count down." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10\n", "9\n", "8\n", "7\n", "6\n", "5\n", "4\n", "3\n", "2\n", "1\n", "ignition\n" ] } ], "source": [ "count_up = ('ignition', 1, 2, 3, 4, 5, 6, 7, 8 ,9, 10)\n", "\n", "for count in reversed(count_up):\n", " print(count)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The while loop\n", "\n", "The `for` loop is very powerful and allows us to construct iterative calculations. When we use a `for` loop, we need to set up an iterator. A `while` loop, on the other hand, allows iteration until a conditional expression evaluates `False`.\n", "\n", "As an example of a `while` loop, we will calculate the length of a sequence before hitting a start codon." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The start codon starts at index 10\n" ] } ], "source": [ "# Define start codon\n", "start_codon = 'AUG'\n", "\n", "# Initialize sequence index\n", "i = 0\n", "\n", "# Scan sequence until we hit the start codon\n", "while seq[i:i+3] != start_codon:\n", " i += 1\n", " \n", "# Show the result\n", "print('The start codon starts at index', i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's walk through the `while` loop. The value of `i` is changing with each iteration, being incremented by one. Each time we consider doing another iteration, the conditional is checked: do the next three bases match the start codon? We set up the conditional to evaluate to `True` when the bases are *not* the start codon, so the iteration continues. In other words, iteration continues in a `while` loop until the conditional returns `False`.\n", "\n", "Notice that we sliced the string the same way we sliced lists and tuples. In the case of strings, a slice gives another string, i.e., a sequential collection of characters.\n", "\n", "Let's try looking for another codon.... But, let's actually not do that. If you run the code below, it will run forever and nothing will get printed to the screen.\n", "\n", "```python\n", "# Define codon of interest\n", "codon = 'GCC'\n", "\n", "# Initialize sequence index\n", "i = 0\n", "\n", "# Scan sequence until we hit the start codon, but DON'T DO THIS!!!!!\n", "while seq[i:i+3] != codon:\n", " i += 1\n", " \n", "# Show the result\n", "print('The codon starts at index', i)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reason this runs forever is that the conditional expression in the `while` statement never returns `False`. If we slice a string beyond the length of the string we get an empty string result." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "''" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "seq[100:103]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This does not equal the codon we're interested in, so the **`while`** loop keeps going. Forever. This is called an **infinite loop**, and you definitely to not want these in your code! We can fix it by making a conditional that will evaluate to `False` if we reach the end of the string." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Codon not found in sequence.\n" ] } ], "source": [ "# Define codon of interest\n", "codon = 'GCC'\n", "\n", "# Initialize sequence index\n", "i = 0\n", "\n", "# Scan sequence until we hit the start codon or the end of the sequence\n", "while seq[i:i+3] != codon and i < len(seq):\n", " i += 1\n", " \n", "# Show the result\n", "if i == len(seq):\n", " print('Codon not found in sequence.')\n", "else:\n", " print('The codon starts at index', i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **for** vs **while**\n", "\n", "Most anything that requires a loop can be done with either a `for` loop or a `while` loop, but there's a general rule of thumb for which type of loop to use. If you know how many times you have to do something (or if your program knows), use a `for` loop. If you don't know how many times the loop needs to run until you run it, use a `while` loop. For example, when we want to do something with each character in a string or each entry in a list, the program knows how long the sequence is and a `for` loop is more appropriate. In the previous examples, we don't know how long it will be before we hit the start codon; it depends on the sequence you put into the program. That makes it more suited to a `while` loop." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The **break** and **else** keywords\n", "\n", "Iteration stops in a `for` loop when the iterator is exhausted. It stops in a `while` loop when the conditional evaluates to `False`. These is another way to stop iteration: the `break` keyword. Whenever `break` is encountered in a `for` or `while` loop, the iteration halts and execution continues outside the loop. As an example, we'll do the calculation above with a `for` loop with a `break` instead of a `while` loop." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The start codon starts at index 10\n" ] } ], "source": [ "# Define start codon\n", "start_codon = 'AUG'\n", "\n", "# Scan sequence until we hit the start codon\n", "for i in range(len(seq)):\n", " if seq[i:i+3] == start_codon:\n", " print('The start codon starts at index', i)\n", " break\n", "else:\n", " print('Codon not found in sequence.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that we have an `else` block after the `for` loop. In Python, `for` and `while` loops can have an `else` statement after the code block to be evaluated in the loop. The contents of the `else` block are evaluated if the loop completes *without* encountering a `break`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A note about use of the word \"codon\"\n", "\n", "The astute biologists among you will note that we have not really been using the word \"codon\" properly here. We are taking it to mean any three consecutive bases, but the more precise definition of a codon means that it is three consecutive bases that code for an amino acid. That means that for a three-base sequence to be a codon, it must be in-register with the start codon. We will see how this distinction could lead to problems when we discuss **test-driven development**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing environment" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPython 3.7.7\n", "IPython 7.13.0\n", "\n", "jupyterlab 1.2.6\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -v -p jupyterlab" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }