Lesson 13: Dictionaries (hash tables)

(c) 2016 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook. You can download the notebook here.

Mapping objects and dictionaries

A mapping object allows an arbitrary collection of objects to be indexed by an arbitrary collection of values. That's a mouthful. It is easier to understand instead by comparing to a sequence.

Let's take a sequence of two strings, say a tuple containing a first and last name.

name = ('jane', 'doe')

We are restricted on how we reference the sequence. I.e., the first name is name[0], and the last name is name[1]. A mapping object could instead be indexed like name['first name'] and name['last name']. You can imagine this would be very useful! A classic example in biology might be looking up amino acids that are coded for by given codons. E.g., you might want

aa['CTT']

to give you 'Leucine'.

Dictionaries are the only built-in mapping type in Python. You might imagine that the Oxford English Dictionary might conveniently be stored as a dictionary. I.e., you would not want to store definitions that have to be referenced like

oed[103829]

Rather, you would like to get definitions like this:

ode['computer']

Importantly, note that Python dictionaries have no sense of order. Unlike lists and tuples, which are ordered collections, dictionaries have no order. The key is used to fetch the values from memory.

The only exception to this rule is if you use integers as your key values. If integers are used as the key values, then Python will automatically sort the dictionary such that key:value pairs are listed in ascending order with respect to the key.

Dictionary syntax

The syntax for creating a dictionary, as usual, is best seen through example.

In [5]:
my_dict = {'a': 6, 'b': 7, 'c': 27.6}
my_dict
Out[5]:
{'a': 6, 'b': 7, 'c': 27.6}

A dictionary is created using curly braces ({}). Each entry has a key, followed by a color, and then the value associated with the key. In the example above, the keys are all strings, which is the most common use case. Note that the items can be of any type; in the above example, they are ints and a float.

We could also create the dictionary using the built-in dict() function, which expects a tuple of 2-tuples, each one containing a key-value pair.

In [6]:
dict((('a', 6), ('b', 7), ('c', 27.6)))
Out[6]:
{'a': 6, 'b': 7, 'c': 27.6}

Finally, we can make a dictionary with keyword arguments to the dict() function.

In [7]:
dict(a='yes', b='no', c='maybe')
Out[7]:
{'a': 'yes', 'b': 'no', 'c': 'maybe'}

We do not need to have strings as the keys. In fact, any immutable object can be a key.

In [8]:
my_dict = {0: 'zero',
           1.7: [1, 2, 3],
           (5, 6, 'dummy string'): 3.14,
           'strings are immutable': 42}
my_dict
Out[8]:
{0: 'zero',
 1.7: [1, 2, 3],
 (5, 6, 'dummy string'): 3.14,
 'strings are immutable': 42}

However, mutable objects cannot be used as keys.

In [9]:
my_dict = {'mutable is ok': 1,
           ['immutable', 'not', 'ok']: 5}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-13687bbbeeaa> in <module>()
      1 my_dict = {'mutable is ok': 1,
----> 2            ['immutable', 'not', 'ok']: 5}

TypeError: unhashable type: 'list'

Useful dictionaries in bioinformatics

It might be useful to quickly look up 3-letter amino acid codes. Dictionaries are particularly useful for this.

In [10]:
aa_dict = {'A': 'Ala',
           'R': 'Arg',
           'N': 'Asn',
           'D': 'Asp',
           'C': 'Cys',
           'Q': 'Gln',
           'E': 'Glu',
           'G': 'Gly',
           'H': 'His',
           'I': 'Ile',
           'L': 'Leu',
           'K': 'Lys',
           'M': 'Met',
           'F': 'Phe',
           'P': 'Pro',
           'S': 'Ser',
           'T': 'Thr',
           'W': 'Trp',
           'Y': 'Tyr',
           'V': 'Val'}

Another useful dictionary would contain the set of codons and the amino acids they code for. This is build in the code below using the zip() function we learned before. To see the logic on how this is constructed, see the codon table here.

In [11]:
# The set of DNA bases
bases = ['T', 'C', 'A', 'G']

# Build list of codons
codon_list = []
for first_base in bases:
    for second_base in bases:
        for third_base in bases:
            codon_list += [first_base + second_base + third_base]

# The amino acids that are coded for (* = STOP codon)
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'

# Build dictionary from tuple of 2-tuples (technically an iterator, but it works)
codons = dict(zip(codon_list, amino_acids))

# Show that we did it
print(codons)
{'CGC': 'R', 'AAC': 'N', 'AGG': 'R', 'CCC': 'P', 'ACG': 'T', 'ACC': 'T', 'TAG': '*', 'CAA': 'Q', 'TGA': '*', 'GGC': 'G', 'CGT': 'R', 'CGA': 'R', 'CTC': 'L', 'TAT': 'Y', 'TGT': 'C', 'GTC': 'V', 'GCC': 'A', 'AAG': 'K', 'TAA': '*', 'TCT': 'S', 'CAG': 'Q', 'AGC': 'S', 'TTA': 'L', 'ATC': 'I', 'CAC': 'H', 'AAT': 'N', 'GTA': 'V', 'CCA': 'P', 'GAA': 'E', 'ACA': 'T', 'AGT': 'S', 'TGC': 'C', 'TTG': 'L', 'ACT': 'T', 'TGG': 'W', 'GAC': 'D', 'GGG': 'G', 'CCT': 'P', 'GCG': 'A', 'TTT': 'F', 'CTT': 'L', 'GTG': 'V', 'GGT': 'G', 'GAG': 'E', 'ATA': 'I', 'AGA': 'R', 'TCA': 'S', 'GCA': 'A', 'ATG': 'M', 'TTC': 'F', 'AAA': 'K', 'ATT': 'I', 'TCG': 'S', 'CAT': 'H', 'CTA': 'L', 'TCC': 'S', 'CCG': 'P', 'GGA': 'G', 'GAT': 'D', 'CGG': 'R', 'GCT': 'A', 'GTT': 'V', 'CTG': 'L', 'TAC': 'Y'}

These two dictionaries are particularly useful, so I put them in a little module in the bootcamp repo that you can import.

In [12]:
import bioinfo_dicts as bd
bd.aa, bd.codons
Out[12]:
({'A': 'Ala',
  'C': 'Cys',
  'D': 'Asp',
  'E': 'Glu',
  'F': 'Phe',
  'G': 'Gly',
  'H': 'His',
  'I': 'Ile',
  'K': 'Lys',
  'L': 'Leu',
  'M': 'Met',
  'N': 'Asn',
  'P': 'Pro',
  'Q': 'Gln',
  'R': 'Arg',
  'S': 'Ser',
  'T': 'Thr',
  'V': 'Val',
  'W': 'Trp',
  'Y': 'Tyr'},
 {'AAA': 'K',
  'AAC': 'N',
  'AAG': 'K',
  'AAT': 'N',
  'ACA': 'T',
  'ACC': 'T',
  'ACG': 'T',
  'ACT': 'T',
  'AGA': 'R',
  'AGC': 'S',
  'AGG': 'R',
  'AGT': 'S',
  'ATA': 'I',
  'ATC': 'I',
  'ATG': 'M',
  'ATT': 'I',
  'CAA': 'Q',
  'CAC': 'H',
  'CAG': 'Q',
  'CAT': 'H',
  'CCA': 'P',
  'CCC': 'P',
  'CCG': 'P',
  'CCT': 'P',
  'CGA': 'R',
  'CGC': 'R',
  'CGG': 'R',
  'CGT': 'R',
  'CTA': 'L',
  'CTC': 'L',
  'CTG': 'L',
  'CTT': 'L',
  'GAA': 'E',
  'GAC': 'D',
  'GAG': 'E',
  'GAT': 'D',
  'GCA': 'A',
  'GCC': 'A',
  'GCG': 'A',
  'GCT': 'A',
  'GGA': 'G',
  'GGC': 'G',
  'GGG': 'G',
  'GGT': 'G',
  'GTA': 'V',
  'GTC': 'V',
  'GTG': 'V',
  'GTT': 'V',
  'TAA': '*',
  'TAC': 'Y',
  'TAG': '*',
  'TAT': 'Y',
  'TCA': 'S',
  'TCC': 'S',
  'TCG': 'S',
  'TCT': 'S',
  'TGA': '*',
  'TGC': 'C',
  'TGG': 'W',
  'TGT': 'C',
  'TTA': 'L',
  'TTC': 'F',
  'TTG': 'L',
  'TTT': 'F'})

A dictionary is an implementation of a hash table

It is useful to stop and think about how a dictionary works. Let's create a dictionary and look at where the values are stored in memory.

In [13]:
# Create dictionary
my_dict = {'a': 6, 'b': 7, 'c':12.6}

# Find where they are stored
print(id(my_dict))
print(id(my_dict['a']))
print(id(my_dict['b']))
print(id(my_dict['c']))
4411194760
4297148688
4297148720
4394634960

So, each entry in the dictionary is stored at a different location in memory. The dictionary itself also have its own address. So, when I index a dictionary with a key, how does the dictionary know which address in memory to use to fetch the value I am interested in?

Dictionaries use a hash function to do this. A hash function converts its input to an integer. For example, we can use Python's built-in hash function to convert the keys to integers.

In [14]:
hash('a'), hash('b'), hash('c')
Out[14]:
(2661603812129973265, 2874210410368167821, -6715514952858417912)

Under the hood, Python then converts these integers to integers that could correspond to locations in memory. A collection of elements that can be indexed this way is called a hash table. This is a very common data structure in computing. Wikipedia has a pretty good discussion on them.

Now, given what you know about how dictionaries work, why do you think immutable objects are not acceptable as keys?

Dictionaries are mutable

The title says it all! Dictionaries are mutable. This means that they can be changed in place. For example, if we want to add an element to a dictionary, we use simple syntax.

In [15]:
# Remind ourselves what the dictionary is
print(my_dict)

# Add an entry
my_dict['d'] = 'Bootcamp is so much fun!'

# Look at dictionary again
print(my_dict)

# Change an entry
my_dict['a'] = 'I was not satisfied with entry a.'

# Look at it again
print(my_dict)
{'c': 12.6, 'a': 6, 'b': 7}
{'c': 12.6, 'a': 6, 'b': 7, 'd': 'Bootcamp is so much fun!'}
{'c': 12.6, 'a': 'I was not satisfied with entry a.', 'b': 7, 'd': 'Bootcamp is so much fun!'}

Membership operators with dictionaries

The in and not in operators work with dictionaries, but both only query keys. We see this again by example.

In [16]:
# Make a fresh my_dict
my_dict = {'a': 1, 'b': 2, 'c': 3}

# in works with keys
'b' in my_dict, 'd' in my_dict, 'e' not in my_dict
Out[16]:
(True, False, True)
In [17]:
# Try it with values
2 in my_dict
Out[17]:
False

Yup! We get False. Why? Because 2 is not a key in my_dict.

We can also iterate over the keys in a dictionary.

In [18]:
for key in my_dict:
    print(key, ':', my_dict[key])
c : 3
a : 1
b : 2

The best, and preferred, method, is to iterate over key,item pairs in a dictionary using the items of a dictionary.

In [21]:
for key, item in my_dict.items():
    print(key, item)
c 3
a 1
b 2

Note, however, that like lists, the items that come out of the my_dict.items() iterator are not items in the dictionary. If you make changes within the for loop, you will not change entries in the dictionary.

In [22]:
for key, item in my_dict.items():
    item = 'this string will not be in dictionary.'
    
my_dict
Out[22]:
{'a': 1, 'b': 2, 'c': 3}

You will, though, if you use the keys.

In [23]:
for key, _ in my_dict.items():
    my_dict[key] = 'this will be in the dictionary.'
    
my_dict
Out[23]:
{'a': 'this will be in the dictionary.',
 'b': 'this will be in the dictionary.',
 'c': 'this will be in the dictionary.'}

Built-in functions for dictionaries

The built-in len() function and del operation work on dictionaries.

  • len(d) gives the number of entries in dictionary d
  • del d[k] deletes entry with key k from dictionary d

This is the first time we've encountered the del keyword. This keyword is used to delete variables and their values from memory. The del keyword can also be to delete items from lists. Let's see things in practice.

In [19]:
# Create my_list and my_dict for reference
my_dict = dict(a=1, b=2, c=3, d=4)
my_list = [1, 2, 3, 4]

# Print them
print('my_dict:', my_dict)
print('my_list:', my_list)

# Get lengths
print('length of my_dict:', len(my_dict))
print('length of my_list:', len(my_list))

# Delete a key from my_dict
del my_dict['b']

# Delete entry from my_list
del my_list[1]

# Show post-deleted objects
print('post-deleted my_dict:', my_dict)
print('post-deleted my_list:', my_list)
my_dict: {'d': 4, 'c': 3, 'b': 2, 'a': 1}
my_list: [1, 2, 3, 4]
length of my_dict: 4
length of my_list: 4
post-deleted my_dict: {'d': 4, 'c': 3, 'a': 1}
post-deleted my_list: [1, 3, 4]

Note, though, that you cannot delete an item from a tuple.

In [20]:
my_tuple = (1, 2, 3, 4)
del my_tuple[1]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-477cc61ded22> in <module>()
      1 my_tuple = (1, 2, 3, 4)
----> 2 del my_tuple[1]

TypeError: 'tuple' object doesn't support item deletion

Dictionary methods

Dictionaries have several built-in methods. Following are a few of them, assuming the dictionary is d.

method effect
d.keys() return keys as a tuple
d.pop(key) return value associated with key and delete key from d
d.values() return the (key, value) pairs in d as a tuple

Let's try these out.

In [21]:
my_dict = dict(a=1, b=2, c=3, d=4)
my_dict.keys()
Out[21]:
dict_keys(['d', 'c', 'b', 'a'])

Note that this is a dict_keys object. We cannot index it. If, say, we wanted to sort the keys and have them index-able, we would have to convert them to a list.

In [22]:
sorted(list(my_dict.keys()))
Out[22]:
['a', 'b', 'c', 'd']

Now let's try popping an entry out of the dictionary.

In [23]:
my_dict.pop('c')
Out[23]:
3
In [24]:
my_dict
Out[24]:
{'a': 1, 'b': 2, 'd': 4}

...and, as we expect, key 'c' is now deleted, and its value was returned in the call to my_dict.pop('c'). Finally, we can look at the values.

In [25]:
my_dict.values()
Out[25]:
dict_values([4, 2, 1])

We get a dict_values object, similar to the dict_keys object we got with the my_dict.keys() method.

You can get more information about build-in methods from the Python documentation.

List methods

As you may guess, the dictionary method pop() has an analog that works for lists. (Why don't the dictionary key() and values() methods work for lists?) We take this opportunity to introduce a few more useful list methods. Imagine the list is called s.

method effect
s.pop(i) return value at index i and delete it from the list
s.append(x) Put x at the end of the list
s.insert(i, x) Insert x at index i in the list
s.remove(x) Remove the first occurence of x from the list
s.reverse() Reverse the order of items in the list

Using dictionaries as kwargs

A nifty feature of dictionaries is that they can be passed into functions as keyword arguments. We covered named keyword arguments in the Intro to functions lesson. In addition to the named keyword arguments, a function can take in arbitrary keyword arguments (not arbitrary non-keyword arguments). This is specified in the function definition by including a last argument with a double-asterisk, **.

In [26]:
def my_fun(a, b, **kwargs):
    """Concatenate sequences."""
    seq = a + b

    for key in kwargs:
        seq += kwargs[key]
        
    return seq

Note that kwargs ends up being passed in as a dictionary. Let's try it.

In [27]:
my_fun('TGACAC', 'CAGGGA', c='GGGGGGGGG', d='AAAATTTTT')
Out[27]:
'TGACACCAGGGAAAAATTTTTGGGGGGGGG'

Now, imagine we have a dictionary that contains our values.

In [28]:
my_dict = {'a': 'TGACAC', 
           'b': 'CAGGGA', 
           'c': 'GGGGGGGGG', 
           'd': 'AAAATTTTT'}

We can now pass this directly into the function by preceding it with a double asterisk.

In [29]:
my_fun(**my_dict)
Out[29]:
'TGACACCAGGGAGGGGGGGGGAAAATTTTT'

Beautiful! This example is kind of trivial, but you can imagine that it can come in handy, e.g. with large sets of sequence fragments that you read in from a file.