Lesson 14: File I/O

(c) 2016 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This tutorial was generated from a Jupyter notebook.

In [21]:
# The only module we need is os from the standard library
import os

Reading data in from files and then writing your results out again is one of the most common tasks in scientific computing. In this tutorial, we will learn about some of Python's file I/O capabilities. We will use a PDB file as an example. The PDB file contains the solution NMR structure of the tetramerization domain of p53. It is stored in the file data/1OLG.pdb. Note that 1OLG is its unique Protein Data Bank identifier.

File objects

To open a file, we use the built-in open() function. We'll invoke this function by example and look at its output.

In [6]:
# Open file
f = open('data/1OLG.pdb', 'r')

# What is f?
f
Out[6]:
<_io.TextIOWrapper name='data/1OLG.pdb' mode='r' encoding='UTF-8'>

So, f is some weird-looking type. It is a Python file object, which has methods and attributes, just like any other object. We'll explore those in a moment, but first, let's look at how we opened the file. The first argument to open() is a string that has the name of the file, with the full path if necessary. The second argument is a string that says what we will be doing with the file. That is, are we reading from or writing to the file? The possible strings for this second argument are

string                            meaning
'r'                               open a text file for reading
'w'                               create a text file and open it for writing (an existing file is emptied)
'a'                               open a text file for appending, creating it if it does not exist
'r+'                              open an existing text file for reading and writing
append 'b' to any of the above    same as above, except for binary files

We will mostly be working with text files in the bootcamp, so the first three are the most useful. A big warning, though....

Trying to open an existing file with `'w'` will wipe it out and create a new file.
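To make the warning concrete, here is a minimal sketch contrasting 'a' and 'w'. It works in a temporary directory so nothing real gets wiped out. The file name demo.txt is made up for illustration, and the with blocks (covered later in this lesson) just make sure each file gets closed.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file name

    # Create the file with some content
    with open(path, 'w') as f:
        f.write('original contents\n')

    # Opening with 'a' keeps what is there and adds to the end
    with open(path, 'a') as f:
        f.write('appended line\n')
    with open(path, 'r') as f:
        after_append = f.read()

    # Opening with 'w' empties the file right away, even if we write nothing
    with open(path, 'w') as f:
        pass
    with open(path, 'r') as f:
        after_w = f.read()

print(repr(after_append))
print(repr(after_w))
```

The second print shows an empty string: simply opening the file with 'w' was enough to discard its contents.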

Reading data out of the file with file object methods

Now, let's look at the file object. You can type f. followed by tab to see its attributes and methods. We will focus on the methods f.read() and f.readlines(). What do they do?

method           task
f.read()         Read the entire contents of the file into a string
f.readlines()    Read the entire file into a list with each item being a string representing a line

First, we'll try using the first method to get a single string with the entire contents of the file.

In [7]:
# Read file into string
f_str = f.read()

# Let's look at the first 1000 characters
f_str[:1000]
Out[7]:
'HEADER    ANTI-ONCOGENE                           13-JUN-94   1OLG              \nTITLE     HIGH-RESOLUTION SOLUTION STRUCTURE OF THE OLIGOMERIZATION             \nTITLE    2 DOMAIN OF P53 BY MULTI-DIMENSIONAL NMR                               \nCOMPND    MOL_ID: 1;                                                            \nCOMPND   2 MOLECULE: TUMOR SUPPRESSOR P53 (OLIGOMERIZATION DOMAIN);             \nCOMPND   3 CHAIN: A, B, C, D;                                                   \nCOMPND   4 ENGINEERED: YES                                                      \nSOURCE    MOL_ID: 1;                                                            \nSOURCE   2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;                                   \nSOURCE   3 ORGANISM_COMMON: HUMAN;                                              \nSOURCE   4 ORGANISM_TAXID: 9606                                                 \nKEYWDS    ANTI-ONCOGENE                                                         \nEXPDTA    SOLUTION NMR      '

We see lots of \n, which signifies a new line. The backslash is known as an escape character, meaning that the n after it does not signify the letter n, but that \n together means a new line.
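As a quick illustration of the escape character, the sketch below uses a made-up two-line string. It shows that \n counts as a single character, and that print() renders it while repr() displays it in escaped form.

```python
s = 'line one\nline two'

# \n is typed as two characters but stored as a single newline character
newline_length = len('\n')
print(newline_length)

# print() renders the newline, while repr() shows the escape sequence
print(s)
print(repr(s))

# To get a literal backslash followed by n, escape the backslash itself
backslash_n = '\\n'
print(backslash_n, len(backslash_n))
```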

Now, let's try reading it in as a list.

In [8]:
# Read contents of the file in as a list
f_list = f.readlines()

# Look at the list
f_list
Out[8]:
[]

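Wait a minute! I got an empty list! That is because f.read() already moved us to the end of the file, and you can only scan through a file object once without "rewinding." To rewind, we use the f.seek() method. This method takes as its argument the position in the file you want to jump to. For text files, the reliably useful values are 0 (the beginning) and positions previously returned by f.tell(). To go to the beginning, we do f.seek(0). Let's try again.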

In [9]:
# Go to the beginning of the file
f.seek(0)

# Read the contents in as a list
f_list = f.readlines()

# Check out the first 10 entries
f_list[:10]
Out[9]:
['HEADER    ANTI-ONCOGENE                           13-JUN-94   1OLG              \n',
 'TITLE     HIGH-RESOLUTION SOLUTION STRUCTURE OF THE OLIGOMERIZATION             \n',
 'TITLE    2 DOMAIN OF P53 BY MULTI-DIMENSIONAL NMR                               \n',
 'COMPND    MOL_ID: 1;                                                            \n',
 'COMPND   2 MOLECULE: TUMOR SUPPRESSOR P53 (OLIGOMERIZATION DOMAIN);             \n',
 'COMPND   3 CHAIN: A, B, C, D;                                                   \n',
 'COMPND   4 ENGINEERED: YES                                                      \n',
 'SOURCE    MOL_ID: 1;                                                            \n',
 'SOURCE   2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;                                   \n',
 'SOURCE   3 ORGANISM_COMMON: HUMAN;                                              \n']

We see that each entry is a line, including the newline character. To look at lines in files, the rstrip() method for strings can come in handy. It strips all whitespace, including newlines, from the end of a string.

In [11]:
f_list[0].rstrip()
Out[11]:
'HEADER    ANTI-ONCOGENE                           13-JUN-94   1OLG'

Much nicer!
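One thing to keep in mind is that rstrip() with no arguments removes trailing spaces as well as the newline. For fixed-width formats like PDB, where column positions carry meaning, you may want to drop only the newline. A small sketch with a shortened, made-up line:

```python
line = 'HEADER    ANTI-ONCOGENE   \n'

# No argument: drops the newline and the trailing spaces
no_trailing_ws = line.rstrip()

# With an argument: drops only the characters listed, here just the newline
no_newline = line.rstrip('\n')

# strip() works on both ends of the string
both_ends = '  padded \n'.strip()

print(repr(no_trailing_ws))
print(repr(no_newline))
print(repr(both_ends))
```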

Now, for something very important. Whenever we open a file, we must close it when we are done with it. This is important because a file left open ties up operating system resources, and, when writing, data you have written may not actually make it to disk until the file is closed. We can check to see if it is closed.

In [12]:
f.closed
Out[12]:
False

Ok! Better close it. We just use the f.close() method.

In [13]:
f.close()

# Is it closed?
f.closed
Out[13]:
True

Using context managers with files

Python has a wonderful keyword, with. This keyword enables context management. Upon entry into a with block, some setup takes place (here, the file is opened and given a variable name). Upon exit, some cleanup takes place. For file objects created with open(), the file is automatically closed upon exit, even if there is an error. This is important. Without context management, if your program raises an exception before you have a chance to close the file, it won't get closed and you could be in trouble. If you use context management, the file will still get closed.

Let's see how it works.

In [15]:
with open('data/1OLG.pdb', 'r') as f:
    f_lines = f.readlines()
    print('In the with block, is the file closed?', f.closed)
    
print('Out of the with block, is the file closed?', f.closed)

# Check the first three lines
f_lines[:3]
In the with block, is the file closed? False
Out of the with block, is the file closed? True
Out[15]:
['HEADER    ANTI-ONCOGENE                           13-JUN-94   1OLG              \n',
 'TITLE     HIGH-RESOLUTION SOLUTION STRUCTURE OF THE OLIGOMERIZATION             \n',
 'TITLE    2 DOMAIN OF P53 BY MULTI-DIMENSIONAL NMR                               \n']

The syntax is almost like English. We do what is written in the with block with an open file object that we will name f.

The results look good! In general, you should use context management when working with files. It keeps you out of trouble with open files. And it is much cleaner. This is worth making official:

Use context management using **`with`** when working with files.
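To see the "even if there is an error" part in action, here is a minimal sketch. It raises an exception on purpose inside a with block and then checks that the file was closed anyway. It also writes out by hand roughly what the with block does for us, using try/finally. The file name demo.txt and the temporary directory are stand-ins for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')  # hypothetical file name
    with open(path, 'w') as f:
        f.write('some text\n')

    # An exception inside the with block still closes the file
    try:
        with open(path, 'r') as f:
            raise ValueError('something went wrong while processing')
    except ValueError:
        closed_after_error = f.closed

    # Roughly what the with block does for us, written out by hand
    g = open(path, 'r')
    try:
        contents = g.read()
    finally:
        g.close()
    closed_by_finally = g.closed

print(closed_after_error, closed_by_finally)
```

Both values print as True. The with version is shorter and harder to get wrong, which is the main argument for using it.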

Reading line-by-line

What if we do not want to read the entire file into a list? For example, if a file is several gigabytes, we do not want to spend all of our RAM storing a list. Instead, we can read it line-by-line. Conveniently, the file object can be used as an iterator.

In [17]:
# Print the first ten lines of the file
with open('data/1OLG.pdb', 'r') as f:
    counter = 0
    for line in f:
        print(line.rstrip())
        counter += 1
        if counter >= 10:
            break
HEADER    ANTI-ONCOGENE                           13-JUN-94   1OLG
TITLE     HIGH-RESOLUTION SOLUTION STRUCTURE OF THE OLIGOMERIZATION
TITLE    2 DOMAIN OF P53 BY MULTI-DIMENSIONAL NMR
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: TUMOR SUPPRESSOR P53 (OLIGOMERIZATION DOMAIN);
COMPND   3 CHAIN: A, B, C, D;
COMPND   4 ENGINEERED: YES
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE   3 ORGANISM_COMMON: HUMAN;

Alternatively, we can use the method f.readline() to read a single line in the file and return it as a string.

In [19]:
# Print the first ten lines of the file
with open('data/1OLG.pdb', 'r') as f:
    counter = 0
    while counter < 10:
        print(f.readline().rstrip())
        counter += 1
HEADER    ANTI-ONCOGENE                           13-JUN-94   1OLG
TITLE     HIGH-RESOLUTION SOLUTION STRUCTURE OF THE OLIGOMERIZATION
TITLE    2 DOMAIN OF P53 BY MULTI-DIMENSIONAL NMR
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: TUMOR SUPPRESSOR P53 (OLIGOMERIZATION DOMAIN);
COMPND   3 CHAIN: A, B, C, D;
COMPND   4 ENGINEERED: YES
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE   3 ORGANISM_COMMON: HUMAN;
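Two details are worth knowing about the loops above. First, f.readline() returns an empty string once the end of the file is reached, so the while loop would print blank lines for a file shorter than ten lines. Second, the standard library function itertools.islice() can take the first few lines without a manual counter. The sketch below uses io.StringIO as a stand-in for an open text file so that it runs without any data file. The three-line content is made up.

```python
import io
import itertools

# io.StringIO stands in for an open text file so the sketch runs anywhere
fake_file = io.StringIO('line 1\nline 2\nline 3\n')

# Take at most the first two lines without reading the rest
first_two = [line.rstrip() for line in itertools.islice(fake_file, 2)]

# readline() picks up where the iteration stopped...
third = fake_file.readline()

# ...and returns an empty string once the end of the file is reached
at_end = fake_file.readline()

print(first_two)
print(repr(third))
print(repr(at_end))
```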

Writing to a file

Writing to a file has similar syntax. We already saw how to open a file for writing. Again, context management is useful. However, before opening a file for writing, we should check that a file of the same name does not already exist, since opening it with 'w' would wipe it out. The os.path module, which we will visit when we learn about scripting later in the bootcamp, is useful here. The os.path.isfile() function checks to see if a file exists.

In [22]:
os.path.isfile('data/1OLG.pdb')
Out[22]:
True
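As an aside, Python 3.3 and later also offer the mode string 'x', which is not covered in this lesson. It creates a file for writing and raises a FileExistsError if a file of that name already exists, so the check and the open happen in one step. Here is a minimal sketch in a temporary directory. The file name mastery_demo.txt is made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mastery_demo.txt')  # hypothetical file name

    # First attempt: the file does not exist yet, so 'x' creates it
    with open(path, 'x') as f:
        f.write('first version\n')

    # Second attempt: the file exists, so open() raises FileExistsError
    try:
        with open(path, 'x') as f:
            f.write('this would have overwritten the file\n')
        overwrite_blocked = False
    except FileExistsError:
        overwrite_blocked = True

    # The original contents are untouched
    with open(path, 'r') as f:
        contents = f.read()

print(overwrite_blocked)
print(repr(contents))
```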

So, now we're ready to open a file to write.

In [23]:
if os.path.isfile('mastery.txt'):
    raise RuntimeError('File mastery.txt already exists.')

with open('mastery.txt', 'w') as f:
    f.write('This is my file.')
    f.write('There are many like it, but this one is mine.')
    f.write('I must master my file like I must master my life.')

Note that we can use the f.write() method to write strings to a file. Let's look at the file contents.

In [24]:
!cat mastery.txt
This is my file.There are many like it, but this one is mine.I must master my file like I must master my life.

Ah! There are no newlines! When writing to a file, unlike when you use the print() function, you must include the newline characters. Let's try again, intentionally obliterating our first attempt.

In [25]:
with open('mastery.txt', 'w') as f:
    f.write('This is my file.\n')
    f.write('There are many like it, but this one is mine.\n')
    f.write('I must master my file like I must master my life.\n')
    
!cat mastery.txt
This is my file.
There are many like it, but this one is mine.
I must master my file like I must master my life.

That's better. Note also that f.write() only takes strings as arguments. You cannot pass numbers. They must be converted to strings first.

In [26]:
# This will result in an exception
with open('gimme_phi.txt', 'w') as f:
    f.write('The golden ratio is φ = ')
    f.write(1.61803398875)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-26-48bf6cb2f626> in <module>()
      2 with open('gimme_phi.txt', 'w') as f:
      3     f.write('The golden ratio is φ = ')
----> 4     f.write(1.61803398875)

TypeError: write() argument must be str, not float

Yup. It must be a string. Let's try again.

In [27]:
with open('gimme_phi.txt', 'w') as f:
    f.write('The golden ratio is φ = ')
    f.write('{phi:.8f}'.format(phi=1.61803398875))

!cat gimme_phi.txt
The golden ratio is φ = 1.61803399

That works!
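There are a few other ways to get a number into a file, sketched below. str() gives a default conversion, an f-string (Python 3.6 and later) applies the same format specifiers as the format() method, and print() accepts a file keyword argument, in which case it converts its arguments and appends the newline for us. Here, io.StringIO stands in for a file opened with 'w' so the sketch does not create any files. The variable name out is arbitrary.

```python
import io

phi = 1.61803398875

# io.StringIO stands in for a file opened with 'w'
out = io.StringIO()

# str() gives a default conversion; an f-string controls the format
out.write('phi is about ' + str(phi) + '\n')
out.write(f'phi to 8 decimal places: {phi:.8f}\n')

# print() converts its arguments and appends the newline for us
print('phi again:', phi, file=out)

written = out.getvalue()
print(written)
```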

An exercise: extract atomic coordinates for first chain in tetramer

As an example of how to do file I/O, we will take the PDB file and extract only the ATOM records for the first chain of the tetramer and write only those entries to a new file.

It is useful to know that, according to the PDB format specification, the chain ID is in column 22 of an ATOM record. Because Python indexing starts at zero, that is index 21 of the line when we treat it as a string.

We also conveniently use the fact that we can have multiple files open in our with block, separating them with commas.

In [29]:
with open('data/1OLG.pdb', 'r') as f, open('atoms_chain_A.txt', 'w') as f_out:
    # Get all the lines
    lines = f.readlines()

    # Put the ATOM lines from chain A in new file
    for line in lines:
        if len(line) > 21 and line[:4] == 'ATOM' and line[21] == 'A':
            f_out.write(line)

Let's see how we did!

In [34]:
!head -10 atoms_chain_A.txt
ATOM      1  N   LYS A 319      18.634  25.437  10.685  1.00  4.81           N  
ATOM      2  CA  LYS A 319      17.984  25.295   9.354  1.00  4.32           C  
ATOM      3  C   LYS A 319      18.160  23.876   8.818  1.00  3.74           C  
ATOM      4  O   LYS A 319      19.259  23.441   8.537  1.00  3.67           O  
ATOM      5  CB  LYS A 319      18.609  26.282   8.371  1.00  4.67           C  
ATOM      6  CG  LYS A 319      18.003  26.056   6.986  1.00  5.15           C  
ATOM      7  CD  LYS A 319      16.476  26.057   7.091  1.00  5.90           C  
ATOM      8  CE  LYS A 319      16.014  27.341   7.784  1.00  6.51           C  
ATOM      9  NZ  LYS A 319      16.388  28.518   6.952  1.00  7.33           N  
ATOM     10  H1  LYS A 319      18.414  24.606  11.281  1.00  5.09           H  

In [35]:
!tail -10 atoms_chain_A.txt
ATOM    689  HD2 PRO A 359       0.183  25.663  13.542  1.00  4.71           H  
ATOM    690  HD3 PRO A 359       0.246  23.956  13.062  1.00  4.53           H  
ATOM    691  N   GLY A 360      -3.984  26.791  10.832  1.00  5.45           N  
ATOM    692  CA  GLY A 360      -4.489  28.138  10.445  1.00  5.95           C  
ATOM    693  C   GLY A 360      -5.981  28.236  10.765  1.00  6.77           C  
ATOM    694  O   GLY A 360      -6.401  27.621  11.732  1.00  7.24           O  
ATOM    695  OXT GLY A 360      -6.679  28.924  10.039  1.00  7.15           O  
ATOM    696  H   GLY A 360      -4.589  26.020  10.828  1.00  5.72           H  
ATOM    697  HA2 GLY A 360      -3.950  28.896  10.995  1.00  5.99           H  
ATOM    698  HA3 GLY A 360      -4.341  28.288   9.386  1.00  6.05           H  

Nice!
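Since ATOM records are fixed-width, the same slicing idea we used for the chain ID can pull out other fields. The sketch below parses one of the lines from the output above. The function name parse_atom_line is made up for illustration, and the column positions (atom name in columns 13-16, residue name in 18-20, and x, y, z in 31-38, 39-46, and 47-54) are my reading of the PDB format specification, so check them against the specification before relying on them.

```python
# One ATOM record copied from the output above
line = 'ATOM      1  N   LYS A 319      18.634  25.437  10.685  1.00  4.81           N  '

def parse_atom_line(line):
    """Pull a few fixed-width fields out of a PDB ATOM record.

    Hypothetical helper; slices are zero-based versions of the
    one-based column numbers in the PDB format specification.
    """
    return {
        'atom_name': line[12:16].strip(),
        'residue': line[17:20],
        'chain': line[21],
        'x': float(line[30:38]),
        'y': float(line[38:46]),
        'z': float(line[46:54]),
    }

atom = parse_atom_line(line)
print(atom)
```

Note that float() ignores the leading spaces in each slice, so no extra stripping is needed for the coordinates.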