Lesson 36: Introduction to scripting


[1]:
import datetime
import glob
import os
import re
import shutil
import subprocess

Scripting is a way to automate your tasks. =For example, let’s say you are a lawyer and you need to redact a bunch of text documents. =Two possible ways to do this are:

  1. Open each document on your computer. Search for the text “Jeffrey Lebowski.” Wherever you find it, change it to “xxxxxxx.” Save the document.

  2. Tell your assistant to open each document on your computer. Search for the text “Jeffrey Lebowski.” Wherever he finds it, change it to “xxxxxxx.” Save the document.

I suspect that most of you would choose option 2. Scripting is basically doing option 2, except your assistant is your computer.

Because of its simple syntax and modules to parse text and communicate with the operating system in its standard library, Python is a good scripting language. In this tutorial, we will write some Python scripts to do tasks we might encounter in biology. (We already saw some examples of some of the text processing in previous lessons.)

Our example scripting problem

As tends to be our philosophy, we will learn some scripting procedures by example. In the example we consider, we will parse a directory of images coming from a Leica SP2 confocal microscope. These microscopes came out about twenty years ago, and the software is also a bit old. It is very common to use older instruments, especially high end ones like this one, which can cost hundreds of thousands of dollars, in research.

One problem with the old software used for image acquisition on this microscope is that the file names are stored in the following format:

prefix_IMAGENAME_ch00.tif
prefix_SERIESNAME_t00_z000_ch00.tif

Here, prefix is chosen by the user. If we are taking a single image (not a time series or \(z\)-stack), then IMAGENAME is assigned by the software, since you may have multiple images (or series of images) with the same prefix. After ch are two digits indicating which channel is being used (which absorbance/emissions wavelengths are being used).

For series of images, the two digits after the t character indicate the frame number (i.e., time point). The digits after z indicate the z position, and the digits after ch are as in the single image example.

An inherent problem with this naming convention becomes clear when we have more than 100 images in a time series. For example, the 99th, 100th, and 101st images would be

prefix_SERIESNAME_t98_z000_ch00.tif
prefix_SERIESNAME_t99_z000_ch00.tif
prefix_SERIESNAME_t100_z000_ch00.tif

So, if we were to put all of our images in alphabetical order, which is how many image processing software packages read in images, the order would be

prefix_SERIESNAME_t00_z000_ch00.tif
prefix_SERIESNAME_t01_z000_ch00.tif
                 ⋮
prefix_SERIESNAME_t09_z000_ch00.tif
prefix_SERIESNAME_t100_z000_ch00.tif
prefix_SERIESNAME_t101_z000_ch00.tif
                 ⋮

The frames are now out of order! We should pad the frame number in the file name with more zeros.

You can access a sample data set in the ~git/bootcamp/data/leica_tiffs/ folder. This data set contains an actual set of images I obtained from a Leica SP2. For this data set, prefix = stage9, referring to the stage of development of the cells in the images. Our goals in this lesson are to:

  1. Rename the files so that they can be read cleanly in alphabetical order.

  2. Parse the metadata (found in the file stage9.txt) to find the interpixel distance for these images.

  3. Parse the metadata to find the time points for each image.

We’ll start with task 1.

Renaming the files

You may remember from our lesson on string formatting that if we want leading zeros, we use the %04d format string for, e.g., an integer for a total of four digits.

[2]:
print('This digit has {0:04d} digits.'.format(25))
This digit has 0025 digits.

To make sure we don’t run into any problems, we’ll rename the time points to have eight total digits. Our strategy is to get a list of all files in the directory. We’ll then generate a list of names that the files should have. We’ll then instruct the operating system, via the mv command, to rename the files.

Now, before scripting to rename, move, or otherwise process any of your data, always make a backup copy stashed somewhere out of the way in case you make a mistake. Let me reiterate that.

Always have a backup of your data before parsing with scripts..

Getting a list of the directory contents: the os and glob modules

The os module from the standard library has lots of great tools for working with files and talking to the OS. We’ll generate a list of files in our Leica directory, demonstrating some of the useful features of the os module.

[3]:
# Specify Leica directory
leica_dir = 'data/leica_tiffs'

# Get a list of all files in the directory
file_list = os.listdir(leica_dir)

# Look at the list (just first few entries)
file_list[:10]
[3]:
['stage9_Series006_t121_z000_ch00.tif',
 'stage9_Series006_t52_z000_ch00.tif',
 'stage9_Series006_t118_z000_ch00.tif',
 'stage9_Series006_t45_z000_ch00.tif',
 'stage9_Series006_t78_z000_ch00.tif',
 'stage9_Series006_t41_z000_ch00.tif',
 'stage9_Series006_t56_z000_ch00.tif',
 'stage9_Series006_t88_z000_ch00.tif',
 'stage9_Series006_t35_z000_ch00.tif',
 'stage9_Series006_t22_z000_ch00.tif']

The os.listdir() function lists all of the files in a directory (like the ls function) and stores them as a list. We see we have our metadata files, stage9.txt, a single image, stage9_Image003_ch00.tif, and then a lot of files in a time series (called Series006) that appears to go from frame 00 to frame 124. It is these times series files that we want to remame.

While we have a list of the files, we would like a list of the full path of the files. We can use the os.path.join() method to conveniently do that. It joins the strings of the paths, putting /’s where appropriate.

[4]:
# Get list of all files (full path) we want to rename
rename_list = []
for fname in file_list:
    if '_t' in fname:
        rename_list.append(os.path.join(leica_dir, fname))

# Let's look at the list
rename_list[:10]
[4]:
['data/leica_tiffs/stage9_Series006_t121_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t52_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t118_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t45_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t78_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t41_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t56_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t88_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t35_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t22_z000_ch00.tif']

Now, we have a list of only the files we want to rename. We could have done this more concisely using the glob module, which allows generating lists of files with wild card characters.

[5]:
# String to look for
search_str = os.path.join(leica_dir, '*_t*.tif')

# Get all files that match search_str
rename_list = glob.glob(search_str)

rename_list[:10]
[5]:
['data/leica_tiffs/stage9_Series006_t121_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t52_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t118_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t45_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t78_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t41_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t56_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t88_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t35_z000_ch00.tif',
 'data/leica_tiffs/stage9_Series006_t22_z000_ch00.tif']

Renaming files using the shutil module

We will now rename the files. There are several ways to do this using modules that are part of the standard library. Here, we’ll show how to do it using the shutil module. This module contains several functions for file management.

[6]:
# String to search for (the time stamp)
regex = re.compile('_t\d+_')

# Rename the files to have 8 total digits in the time field
for fname in rename_list:
    # Pull out _t00_ string
    time_str = regex.search(fname).group()

    # Make a new, 8-digit string
    new_time_str = '_t{0:08d}_'.format(int(time_str[2:-1]))

    # Make a new file name; could do: new_fname = regex.sub(new_time_str, fname)
    new_fname = fname.replace(time_str, new_time_str)

    # Make a call to shutil.move to rename files
    shutil.move(fname, new_fname)

You can look in your directory now to see all of your beautifully renamed files!

The subprocess module

The subprocess module of the standard library is useful for running commands on the command line. This is often used in bioinformatics pipelines, for example, when you need to run various programs from the command line to process data sets. We don’t really need it for the bootcamp, but I bring it up here because it may come in handy for you down the road. If we wanted to do the same operations as above, except with the subprocess module, we would replace the shutil.move(fname, new_fname) with

subprocess.call(['mv', fname, new_fname])

The subprocess.call() function allows you to execute command line commands as subprocesses, also allowing parallel processing. The first argument is a comma separated list of the strings you would enter in the command line to do your operation. For example, if fname = 'file1' and new_fname = 'file2', the above is the same as entering

mv file1 file2

on the command line.

As an example of how the subprocess.call() function works, we’ll list the contents of the Leica directory and redirect the output to a file lieca_dir_contents.txt.

[7]:
with open('leica_dir_contents.txt', 'w') as f:
    subprocess.call(['ls', '-l', leica_dir], stdout=f)

Looking at leica_dir_contents.txt, we see that it did what we expected.

[8]:
!head leica_dir_contents.txt
total 29760
-rw-r--r--@ 1 bois  staff    13457 Jun 25 00:42 stage9.txt
-rw-r--r--@ 1 bois  staff  1051510 Jun 25 00:42 stage9_Image003_ch00.tif
-rw-r--r--@ 1 bois  staff   113062 Jun 25 00:42 stage9_Series006_t00000000_z000_ch00.tif
-rw-r--r--@ 1 bois  staff   113062 Jun 25 00:42 stage9_Series006_t00000001_z000_ch00.tif
-rw-r--r--@ 1 bois  staff   113062 Jun 25 00:42 stage9_Series006_t00000002_z000_ch00.tif
-rw-r--r--@ 1 bois  staff   113062 Jun 25 00:42 stage9_Series006_t00000003_z000_ch00.tif
-rw-r--r--@ 1 bois  staff   113062 Jun 25 00:42 stage9_Series006_t00000004_z000_ch00.tif
-rw-r--r--@ 1 bois  staff   113062 Jun 25 00:42 stage9_Series006_t00000005_z000_ch00.tif
-rw-r--r--@ 1 bois  staff   113062 Jun 25 00:42 stage9_Series006_t00000006_z000_ch00.tif

Parsing the metadata

Let’s look at the metadata file that came with our image series.

[9]:
!head data/leica_tiffs/stage9.txt
Leica Microsystems Heidelberg GmbHThis file is intended for read-only purposes changes here will not affect the images.Date:    Thursday, May 30, 2013Time:    17:38File Version:    26000000EXPERIMENT INFORMATIONNumber of Images: 2Type:           Series with 'tif'-files 

You should look at this with whatever text editor your like so you can see all of the contents. It is a text file with lots of information about the images that were acquired. Specifically, we want the information about Scan006. So, we want to scan the file until we get to the line that says

Series Name:    Series006

The entries immediately after that will give the information about this series. First, let’s just read in all of the lines of the file.

[10]:
with open(os.path.join(leica_dir, 'stage9.txt'), 'r') as f:
    lines = f.readlines()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-10-d3612969a3c9> in <module>
      1 with open(os.path.join(leica_dir, 'stage9.txt'), 'r') as f:
----> 2     lines = f.readlines()

~/opt/anaconda3/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 3378: invalid start byte

Yikes! What is going on here? It turns out that these Leica files are not UTF-8 encoded, which means there are strange characters in there that we cannot recognize. The encoding is ISO-8859-15, a pre UTF-8 encoding. Very annoying, but something we commonly encounter with various instruments. We just have to specify the encoding when we open the file to be able to read it.

[11]:
with open(
    os.path.join(leica_dir, "stage9.txt"), "r", encoding="ISO-8859-15"
) as f:
    lines = f.readlines()

Ok, that worked. Let’s take a look at the file lines we pulled out (just the first few so as not to pollute the screen).

[12]:
lines[:10]
[12]:
['Leica Microsystems Heidelberg GmbH\x00\n',
 'This file is intended for read-only purposes changes here will not affect the images.\x00\n',
 'Date:\t Thursday, May 30, 2013\x00\n',
 'Time:\t 17:38\x00\n',
 '\x00\n',
 'File Version:\t 26000000\x00\n',
 '\x00\n',
 'EXPERIMENT INFORMATION\x00\n',
 'Number of Images: 2\x00\n',
 "Type:          \tSeries with 'tif'-files \x00\n"]

You’ll notice the strange \x00\n at the end of each line. This, again, is an old encoding for carriage returns. It it not at all uncommon that you will have to deal with such annoyances when parsing files. They are of no concern to us, since we will just strip them off. While we’re at it, we’ll also split the line up at white spaces using the split() method of strings.

[13]:
for i, line in enumerate(lines):
    lines[i] = line.rstrip('\x00\n').split()

Now, we’ll search through the list of strings until we find the

Series Name:    Series006

line. This just means that we want to find the index that has entry

['Series', 'Name:', 'Series006']

We can use the index() method of lists to do this.

[14]:
# Find index where Series is identified
i_start = lines.index(['Series', 'Name:', 'Series006'])

# Show us!
i_start
[14]:
198

Extracting interpixel distances

Now, we can scan starting that this starting index and extract the voxel sizes.

[15]:
for i in range(i_start, len(lines)):
    if len(lines[i]) > 0:
        if lines[i][0] == 'Voxel-Width':
            physical_size_x = lines[i][2]
        elif lines[i][0] == 'Voxel-Height':
            physical_size_y = lines[i][2]

# Pixel sizes!
print('Interpixel spacing in x (µm):', physical_size_x)
print('Interpixel spacing in y (µm):', physical_size_y)
Interpixel spacing in x (µm): 0.310171
Interpixel spacing in y (µm): 0.310171

Extracting the time points

Now, we want to get the time points for our images. The time stamp records look line this:

Stamp_0:        2013:05:30,16:48:30:515

So, we want to find the stamps, set the time for frame 0 to be zero seconds, and then compute the time difference for all of the other frames. We can so this using the datetime module.

Our strategy is to convert the time stamp string into a datetime.datetime object and the subtract the first time from the others. First, we’ll write a function to convert time strings to a list of numbers.

[16]:
def time_str_to_datetime(time_str):
    """
    Convert date/time string in Leica file to
    datetime.datetime object
    """

    # Split at colons and comma
    splitter = re.compile('[,|:]')

    # Split the time_str
    str_list = splitter.split(time_str)

    # Convert to integers
    for i, num in enumerate(str_list):
        str_list[i] = int(num)

    # Return datetime.datetime object
    return datetime.datetime(*tuple(str_list))

Now that we have this function, we can scan through and find the time stamps.

[17]:
# Scan until we get to the first time stamp
i = i_start
while len(lines[i]) == 0 or lines[i][0] != 'Stamp_0:':
    i += 1

# Extract the time string
t_0 = time_str_to_datetime(lines[i][1])

# Loop through successive frames and get the time points
time_points = [0]
i += 1
while len(lines[i]) > 1 and lines[i][0][:5] == 'Stamp':
    delta_t = time_str_to_datetime(lines[i][1]) - t_0
    time_points.append(delta_t.total_seconds())
    i += 1

Let’s look at our list and see how we did.

[18]:
time_points[:10]
[18]:
[0,
 0.000406,
 0.999797,
 1.000203,
 1.999594,
 2.0,
 2.000406,
 2.999797,
 3.000203,
 3.99961]

Great! We managed to extract the important metadata we could then use in our image analysis. We also now have nicely organized files that appear in alphabetical order.

Conclusions

In this example, we showed how you can use Python to clean up data sets that come off of instrument, as well as extract useful information out of the resulting files. If we have many sets of images like this, we should develop these parsing procedures into a function that will automatically parse any directory of data you present. This can help get you ready for further analysis, with everything in a pre-defined format.

Computing environment

[19]:
%load_ext watermark
%watermark -v -p jupyterlab
CPython 3.7.7
IPython 7.15.0

jupyterlab 2.1.4