Lesson 35: Introduction to scripting
[1]:
import datetime
import glob
import os
import re
import shutil
import subprocess
Scripting is a way to automate your tasks. =For example, let’s say you are a lawyer and you need to redact a bunch of text documents. =Two possible ways to do this are:
Open each document on your computer. Search for the text “Jeffrey Lebowski.” Wherever you find it, change it to “xxxxxxx.” Save the document.
Tell your assistant to open each document on your computer. Search for the text “Jeffrey Lebowski.” Wherever he finds it, change it to “xxxxxxx.” Save the document.
I suspect that most of you would choose option 2. Scripting is basically doing option 2, except your assistant is your computer.
Because of its simple syntax and modules to parse text and communicate with the operating system in its standard library, Python is a good scripting language. In this tutorial, we will write some Python scripts to do tasks we might encounter in biology. (We already saw some examples of some of the text processing in previous lessons.)
Our example scripting problem
As tends to be our philosophy, we will learn some scripting procedures by example. In the example we consider, we will parse a directory of images coming from a Leica SP2 confocal microscope. These microscopes came out about twenty years ago, and the software is also a bit old. It is very common to use older instruments, especially high end ones like this one, which can cost hundreds of thousands of dollars, in research.
One problem with the old software used for image acquisition on this microscope is that the file names are stored in the following format:
prefix_IMAGENAME_ch00.tif
prefix_SERIESNAME_t00_z000_ch00.tif
Here, prefix is chosen by the user. If we are taking a single image (not a time series or \(z\)-stack), then IMAGENAME
is assigned by the software, since you may have multiple images (or series of images) with the same prefix. After ch
are two digits indicating which channel is being used (which absorbance/emissions wavelengths are being used).
For series of images, the two digits after the t
character indicate the frame number (i.e., time point). The digits after z
indicate the z
position, and the digits after ch
are as in the single image example.
An inherent problem with this naming convention becomes clear when we have more than 100 images in a time series. For example, the 99th, 100th, and 101st images would be
prefix_SERIESNAME_t98_z000_ch00.tif
prefix_SERIESNAME_t99_z000_ch00.tif
prefix_SERIESNAME_t100_z000_ch00.tif
So, if we were to put all of our images in alphabetical order, which is how many image processing software packages read in images, the order would be
prefix_SERIESNAME_t00_z000_ch00.tif
prefix_SERIESNAME_t01_z000_ch00.tif
⋮
prefix_SERIESNAME_t09_z000_ch00.tif
prefix_SERIESNAME_t100_z000_ch00.tif
prefix_SERIESNAME_t101_z000_ch00.tif
⋮
The frames are now out of order! We should pad the frame number in the file name with more zeros.
You can access a sample data set in the ~git/bootcamp/data/leica_tiffs/
folder. This data set contains an actual set of images I obtained from a Leica SP2. For this data set, prefix = stage9
, referring to the stage of development of the cells in the images. Our goals in this lesson are to:
Rename the files so that they can be read cleanly in alphabetical order.
Parse the metadata (found in the file
stage9.txt
) to find the interpixel distance for these images.Parse the metadata to find the time points for each image.
We’ll start with task 1.
Renaming the files
You may remember from our lesson on string formatting that if we want leading zeros, we use the %04d
format string for, e.g., an integer for a total of four digits.
[2]:
print('This digit has {0:04d} digits.'.format(25))
This digit has 0025 digits.
To make sure we don’t run into any problems, we’ll rename the time points to have eight total digits. Our strategy is to get a list of all files in the directory. We’ll then generate a list of names that the files should have. We’ll then instruct the operating system, via the mv
command, to rename the files.
When we rename them, we will have a separate directory to contain the renamed files. Why? Because of this wisdom:
Never delete or modify original data.
Data that comes off of an instrument should never be modified. Rather, you should make copies of it and proceed through an analysis pipeline.
To make a new directory from the command line, we could use mkdir -p data/leica_tiffs_renamed
(the -p
flag means that it will not throw an error if the directory already exists). We can do this in Python using os.makedirs
as follows.
[3]:
new_path = 'data/leica_tiffs_renamed'
os.makedirs(new_path, exist_ok=True)
Getting a list of the directory contents: the os and glob modules
The os
module from the standard library has lots of great tools for working with files and talking to the OS. We’ll generate a list of files in our Leica directory, demonstrating some of the useful features of the os
module.
[4]:
# Specify Leica directory
leica_dir = 'data/leica_tiffs'
# Get a list of all files in the directory
file_list = os.listdir(leica_dir)
# Look at the list (just first few entries)
file_list[:10]
[4]:
['stage9_Series006_t121_z000_ch00.tif',
'stage9_Series006_t52_z000_ch00.tif',
'stage9_Series006_t118_z000_ch00.tif',
'stage9_Series006_t45_z000_ch00.tif',
'stage9_Series006_t78_z000_ch00.tif',
'stage9_Series006_t41_z000_ch00.tif',
'stage9_Series006_t56_z000_ch00.tif',
'stage9_Series006_t88_z000_ch00.tif',
'stage9_Series006_t35_z000_ch00.tif',
'stage9_Series006_t22_z000_ch00.tif']
The os.listdir()
function lists all of the files in a directory (like the ls
function) and stores them as a list. We see we have our metadata files, stage9.txt
, a single image, stage9_Image003_ch00.tif
, and then a lot of files in a time series (called Series006
) that appears to go from frame 00
to frame 124
. It is these times series files that we want to remame.
While we have a list of the files, we would like a list of the full path of the files. We can use the os.path.join()
method to conveniently do that. It joins the strings of the paths, putting /
’s where appropriate.
[5]:
# Get list of all files (full path) we want to rename
rename_list = []
for fname in file_list:
if '_t' in fname:
rename_list.append(os.path.join(leica_dir, fname))
# Let's look at the list
rename_list[:10]
[5]:
['data/leica_tiffs/stage9_Series006_t121_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t52_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t118_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t45_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t78_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t41_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t56_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t88_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t35_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t22_z000_ch00.tif']
Now, we have a list of only the files we want to rename. We could have done this more concisely using the glob
module, which allows generating lists of files with wild card characters.
[6]:
# String to look for
search_str = os.path.join(leica_dir, '*_t*.tif')
# Get all files that match search_str
rename_list = glob.glob(search_str)
rename_list[:10]
[6]:
['data/leica_tiffs/stage9_Series006_t121_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t52_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t118_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t45_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t78_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t41_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t56_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t88_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t35_z000_ch00.tif',
'data/leica_tiffs/stage9_Series006_t22_z000_ch00.tif']
Renaming files using the shutil module
We will now rename the files. There are several ways to do this using modules that are part of the standard library. Here, we’ll show how to do it using the shutil module. This module contains several functions for file management.
[7]:
# String to search for (the time stamp)
regex = re.compile("_t\d+_")
# Rename the files to have 8 total digits in the time field
for fname in rename_list:
# Pull out _t00_ string
time_str = regex.search(fname).group()
# Make a new, 8-digit string
new_time_str = "_t{0:08d}_".format(int(time_str[2:-1]))
# Make a new file name; could do: new_fname = regex.sub(new_time_str, fname)
new_fname = fname.replace(time_str, new_time_str).replace(
"leica_tiffs", "leica_tiffs_renamed"
)
# Make a call to shutil.move to rename files
shutil.copyfile(fname, new_fname)
You can look in your directory now to see all of your beautifully renamed files!
We should also copy the metadata over for convenience. It is generally a good idea to keep the metadata together with the data sets, images or otherwise, that you will be analyzing.
[8]:
shutil.copyfile(
'data/leica_tiffs/stage9.txt', 'data/leica_tiffs_renamed/stage9.txt'
);
The subprocess module
The subprocess module of the standard library is useful for running commands on the command line. This is often used in bioinformatics pipelines, for example, when you need to run various programs from the command line to process data sets. We don’t really need it for the bootcamp, but I bring it up here because it may come in handy for you down the road. If we wanted to do the same operations as above, except with the subprocess
module, we would replace the shutil.move(fname, new_fname)
with
subprocess.call(['cp', fname, new_fname])
The subprocess.call()
function allows you to execute command line commands as subprocesses, also allowing parallel processing. The first argument is a comma separated list of the strings you would enter in the command line to do your operation. For example, if fname = 'file1'
and new_fname = 'file2'
, the above is the same as entering
cp file1 file2
on the command line.
As an example of how the subprocess.call()
function works, we’ll list the contents of the Leica directory and redirect the output to a file lieca_dir_contents.txt
.
[9]:
with open('leica_dir_contents.txt', 'w') as f:
subprocess.call(['ls', '-l', leica_dir], stdout=f)
Looking at leica_dir_contents.txt
, we see that it did what we expected.
[10]:
!head leica_dir_contents.txt
total 29640
-rw-r--r--@ 1 bois staff 13457 Jun 6 15:26 stage9.txt
-rw-r--r--@ 1 bois staff 1051510 Jun 6 15:26 stage9_Image003_ch00.tif
-rw-r--r--@ 1 bois staff 113062 Jun 15 21:58 stage9_Series006_t00_z000_ch00.tif
-rw-r--r--@ 1 bois staff 113062 Jun 15 21:58 stage9_Series006_t01_z000_ch00.tif
-rw-r--r--@ 1 bois staff 113062 Jun 15 21:58 stage9_Series006_t02_z000_ch00.tif
-rw-r--r--@ 1 bois staff 113062 Jun 15 21:58 stage9_Series006_t03_z000_ch00.tif
-rw-r--r--@ 1 bois staff 113062 Jun 15 21:58 stage9_Series006_t04_z000_ch00.tif
-rw-r--r--@ 1 bois staff 113062 Jun 15 21:58 stage9_Series006_t05_z000_ch00.tif
-rw-r--r--@ 1 bois staff 113062 Jun 15 21:58 stage9_Series006_t06_z000_ch00.tif
Parsing the metadata
Let’s look at the metadata file that came with our image series.
[11]:
!head data/leica_tiffs/stage9.txt
Leica Microsystems Heidelberg GmbHThis file is intended for read-only purposes changes here will not affect the images.Date: Thursday, May 30, 2013Time: 17:38File Version: 26000000EXPERIMENT INFORMATIONNumber of Images: 2Type: Series with 'tif'-files
You should look at this with whatever text editor your like so you can see all of the contents. It is a text file with lots of information about the images that were acquired. Specifically, we want the information about Scan006
. So, we want to scan the file until we get to the line that says
Series Name: Series006
The entries immediately after that will give the information about this series. First, let’s just read in all of the lines of the file.
[12]:
with open(os.path.join(leica_dir, 'stage9.txt'), 'r') as f:
lines = f.readlines()
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-12-d3612969a3c9> in <module>
1 with open(os.path.join(leica_dir, 'stage9.txt'), 'r') as f:
----> 2 lines = f.readlines()
~/opt/anaconda3/lib/python3.8/codecs.py in decode(self, input, final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 3378: invalid start byte
Yikes! What is going on here? It turns out that these Leica files are not UTF-8 encoded, which means there are strange characters in there that we cannot recognize. The encoding is ISO-8859-15, a pre UTF-8 encoding. Very annoying, but something we commonly encounter with various instruments. We just have to specify the encoding when we open the file to be able to read it.
[13]:
with open(
os.path.join(leica_dir, "stage9.txt"), "r", encoding="ISO-8859-15"
) as f:
lines = f.readlines()
Ok, that worked. Let’s take a look at the file lines we pulled out (just the first few so as not to pollute the screen).
[14]:
lines[:10]
[14]:
['Leica Microsystems Heidelberg GmbH\x00\n',
'This file is intended for read-only purposes changes here will not affect the images.\x00\n',
'Date:\t Thursday, May 30, 2013\x00\n',
'Time:\t 17:38\x00\n',
'\x00\n',
'File Version:\t 26000000\x00\n',
'\x00\n',
'EXPERIMENT INFORMATION\x00\n',
'Number of Images: 2\x00\n',
"Type: \tSeries with 'tif'-files \x00\n"]
You’ll notice the strange \x00\n
at the end of each line. This, again, is an old encoding for carriage returns. It it not at all uncommon that you will have to deal with such annoyances when parsing files. They are of no concern to us, since we will just strip them off. While we’re at it, we’ll also split the line up at white spaces using the split()
method of strings.
[15]:
for i, line in enumerate(lines):
lines[i] = line.rstrip('\x00\n').split()
Now, we’ll search through the list of strings until we find the
Series Name: Series006
line. This just means that we want to find the index that has entry
['Series', 'Name:', 'Series006']
We can use the index()
method of lists to do this.
[16]:
# Find index where Series is identified
i_start = lines.index(['Series', 'Name:', 'Series006'])
# Show us!
i_start
[16]:
198
Extracting interpixel distances
Now, we can scan starting that this starting index and extract the voxel sizes.
[17]:
for i in range(i_start, len(lines)):
if len(lines[i]) > 0:
if lines[i][0] == 'Voxel-Width':
physical_size_x = lines[i][2]
elif lines[i][0] == 'Voxel-Height':
physical_size_y = lines[i][2]
# Pixel sizes!
print('Interpixel spacing in x (µm):', physical_size_x)
print('Interpixel spacing in y (µm):', physical_size_y)
Interpixel spacing in x (µm): 0.310171
Interpixel spacing in y (µm): 0.310171
Extracting the time points
Now, we want to get the time points for our images. The time stamp records look line this:
Stamp_0: 2013:05:30,16:48:30:515
So, we want to find the stamps, set the time for frame 0 to be zero seconds, and then compute the time difference for all of the other frames. We can so this using the datetime
module.
Our strategy is to convert the time stamp string into a datetime.datetime
object and the subtract the first time from the others. First, we’ll write a function to convert time strings to a list of numbers.
[18]:
def time_str_to_datetime(time_str):
"""
Convert date/time string in Leica file to
datetime.datetime object
"""
# Split at colons and comma
splitter = re.compile('[,|:]')
# Split the time_str
str_list = splitter.split(time_str)
# Convert to integers
for i, num in enumerate(str_list):
str_list[i] = int(num)
# Return datetime.datetime object
return datetime.datetime(*tuple(str_list))
Now that we have this function, we can scan through and find the time stamps.
[19]:
# Scan until we get to the first time stamp
i = i_start
while len(lines[i]) == 0 or lines[i][0] != 'Stamp_0:':
i += 1
# Extract the time string
t_0 = time_str_to_datetime(lines[i][1])
# Loop through successive frames and get the time points
time_points = [0]
i += 1
while len(lines[i]) > 1 and lines[i][0][:5] == 'Stamp':
delta_t = time_str_to_datetime(lines[i][1]) - t_0
time_points.append(delta_t.total_seconds())
i += 1
Let’s look at our list and see how we did.
[20]:
time_points[:10]
[20]:
[0,
0.000406,
0.999797,
1.000203,
1.999594,
2.0,
2.000406,
2.999797,
3.000203,
3.99961]
Great! We managed to extract the important metadata we could then use in our image analysis. We also now have nicely organized files that appear in alphabetical order.
Conclusions
In this example, we showed how you can use Python to clean up data sets that come off of instrument, as well as extract useful information out of the resulting files. If we have many sets of images like this, we should develop these parsing procedures into a function that will automatically parse any directory of data you present. This can help get you ready for further analysis, with everything in a pre-defined format.
Computing environment
[21]:
%load_ext watermark
%watermark -v -p jupyterlab
Python implementation: CPython
Python version : 3.8.10
IPython version : 7.22.0
jupyterlab: 3.0.14