This tutorial was generated from a Jupyter notebook. You can download the notebook here.
# We just need modules from the standard library
import datetime
import glob
import os
import re
import shutil
import subprocess
Scripting is a way to automate your tasks. For example, let's say you are a lawyer and you need to redact a bunch of text documents. Two possible ways to do this are:
I suspect that most of you would choose option 2. Scripting is basically doing option 2, except your assistant is your computer.
Because of its simple syntax and modules to parse text and communicate with the operating system in its standard library, Python is a good scripting language. In this tutorial, we will write some Python scripts to do tasks we might encounter in biology. (We already saw some examples of some of the text processing in previous tutorials.)
As tends to be our philosophy, we will learn some scripting procedures by example. In the example we consider, we will parse a directory of images coming from a Leica SP2 confocal microscope. These microscopes came out about fifteen years ago, and the software is also a bit old. It is very common to use older instruments, especially high end ones like this one, which can cost hundreds of thousands of dollars, in research.
One problem with the old software used for image acquisition on this microscope is that the file names are stored in the following format:
prefix_IMAGENAME_ch00.tif
prefix_SERIESNAME_t00_z000_ch00.tif
Here, prefix is chosed by the user. If we are taking a single image (not a time series of $z$-stack), then IMAGENAME
is assigned by the software, since you may have multiple images (or series of images) with the same prefix. After ch
are two digits indicating which channel is being used (which absorbance/emissions wavelengths are being used).
For series of images, the two digits after the t
character indicate the frame number (i.e., time point). The digits after z
indicate the z
position, and the digits after ch
are as in the single image example.
An inherent problem with this naming convention becomes clear when we have more than 100 images in a time series. For example, the 99th, 100th, and 101st images would be
prefix_SERIESNAME_t98_z000_ch00.tif
prefix_SERIESNAME_t99_z000_ch00.tif
prefix_SERIESNAME_t100_z000_ch00.tif
So, if we were to put all of our images in alphabetical order, which is how many image processing software packages read in images, the order would be
prefix_SERIESNAME_t00_z000_ch00.tif
prefix_SERIESNAME_t01_z000_ch00.tif
⋮
prefix_SERIESNAME_t09_z000_ch00.tif
prefix_SERIESNAME_t100_z000_ch00.tif
prefix_SERIESNAME_t101_z000_ch00.tif
⋮
The frames are now out of order! We should pad the frame number in the file name with more zeros.
We have a sample data set that you can download here. This data set contains an actual set of images I obtained from a Leica SP2. Our goals in this lesson are to: For this data set, prefix = stage9
, refering to the stage of development of the cells in the images.
stage9.txt
) to find the interpixel distance for these images.We'll start with task 1.
You may remember from our string tutorial that if we want leading zeros, we use the %04d
format string for, e.g., an integer for a total of four digits.
print('This digit has {0:04d} digits.'.format(25))
To make sure we don't run into any problems, we'll rename the time points to have eight total digits. Our strategy is to get a list of all files in the directory. We'll then generate a list of names that the files should have. We'll then instruct the operating system, via the mv
command, to rename the files.
Now, before scripting to rename, move, or otherwise process any of your data, always make a backup copy stached somewhere out of the way in case you make a mistake. Let me reiterate that.
os
and glob
modules¶The os
module from the standard library has lots of great tools for working with files and talking to the OS. We'll generate a list of files in our Leica directory, demonstrating some of the useful features of the os
module.
# Specify Leica directory
leica_dir = '../data/leica_tiffs'
# Get a list of all files in the directory
file_list = os.listdir(leica_dir)
# Look at the list
file_list
The os.listdir()
function lists all of the files in a directory (like the ls
function) and stores them as a list. We see we have our metadata files, stage9.txt
, a single image, stage9_Image003_ch00.tif
, and then a lot of files in a time series (called Series006
) that appears to go from frame 00
to frame 124
. It is these times series files that we want to remame.
While we have a list of the files, we would like a list of the full path of the files. We can use the os.path.join()
method to conveniently do that. It joins the strings of the paths, putting /
's where appropriate.
# Get list of all files (full path) we want to rename
rename_list = []
for fname in file_list:
if '_t' in fname:
rename_list.append(os.path.join(leica_dir, fname))
# Let's look at the list
rename_list
Now, we have a list of only the files we want to rename. We could have done this more concisely using the glob
module, which allows generating lists of files with wild card characters. (The name glob
has an archaic history.)
# String to look for
search_str = os.path.join(leica_dir, '*_t*.tif')
# Get all files that match search_str
rename_list = glob.glob(search_str)
rename_list
shutil
module¶We will now rename the files. There are several ways to do this using modules that are part of the standard library. Here, we'll show how to do it using the shutil
module. This module contains several functions for file management.
# String to search for (the time stamp)
regex = re.compile('_t\d+_')
# Rename the files to have 8 total digits in the time field
for fname in rename_list:
# Pull out _t00_ string
time_str = regex.search(fname).group()
# Make a new, 8-digit string
new_time_str = '_t{0:08d}_'.format(int(time_str[2:-1]))
# Make a new file name; could do: new_fname = regex.sub(new_time_str, fname)
new_fname = fname.replace(time_str, new_time_str)
# Make a call to shutil.move to rename files
shutil.move(fname, new_fname)
You can look in your directory now to see all of your beautifully renamed files!
subprocess
module¶The subprocess
module of the standard library is useful for running commands on the command line. This is often used in bioinformatics pipelines, for example, when you need to run various programs from the command line to process data sets. We don't really need it for the bootcamp, but I bring it up here because it may come in handy for you down the road. If we wanted to do the same operations as above, except with the subprocess
module, we would replace the shutil.move(fname, new_fname)
with
subprocess.call(['mv', fname, new_fname])
The subprocess.call()
function allows you to execute command line commands as subprocesses, also allowing parallel processing. The first argument is a comma separated list of the strings you would enter in the command line to do your operation. For example, if fname = 'file1'
and new_fname = 'file2'
, the above is the same as entering
mv file1 file2
on the command line.
As an example of how the subprocess.call()
function works, we'll list the contents of the Leica directory and redirect the output to a file lieca_dir_contents.txt
.
with open('leica_dir_contents.txt', 'w') as f:
subprocess.call(['ls', '-l', leica_dir], stdout=f)
Looking at leica_dir_contents.txt
, we see that it did what we expected.
!head leica_dir_contents.txt
Let's look at the metadata file that came with our image series.
!head ../data/leica_tiffs/stage9.txt
You should look at this with whatever text editor your like so you can see all of the contents. It is a text file with lots of information about the images that were acquired. Specifically, we want the information about Scan006
. So, we want to scan the file until we get to the line that says
Series Name: Series006
The entries immediately after that will give the information about this series. First, let's just read in all of the lines of the file.
with open(os.path.join(leica_dir, 'stage9.txt'), 'r') as f:
lines = f.readlines()
Yikes! What is going on here? It turns out that these Leica files are not UTF-8 encoded, which means there are strange characters in there that we cannot recognize. The encoding is ISO-8859-15, a pre UTF-8 encoding. Very annoying. We just have to specify the encoding when we open the file to be able to read it.
with open(os.path.join(leica_dir, 'stage9.txt'), 'r',
encoding='ISO-8859-15') as f:
lines = f.readlines()
Ok, that worked. Let's take a look at the file lines we pulled out.
lines
You'll notice the strange \x00\n
at the end of each line. This, again, is an old encoding for carriage returns. It it not at all uncommon that you will have to deal with such annoyances when parsing files. They are of no concern to us, since we will just strip them off. While we're at it, we'll also split the line up at white spaces useing the split()
method of strings.
for i, line in enumerate(lines):
lines[i] = line.rstrip('\x00\n').split()
Now, we'll search through the list of strings until we find the
Series Name: Series006
line. This just means that we want to find the index that has entry
['Series', 'Name:', 'Series006']
We can use the index()
method of lists to do this.
# Find index where Series is identified
i_start = lines.index(['Series', 'Name:', 'Series006'])
# Show us!
i_start
Now, we can scan starting that this starting index and extract the voxel sizes.
for i in range(i_start, len(lines)):
if len(lines[i]) > 0:
if lines[i][0] == 'Voxel-Width':
physical_size_x = lines[i][2]
elif lines[i][0] == 'Voxel-Height':
physical_size_y = lines[i][2]
# Pixel sizes!
print('Interpixel spacing in x (µm):', physical_size_x)
print('Interpixel spacing in y (µm):', physical_size_y)
Now, we want to get the time points for our images. The time stamp records look line this:
Stamp_0: 2013:05:30,16:48:30:515
So, we want to find the stamps, set the time for frame 0 to be zero seconds, and then compute the time difference for all of the other frames. We learned how to do this with the datetime
module.
Our strategy is to convert the time stamp string into a datetime.datetime
object and the subtract the first time from the others. First, we'll write a function to convert time strings to a list of numbers.
def time_str_to_datetime(time_str):
"""
Convert date/time string in Leica file to
datetime.datetime object
"""
# Split at colons and comma
splitter = re.compile('[,|:]')
# Split the time_str
str_list = splitter.split(time_str)
# Convert to integers
for i, num in enumerate(str_list):
str_list[i] = int(num)
# Return datetime.datetime object
return datetime.datetime(*tuple(str_list))
Now that we have this function, we can scan through and find the time stamps.
# Scan until we get to the first time stamp
i = i_start
while len(lines[i]) == 0 or lines[i][0] != 'Stamp_0:':
i += 1
# Extract the time string
t_0 = time_str_to_datetime(lines[i][1])
# Loop through successive frames and get the time points
time_points = [0]
i += 1
while len(lines[i]) > 1 and lines[i][0][:5] == 'Stamp':
delta_t = time_str_to_datetime(lines[i][1]) - t_0
time_points.append(delta_t.total_seconds())
i += 1
Let's look at our list and see how we did.
time_points
Great! We managed to extract the important metadata, we we could then use in our image analysis. We also now have nicely organized files that appear in alphabetical order.