Lesson 25: Survey of other packages and languages¶
[1]:
import numpy as np
import numba
import bokeh.plotting
import bokeh.io
import bokeh.layouts
import bokeh.models
bokeh.io.output_notebook()
In this bootcamp, we have used Python as the language of instruction. Because Python is an extensible language, it allows us to use domain-specific packages. We have used Numpy for numerical computations, SciPy for special functions, statistics, and other scientific applications, Pandas for handling data sets, Bokeh for low-level plotting, HoloViews for high-level plotting, and Panel for dashboards.
There are plenty of other Python-based packages that can be useful in computing in the biological sciences, and hopefully you will write (and share) some of your own for your applications. In this lesson, we will review some other Python packages you may find useful in your work. We will also discuss other languages that you might employ for scientific computing.
Other useful Python packages¶
There are countless useful Python packages for scientific computing. Here, I am highlighting just a few; specifically, only ones I have come across and used in my own work. There are many, many more very high quality packages out there for various domain-specific applications that I am not covering here.
Data science¶
Dask¶
Dask allows for out-of-core computation with large data structures. For example, if your data set is too large to fit in RAM, thereby precluding you from using a Pandas data frame, you can use a Dask data frame, which will handle the out-of-core computing for you, and your data type will look an awful lot like a Pandas data frame. It also handles parallelization of calculations on large data sets.
Plotting¶
We have used Bokeh and HoloViews for plotting. As we said in our lesson on plotting, the landscape for Python plotting libraries is large. Here, I discuss a few other packages I have used.
Altair¶
Altair is a very nice plotting package that generates plots using Vega-Lite. It is high level and declarative. The plots are rendered using JavaScript and have some interactivity.
Matplotlib¶
Matplotlib is really the main plotting library for Python. It is the most fully featured and most widely used. It has some high-level functionality, but is primarily a lower level library for building highly customizable graphics.
Bioinformatics¶
Bioconda¶
Bioconda is not a Python package, but is a channel for the conda package manager that has many (7000+) bioinformatics packages. Most of these packages are not available through the default conda channel. This allows use of conda to keep all of your bioinformatics packages installed and organized.
Biopython¶
Biopython is a widely used package for parsing bioinformatics files of various flavors, managing sequence alignments, etc.
scikit-bio¶
scikit-bio has functionality similar to that of Biopython, but also includes some algorithms, for example for alignment and for making phylogenetic trees.
Image processing¶
scikit-image¶
We haven’t covered image processing in the main portion of the bootcamp, but it is discussed in the auxiliary lessons. The main package we use in the bootcamp lessons is scikit-image, which has many classic image processing operations included.
DeepCell¶
These days, the state-of-the-art image segmentation tools use deep learning methods. DeepCell is developed at Caltech in the Van Valen lab, and is an excellent cell segmentation tool.
Machine learning¶
Python is widely used in machine learning applications, largely because it so easily wraps compiled code written in C or C++.
scikit-learn¶
scikit-learn is a widely used machine learning package for Python that does many standard machine learning tasks such as classification, clustering, dimensionality reduction, etc.
TensorFlow¶
TensorFlow is an extensive library for computation in machine learning developed by Google. It is especially effective for deep learning. It has a Python API.
Statistics¶
In addition to the scipy.stats package, there are many packages for statistical analysis in the Python ecosystem.
statsmodels¶
statsmodels has extensive functionality for computing hypothesis tests, kernel density estimation, regression, time series analysis, and much more.
PyMC3¶
PyMC3 is a probabilistic programming package primarily used for performing Markov chain Monte Carlo. It relies on Theano, which is no longer actively developed. PyMC4 will use TensorFlow, but this will result in a new API.
Stan/PyStan/CmdStanPy¶
Stan is a probabilistic programming language that uses state-of-the-art algorithms for Markov chain Monte Carlo and Bayesian inference. It is its own language, and you can access Stan models through two Python interfaces, PyStan and CmdStanPy. I prefer to use the latter, which is a much more lightweight interface.
pySerial¶
pySerial is a useful package for communication with external devices using a serial port. If you are designing your own instruments for research and wish to control them with your computer via Python, you will almost certainly use this package.
Numba¶
Numba is a Python package for just-in-time compilation. The result is often greatly accelerated Python code, even beyond what Numpy can provide. It particularly excels when you have loops in your Python code. As an example, let’s consider taking a one-dimensional random walk. Here is a Python function to do that.
[2]:
def randwalk(n_steps):
    steps = np.array([1, -1])
    position = np.empty(n_steps+1, dtype=np.int64)
    position[0] = 0
    for i in range(n_steps):
        position[i+1] = position[i] + np.random.choice(steps)
    return position
We can use the %timeit magic function to see how long it takes to compute a random walk of 100,000 steps.
[3]:
%timeit randwalk(100000)
782 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It took close to one second on my machine to take the walk. We will now decorate the function with a @numba.njit decorator, which tells Numba to compile the function.
[4]:
@numba.njit
def randwalk(n_steps):
    steps = np.array([1, -1])
    position = np.empty(n_steps+1, dtype=np.int64)
    position[0] = 0
    for i in range(n_steps):
        position[i+1] = position[i] + np.random.choice(steps)
    return position
Now, let’s time this one. Before we time it, though, we should run it once to do the compilation, so the compilation time is not included in the timing.
[5]:
randwalk(100000)
%timeit randwalk(100000)
1.8 ms ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This is a speedup of more than a factor of 400, simply by adding the decorator!
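To see why the warm-up call matters, the sketch below times the first and second calls of a jitted function separately with `time.perf_counter()`. The names `randwalk_jit` and `time_call` are made up for this illustration, and the `try`/`except` fallback is only an assumption so that the block still runs where Numba is not installed (in which case no compilation happens and the two timings will be similar).

```python
import time

import numpy as np

try:
    import numba
    njit = numba.njit
except ImportError:
    # Fallback for illustration only: without Numba, leave the function uncompiled.
    def njit(func):
        return func


@njit
def randwalk_jit(n_steps):
    """Same random walk as above, under a hypothetical name."""
    steps = np.array([1, -1])
    position = np.empty(n_steps + 1, dtype=np.int64)
    position[0] = 0
    for i in range(n_steps):
        position[i + 1] = position[i] + np.random.choice(steps)
    return position


def time_call(n_steps):
    """Return (elapsed seconds, walk) for a single call of randwalk_jit."""
    start = time.perf_counter()
    result = randwalk_jit(n_steps)
    return time.perf_counter() - start, result


# With Numba present, the first call includes the compilation time.
first_time, _ = time_call(10_000)

# The second call reuses the machine code compiled during the first call.
second_time, walk = time_call(10_000)

print(f"first call:  {first_time * 1e3:.2f} ms")
print(f"second call: {second_time * 1e3:.2f} ms")
```

With Numba installed, the first call is typically much slower than the second, which is why the compilation run is kept out of the `%timeit` measurement.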
Of course, there is a more clever way to do the random walk that will be even faster. (Inspect the function below to see how it is doing the same thing as the random walk in the above function. You might want to look up the documentation for the np.cumsum() function.)
[6]:
def randwalk(n_steps):
    return np.concatenate(
        ((0,), np.cumsum(np.random.choice([1, -1], size=n_steps)))
    )

%timeit randwalk(100000)
775 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In this case, we found a clever way around looping and could greatly speed up the calculation. But in the event that you do not have such an option, Numba can really add speed!
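As a small check of why the cumulative sum reproduces the loop, the sketch below applies both approaches to one fixed, hand-picked sequence of steps (an illustrative choice, not data from the lesson) and compares the results.

```python
import numpy as np

# A fixed, hand-picked sequence of steps so the result can be checked by eye.
steps = np.array([1, 1, -1, 1, -1, -1, -1])

# Loop version: each position is the previous position plus the next step.
position_loop = np.empty(len(steps) + 1, dtype=np.int64)
position_loop[0] = 0
for i, step in enumerate(steps):
    position_loop[i + 1] = position_loop[i] + step

# Vectorized version: the running total of the steps, with the starting
# position 0 prepended.
position_cumsum = np.concatenate(((0,), np.cumsum(steps)))

# Both arrays hold the positions 0, 1, 2, 1, 2, 1, 0, -1.
print(position_loop)
print(position_cumsum)
print(np.array_equal(position_loop, position_cumsum))
```

The position after step i is just the sum of the first i steps, which is exactly what `np.cumsum()` computes for every i at once.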
Other languages¶
We now turn to a survey of other computing languages.
Compiled languages¶
When you write code in a compiled language, the code you write is first translated, or compiled, into a set of machine instructions that can be directly run by your machine’s CPU. Compiled languages tend to be more verbose, often requiring more direct instructions about how to allocate and free memory. They tend to be more low-level; you need more lines of code to do the same task compared to dynamic languages like Python, in which the code is interpreted (translated) into machine code as you run, and the interpreter handles a lot of the memory allocation and deallocation automatically behind the scenes.
While it often takes longer to develop code in compiled languages, they typically have much better speed because they require less overhead, and you are in a sense closer to the CPU. Pretty much any major numerical calculation is done with compiled code. Numpy is in many ways a Python wrapper around highly optimized compiled C libraries.
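To make the overhead concrete, here is a minimal sketch that computes a sum of squares twice, once with a pure Python loop and once with a single Numpy call that runs in compiled code. The function names and the array size are arbitrary choices for this illustration, and the standard library `timeit` module stands in for the `%timeit` magic so that the block also runs outside a notebook.

```python
import timeit

import numpy as np

# Integers 0, 1, ..., 99999 as a Numpy array and as a plain Python list.
x = np.arange(100_000, dtype=np.int64)
x_list = x.tolist()


def sum_squares_loop(values):
    """Sum of squares using an interpreted Python loop."""
    total = 0
    for v in values:
        total += v * v
    return total


def sum_squares_numpy(arr):
    """Sum of squares using one Numpy call that runs in compiled code."""
    return int(np.dot(arr, arr))


# Best of three single runs for each version.
t_loop = min(timeit.repeat(lambda: sum_squares_loop(x_list), number=1, repeat=3))
t_numpy = min(timeit.repeat(lambda: sum_squares_numpy(x), number=1, repeat=3))

print(f"Python loop: {t_loop * 1e3:.3f} ms")
print(f"Numpy:       {t_numpy * 1e3:.3f} ms")
print(sum_squares_loop(x_list) == sum_squares_numpy(x))
```

Both versions return the same number; the timings will vary by machine, but the Numpy version is usually far faster because the loop over the elements happens in compiled code rather than in the interpreter.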
Fortran¶
Fortran was one of the first compiled languages, developed in the mid-1950s. It has been actively developed for decades since and has very good performance. Furthermore, huge Fortran code bases exist that have been reliably used and tested for decades. For this reason, Fortran is still widely used, particularly in physics, astronomy, and atmospheric science.
C¶
C (along with C++) is probably the most widely used compiled language across the sciences and elsewhere. In fact, the Python interpreter that we have been using this whole bootcamp is written in C.
C++¶
C++ is very much like C, except it is more feature rich, enabling object-oriented programming. Many bioinformatics algorithms are written in C++, though many also provide high-level interfaces in interpreted languages like R or Python.
Java¶
Nearly as widely used as C and C++, Java is a more modern compiled language. Unlike Fortran and C/C++, Java is compiled into bytecode, which is like machine code, but more portable. The bytecode is just-in-time compiled into machine code at runtime. Java is used in many bioinformatics applications.
JavaScript¶
JavaScript is a core language for the web. Importantly, it allows for dynamic features in browser-based applications. Because of its central importance in this regard, major companies have spent substantial resources in developing very effective just-in-time compilers for it. Browsers have become highly optimized for running JavaScript code, resulting in excellent performance. In recent times, JavaScript has been adopted as a programming language for more substantial computation, including in the sciences. Due to the ability to create rich interactive graphics, it has also been adapted for use in data science.
From what we have learned here in the bootcamp, if you know some JavaScript, you can make really cool interactive graphics that will run natively in the browser without the need for a Python engine running behind them. As an example, below is a Bokeh plot that is interactive in the static HTML version of this notebook.
[7]:
# Make a slider for frequency
freq_slider = bokeh.models.Slider(
    start=1,
    end=10,
    step=0.1,
    value=1,
    title='frequency'
)

# Plot of sine wave
p = bokeh.plotting.figure(
    frame_height=100,
    frame_width=300,
    tools='',
    x_range=[0, 1]
)
x = np.linspace(0, 1, 400)
source = bokeh.models.ColumnDataSource(dict(x=x, y=np.sin(2 * np.pi * x)))
p.line(source=source, x='x', y='y', line_width=2)

# JavaScript code for callback
js_code = """
let f = freq_slider.value;
let x = source.data['x'];
let y = source.data['y'];
for (let i = 0; i < x.length; i++) {
    y[i] = Math.sin(2 * Math.PI * f * x[i]);
}
source.change.emit();
"""

# Make the callback
callback = bokeh.models.CustomJS(args=dict(source=source), code=js_code)

# We use the `js_on_change()` method to call the custom JavaScript code.
callback.args['freq_slider'] = freq_slider
freq_slider.js_on_change("value", callback)

bokeh.io.show(bokeh.layouts.column(freq_slider, p))
Domain-specific languages¶
Python is a general purpose language. It is used for all sorts of applications, in and outside of science, math, and engineering. There are several languages that are specifically designed for applications in science.
Matlab¶
Matlab was originally developed in the late 70’s by Caltech alumnus Cleve Moler as a way for his students to explore numerical linear algebra. It began as a convenient wrapper around Fortran routines in LINPACK.
As it did at its inception, Matlab excels in linear algebra applications. It has since expanded to include many other applications. It has widespread use in the biological sciences in image processing, and is also used to control instrumentation.
Matlab is proprietary and expensive (well, everything is expensive compared to the free software we’ve been using). This is a problem for research applications, because it both limits access to the underlying algorithms and prices other researchers out of using it.
Mathematica¶
Mathematica is another proprietary scientific and numerical software package originally written by a Caltech alumnus, this time Stephen Wolfram. Technically, Mathematica is not a language; the language is called the Wolfram Language. Its use is less widespread in biology, but it is widely used across the sciences. It is also not open source and is expensive.
R¶
R is a language designed for statistics, data science, and statistical graphics. It is highly extensible, and thousands of packages are available. Prominent among these are the packages in the tidyverse which allow for efficient and elegant manipulation of data frames and high level plotting via the excellent ggplot2 package. R has widespread use in bioinformatics and is a very effective language in these contexts.
Julia¶
Julia is a newer language specifically built for scientific computing. The developers of Julia put together a wish list for what they would want in a scientific programming language, and then built their language accordingly. Some of its features that I think are very valuable are:
It is free and open source
It has a built-in package manager
It has a large and rapidly growing set of well-developed packages; it is easily extendable.
You can call Python functions from Julia and vice-versa
Its language is intuitive, quite similar to Python.
Everything is just-in-time compiled. It is therefore blazingly fast.
In terms of performance, Julia is really fast, which is a big bonus. In contrast to R, which is really focused on statistics, Julia is a more general language for scientific computing (though, unlike Python, it is not designed for applications outside of science and numerics). It is strong in statistics and visualization (it has data frames, random number generation, and all those goodies available in packages). If there is another language besides Python that I could see offering the bootcamp in, it would be Julia.
Language wars are counterproductive¶
I have chosen to offer this bootcamp in Python for several reasons.
1. Python has a shallow learning curve; good for beginners.
2. Despite the shallow learning curve, Python and the available packages are extremely powerful and widely used.
3. Python-based tools are often very good.
In considering point 3, it is important to note that the Python-based tools are seldom the best for a particular task. If you are solving differential equations, Julia probably has a better tool. For many statistical analyses, R probably offers a better tool. But the Python-based tool for any of these applications both exists and is quite good. So, my hope is that the bootcamp has given you a Swiss Army knife in Python and its ecosystem. You have tools available to tackle most computational scientific problems you will encounter effectively.
If you choose to explore other languages or packages, it is important to choose the one that is right for you and your application. As you bounce around the internet, especially on social media, you will hear a lot of noise from people saying their language is the best and some other language “sucks.” I find these arguments counterproductive and not even worth reading. Rather, search for principled discussion of various tools. Inform yourself with the most informative voices, not the loudest ones.
Computing environment¶
[8]:
%load_ext watermark
%watermark -v -p numpy,numba,bokeh,jupyterlab
CPython 3.7.7
IPython 7.15.0
numpy 1.18.1
numba 0.49.1
bokeh 2.1.0
jupyterlab 2.1.4