Lesson 28: Survey of other packages and languages


[1]:
import numpy as np
import numba

In this bootcamp, we have used Python as the language of instruction. Because Python is an extensible language, it allows us to use domain-specific packages. We have used Numpy for numerical computations, SciPy for special functions, statistics, and other scientific applications, Pandas for handling data sets, Bokeh for low-level plotting, HoloViews for high-level plotting, and Panel for dashboards.

There are plenty of other Python-based packages that can be useful in computing in the biological sciences, and hopefully you will write (and share) some of your own for your applications. In this lesson, we will review some other Python packages you may find useful in your work. We will also discuss other languages that you might employ for scientific computing.

Other useful Python packages

There are countless useful Python packages for scientific computing. Here, I am highlighting just a few; in fact, I am highlighting only ones I have come across and used in my own work. There are many, many more high-quality packages out there for various domain-specific applications that I am not covering here.

Data science

Dask

Dask allows for out-of-core computation with large data structures. For example, if your data set is too large to fit in RAM, thereby precluding you from using a Pandas data frame, you can use a Dask data frame, which will handle the out-of-core computing for you, and your data type will look an awful lot like a Pandas data frame. It also handles parallelization of calculations on large data sets.
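
As a rough sketch of how this looks in practice (the file names and column names here are hypothetical), a Dask data frame is built and used much like a Pandas data frame, except that computations are lazy until you ask for the result.

import dask.dataframe as dd

# Lazily build a data frame from many CSV files that together may not fit in RAM
df = dd.read_csv("measurements_*.csv")

# Operations look like Pandas; compute() triggers the actual out-of-core calculation
mean_speed = df.groupby("strain")["speed"].mean().compute()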

xarray

xarray extends the concepts of Pandas data frames to more dimensions. It is convenient for organizing, accessing, and computing with more complex data structures.
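
For a sense of the syntax, here is a minimal sketch of a labeled three-dimensional array (the dimension names and values are made up).

import numpy as np
import xarray as xr

# A made-up 3-D data set: fluorescence intensity indexed by trial, ROI, and time
da = xr.DataArray(
    np.random.rand(3, 4, 5),
    dims=["trial", "roi", "time"],
    coords={"trial": [0, 1, 2], "roi": list("abcd"), "time": np.arange(5)},
)

# Label-based selection and reduction, much like Pandas but in more dimensions
mean_trace = da.sel(roi="a").mean(dim="trial")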

Plotting

We have used Bokeh for plotting. As we saw in our lesson on plotting, the landscape of Python plotting libraries is large. Here, I discuss a few other packages I have used.

Altair

Altair is a very nice plotting package that generates plots using Vega-Lite. It is high level and declarative. The plots are rendered using JavaScript and have some interactivity.
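
A minimal sketch of the declarative style, using a small made-up data frame:

import altair as alt
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 4, 9, 16], "label": list("aabb")})

# Declare the plot: points, with x, y, and color encodings
chart = alt.Chart(df).mark_point().encode(x="x", y="y", color="label")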

Plotly

Plotly is an excellent plotting package that is well-suited for building dashboards and other interactive plots.

Matplotlib

Matplotlib is really the main plotting library for Python. It is the most fully featured and most widely used. It has some high-level functionality, but is primarily a lower level library for building highly customizable graphics.

Seaborn

Seaborn is a high-level statistical plotting package built on top of Matplotlib. I find its grammar clean and accessible; you can quickly make beautiful, informative graphics with it.
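
As a small example, a scatter plot colored by category takes one line, here using one of the example data sets that ships with Seaborn:

import seaborn as sns

df = sns.load_dataset("iris")
sns.scatterplot(data=df, x="sepal_length", y="petal_length", hue="species")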

HoloViews

HoloViews is a powerful high-level plotting library that allows plots to be rendered using Bokeh, Matplotlib, or Plotly. It integrates well with Datashader, which allows plotting of millions and millions of data points.
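
A minimal sketch of its use with the Bokeh backend, plotting made-up points:

import numpy as np
import holoviews as hv

hv.extension("bokeh")  # render plots with Bokeh

# Wrap the data in a HoloViews element and adjust a few display options
points = hv.Points(np.random.randn(1000, 2)).opts(alpha=0.3, size=3)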

Panel

Panel is a useful package that pairs nicely with Bokeh and HoloViews for quickly making dashboards.

Bioinformatics

Bioconda

Bioconda is not a Python package, but is a channel for the conda package manager that has many (7000+) bioinformatics packages. Most of these packages are not available through the default conda channel. This allows use of conda to keep all of your bioinformatics packages installed and organized.

Biopython

Biopython is a widely used package for parsing bioinformatics files of various flavors, managing sequence alignments, etc.
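
For example, parsing a FASTA file (the file name here is hypothetical) takes just a couple of lines:

from Bio import SeqIO

# Iterate over the records of a FASTA file
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(record.id, len(record.seq))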

scikit-bio

scikit-bio has similar functionality to Biopython, but also includes some algorithms, for example for alignment and for building phylogenetic trees.

Image processing

scikit-image

We haven’t covered image processing in the main portion of the bootcamp, but it is discussed in the auxiliary lessons. The main package we use in the bootcamp lessons is scikit-image, which has many classic image processing operations included.
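
As a small sketch (the image file name is hypothetical), a classic blur-and-threshold segmentation looks like this:

import skimage.filters
import skimage.io

im = skimage.io.imread("image.tif")

# Gaussian blur followed by Otsu thresholding, two classic operations
im_blur = skimage.filters.gaussian(im, sigma=2)
thresh = skimage.filters.threshold_otsu(im_blur)
im_binary = im_blur > thresh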

DeepCell

These days, the state-of-the-art image segmentation tools use deep learning methods. DeepCell is developed at Caltech in the Van Valen lab, and is an excellent cell segmentation tool.

Machine learning

Python is widely used in machine learning applications, largely because it so easily wraps compiled code written in C or C++.

scikit-learn

scikit-learn is a widely used machine learning package for Python that does many standard machine learning tasks such as classification, clustering, dimensionality reduction, etc.
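
A minimal sketch of the API, clustering made-up two-dimensional data with k-means:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)  # made-up data: 100 points in two dimensions

# Fit three clusters and pull out the cluster label of each point
kmeans = KMeans(n_clusters=3).fit(X)
labels = kmeans.labels_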

TensorFlow

TensorFlow is an extensive library for computation in machine learning developed by Google. It is especially effective for deep learning. It has a Python API.

Keras

In practice, you might rarely use TensorFlow’s core functionality, but rather use Keras to build deep learning models. Keras has an intuitive API and allows you to rapidly get up and running with deep learning.
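
As a rough sketch of what a Keras model looks like (the layer sizes and input dimension here are arbitrary):

from tensorflow import keras

# A tiny fully connected network for a hypothetical three-class problem
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(X, y, epochs=10) would then train the model on data X, y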

PyTorch

PyTorch is a library similar to TensorFlow.

Statistics

In addition to the scipy.stats package, there are many packages for statistical analysis in the Python ecosystem.

statsmodels

statsmodels has extensive functionality for computing hypothesis tests, kernel density estimation, regression, time series analysis, and much more.
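
For example, an ordinary least squares regression on made-up data takes only a few lines:

import numpy as np
import statsmodels.api as sm

# Made-up linear data with noise
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + np.random.normal(size=len(x))

# Add an intercept term and fit by ordinary least squares
res = sm.OLS(y, sm.add_constant(x)).fit()
print(res.summary())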

PyMC

PyMC is a probabilistic programming package primarily used for performing Markov chain Monte Carlo.
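
A minimal sketch of a PyMC model for made-up Normally distributed measurements (the package is imported as pymc in version 4 and later, and as pymc3 in older versions):

import numpy as np
import pymc as pm

data = np.random.normal(1.0, 0.5, size=50)  # made-up measurements

with pm.Model():
    # Priors on the location and scale
    mu = pm.Normal("mu", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=5)

    # Likelihood of the observed data
    pm.Normal("obs", mu=mu, sigma=sigma, observed=data)

    # Draw posterior samples by MCMC
    trace = pm.sample()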

Stan/PyStan/CmdStanPy

Stan is a probabilistic programming language that uses state-of-the-art algorithms for Markov chain Monte Carlo and Bayesian inference. It is its own language, and you can access Stan models through two Python interfaces, PyStan and CmdStanPy. I prefer to use the latter, which is a much more lightweight interface.

ArviZ

ArviZ is a wonderful package that represents the output of various Bayesian inference packages in a unified format based on xarray. Using ArviZ, you can use whatever MCMC package you like, and your downstream analysis will always use the same syntax.
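
For a quick look at the interface, ArviZ ships with example inference results you can load and summarize:

import arviz as az

# One of ArviZ's bundled example posteriors (the eight schools model)
idata = az.load_arviz_data("centered_eight")

# The same summary and plotting functions work regardless of which MCMC package produced the samples
print(az.summary(idata, var_names=["mu", "tau"]))
az.plot_trace(idata, var_names=["mu", "tau"])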

Numba

Numba is a Python package for just-in-time compilation. The result is often greatly accelerated Python code, even beyond what Numpy can provide. It particularly excels when you have loops in your Python code. As an example, let’s consider taking a one-dimensional random walk. Here is a Python function to do that.

[2]:
def randwalk(n_steps):
    steps = np.array([1, -1])

    position = np.empty(n_steps+1, dtype=np.int64)

    position[0] = 0
    for i in range(n_steps):
        position[i+1] = position[i] + np.random.choice(steps)

    return position

We can use the %timeit magic function to see how long it takes to compute a random walk of 100,000 steps.

[3]:
%timeit randwalk(100000)
1.16 s ± 20.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It took close to one second on my machine to take the walk. We will now decorate the function with a @numba.njit decorator, which tells Numba to compile the function.

[4]:
@numba.njit
def randwalk(n_steps):
    steps = np.array([1, -1])

    position = np.empty(n_steps+1, dtype=np.int64)

    position[0] = 0
    for i in range(n_steps):
        position[i+1] = position[i] + np.random.choice(steps)

    return position

Now, let’s time this one. Before we time it, though, we should run it once to do the compilation, so the compilation time is not included in the timing.

[5]:
randwalk(100000)

%timeit randwalk(100000)
918 µs ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

This is a speedup of over a factor of 1000, simply by adding the decorator!

Of course, there is a more clever way to do the random walk that will be even faster. (Inspect this function below to see how it is doing the same thing as the random walk in the above function. You might want to look up the documentation for the np.cumsum() function.)

[6]:
def randwalk(n_steps):
    return np.concatenate(
        ((0,), np.cumsum(np.random.choice([1, -1], size=n_steps)))
    )

%timeit randwalk(100000)
704 µs ± 23.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In this case, we found a clever way around looping and could greatly speed up the calculation. But in the event that you do not have such an option, Numba can really add speed!

Other languages

We now turn to a survey of other computing languages.

Compiled languages

When you write code in a compiled language, the code you write is first translated, or compiled, into a set of machine instructions that can be run directly by your machine’s CPU. Compiled languages tend to be more verbose, often requiring explicit instructions about how to allocate and free memory. They tend to be lower level; you need more lines of code to do the same task compared to dynamic languages like Python, in which the code is interpreted (translated) into machine code as you run it, and the interpreter handles much of the memory allocation and deallocation automatically behind the scenes.

While it often takes longer to develop code in compiled languages, they typically have much better speed because they require less overhead, and you are in a sense closer to the CPU. Pretty much any major numerical calculation is done with compiled code. Numpy is in many ways a Python wrapper around highly optimized compiled C libraries.

Fortran

Fortran was one of the first compiled languages, first developed in the mid-1950s. It has been actively developed for decades and has very good performance. Furthermore, huge Fortran code bases exist that have been reliably used and tested for decades. For this reason, Fortran is still widely used, particularly in physics, astronomy, and atmospheric science.

C

C (along with C++) is probably the most widely used compiled language across the sciences and elsewhere. In fact, the Python interpreter that we have been using this whole bootcamp is written in C.

C++

C++ is very much like C, except it is more feature rich, enabling object-oriented programming. Many bioinformatics algorithms are written in C++, though many also provide high-level interfaces in interpreted languages like R or Python.

Java

Nearly as widely used as C and C++, Java is a more modern compiled language. Unlike Fortran and C/C++, Java is compiled into bytecode, which is like machine code, but more portable. The bytecode is just-in-time compiled into machine code at runtime. Java is used in many bioinformatics applications.

Dynamic languages

Python

I think we’ve covered this one.

Ruby

Ruby is a high-level interpreted language that has fairly widespread use, particularly in web applications. For example, Jekyll, which is written in Ruby, allows for rapid design of beautiful websites.

JavaScript

JavaScript is a core language for the web. Importantly, it allows for dynamic features in browser-based applications. Because of its central importance in this regard, major companies have spent substantial resources in developing very effective just-in-time compilers for it. Browsers have become highly optimized for running JavaScript code, resulting in excellent performance. In recent times, JavaScript has been adopted as a programming language for more substantial computation, including in the sciences. Due to the ability to create rich interactive graphics, it has also been adapted for use in data science.

As we saw in the lesson on making stand-alone Bokeh apps with JavaScript, knowing a little JavaScript can greatly enhance the ways in which you can explore and share your research.

Domain-specific languages

Python is a general purpose language. It is used for all sorts of applications, in and outside of science, math, and engineering. There are several languages that are specifically designed for applications in science.

Matlab

Matlab was originally developed in the late 1970s by Caltech alumnus Cleve Moler as a way for his students to explore numerical linear algebra. It began as a convenient wrapper around Fortran routines in LINPACK.

As it did at its inception, Matlab excels in linear algebra applications. It has since expanded to include many other applications. It has widespread use in the biological sciences in image processing, and is also used to control instrumentation.

Matlab is proprietary and expensive (well, everything is expensive compared to the free software we’ve been using). This is a problem for research applications, because it both restricts access to the underlying algorithms and prices other researchers out of using it.

Mathematica

Mathematica is another piece of proprietary scientific and numerical software originally written by a Caltech alumnus, this time Stephen Wolfram. Technically, Mathematica is not a language; the language is called Wolfram. Its use is less widespread in biology, but it is widely used across the sciences. It is also not open source and is expensive.

R

R is a language designed for statistics, data science, and statistical graphics. It is highly extensible, and thousands of packages are available. Prominent among these are the packages in the tidyverse which allow for efficient and elegant manipulation of data frames and high level plotting via the excellent ggplot2 package. R has widespread use in bioinformatics and is a very effective language in these contexts.

Julia

Julia is a newer language specifically built for scientific computing. The developers of Julia put together a wish list for what they would want in a scientific programming language, and then built their language accordingly. Some of its features that I think are very valuable are:

  • It is free and open source.

  • It has a built-in package manager.

  • It has a large and rapidly growing set of well-developed packages; it is easily extensible.

  • You can call Python functions from Julia and vice versa.

  • Its syntax is intuitive and quite similar to Python’s.

  • Everything is just-in-time compiled, so it is blazingly fast.

In terms of performance, Julia is really fast, which is a big bonus. In contrast to R, which is really focused on statistics, Julia is a more general language for scientific computing (though it is not designed for applications outside of science and numerics, like Python is). It is strong in statistics and visualization (it has data frames, random number generation, and all those goodies available in packages). If there is another language besides Python that I could see offering the bootcamp in, it would be Julia.

Language wars are counterproductive

I have chosen to offer this bootcamp in Python for several reasons.

  1. Python has a shallow learning curve, making it good for beginners.

  2. Despite the shallow learning curve, Python and the available packages are extremely powerful and widely used.

  3. Python-based tools are often very good.

In considering point 3, it is important to note that the Python-based tools are seldom the best for a particular task. If you are solving differential equations, Julia probably has a better tool. For many statistical analyses, R probably offers a better tool. But the Python-based tool for any of these applications both exists and is quite good. So, my hope is that the bootcamp has given you a Swiss Army knife in Python and its ecosystem. You have tools available to effectively tackle most computational scientific problems you will encounter.

If you choose to explore other languages or packages, it is important to choose the package that is right for you and your application. As you bounce around the internet, especially on social media, you will hear a lot of noise from people saying their language is the best and some other language “sucks.” I find these arguments counterproductive and not even worth reading. Rather, search for principled discussions of various tools. Inform yourself with the most informative voices, not the loudest ones.

Computing environment

[7]:
%load_ext watermark
%watermark -v -p numpy,numba,jupyterlab
Python implementation: CPython
Python version       : 3.9.12
IPython version      : 8.3.0

numpy     : 1.21.5
numba     : 0.55.1
jupyterlab: 3.3.2