Lesson 28: Survey of other packages and languages
[1]:
import numpy as np
import numba
In this bootcamp, we have used Python as the language of instruction. Because Python is an extendable language, it allows us to use domain-specific packages. We have used Numpy for numerical computations, SciPy for special functions, statistics, and other scientific applications, Pandas for handling data sets, Bokeh for low-level plotting, HoloViews for high-level plotting, and Panel for dashboards.
There are plenty of other Python-based packages that can be useful in computing in the biological sciences, and hopefully you will write (and share) some of your own for your applications. In this lesson, we will review some other Python packages you may find useful in your work. We will also discuss other languages that you might employ for scientific computing.
Other useful Python packages
There are countless useful Python packages for scientific computing. Here, I am highlighting just a few. Actually, I am highlighting only ones I have come across and used in my own work. There are many, many more very high quality packages out there for various domain-specific applications that I am not covering here.
Data science
Quibbler
As described in its documentation, Quibbler is a toolset for building highly interactive, yet traceable, transparent, and efficient data analysis pipelines. It allows for rapid building of dashboards, importantly allowing for saving of parameter settings you specify interactively. It also allows for asynchronous computing in Jupyter notebooks: when you change the value of a parameter, all cells depending on that parameter are automatically updated.
Polars
Polars is a very powerful package for working with data frames. Under the hood, it is written in Rust and uses Arrow. This enables it to do operations, such as Boolean indexing and split-apply-combine, in a very efficient way, both in terms of memory and computation. It automatically parallelizes calculations where it can, and it need not read an entire CSV file into memory to do so. I also very much like its domain-specific language. For large data sets that give Pandas trouble, Polars is a great way to go. Alternatively, you could use Dask, described below.
Dask
Dask allows for out-of-core computation with large data structures. For example, if your data set is too large to fit in RAM, thereby precluding you from using a Pandas data frame, you can use a Dask data frame, which will handle the out-of-core computing for you, and your data type will look an awful lot like a Pandas data frame. It also handles parallelization of calculations on large data sets.
xarray
xarray extends the concepts of Pandas data frames to more dimensions. It is convenient for organizing, accessing, and computing with more complex data structures.
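As a sketch of the idea, with made-up dimension names, here is a two-dimensional array of measurements indexed by named dimensions (`trial` and `time`) rather than by bare integer axes.

```python
import numpy as np
import xarray as xr

# A 2D array of made-up measurements with named dimensions
da = xr.DataArray(
    np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
    dims=["trial", "time"],
    coords={"trial": [0, 1], "time": [0.0, 0.5, 1.0]},
)

# Average over trials by name, not by remembering that trials are axis 0
print(da.mean(dim="trial").values)  # [2.5 3.5 4.5]

# Select a time point by its coordinate value
print(da.sel(time=0.5).values)  # [2. 5.]
```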
Plotting
We have used Bokeh for plotting. As we say in our lesson on plotting, the landscape for Python plotting libraries is large. Here, I discuss a few other packages I have used.
Vega-Altair
Vega-Altair is a very nice plotting package that generates plots using Vega-Lite. It is high level and declarative. The plots are rendered using JavaScript and have some interactivity.
Plotly
Plotly is an excellent plotting package that is well-suited for building dashboards and other interactive plots.
Matplotlib
Matplotlib is really the main plotting library for Python. It is the most fully featured and most widely used. It has some high-level functionality, but is primarily a lower level library for building highly customizable graphics.
Seaborn
Seaborn is a high-level statistical plotting package built on top of Matplotlib. I find its grammar clean and accessible; you can quickly make beautiful, informative graphics with it.
HoloViews
HoloViews is a powerful high-level plotting library that allows plots to be rendered using Bokeh, Matplotlib, or Plotly. It integrates well with Datashader, which allows plotting of millions and millions of data points.
Panel
Panel is a useful package that pairs nicely with Bokeh and HoloViews for quickly making dashboards.
Bioinformatics
Bioconda
Bioconda is not a Python package, but is a channel for the conda package manager that has many (7000+) bioinformatics packages. Most of these packages are not available through the default conda channel. This allows use of conda to keep all of your bioinformatics packages installed and organized.
Biopython
Biopython is a widely used package for parsing bioinformatics files of various flavors, managing sequence alignments, etc.
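As a small example of the kind of sequence manipulation Biopython offers, here we take the reverse complement of a short, made-up coding sequence and translate it to protein.

```python
from Bio.Seq import Seq

# A short made-up coding sequence
seq = Seq("ATGGCCATTGTAATG")

# Reverse complement and translation to protein
print(seq.reverse_complement())  # CATTACAATGGCCAT
print(seq.translate())           # MAIVM
```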
scikit-bio
scikit-bio has similar functionality to Biopython, but also includes algorithms, for example for alignment and for building phylogenetic trees.
Image processing
scikit-image
We haven’t covered image processing in the main portion of the bootcamp, but it is discussed in the auxiliary lessons. The main package we use in the bootcamp lessons is scikit-image, which has many classic image processing operations included.
napari
napari is an image viewer that allows for interactive viewing, annotation, and analysis of images, particularly large, multidimensional images. As of June 2023, it is still in alpha phase, but is a very promising tool!
DeepCell
These days, the state-of-the-art image segmentation tools use deep learning methods. DeepCell is developed at Caltech in the Van Valen lab, and is an excellent cell segmentation tool.
Machine learning
Python is widely used in machine learning applications, largely because it so easily wraps compiled code written in C or C++.
scikit-learn
scikit-learn is a widely used machine learning package for Python that does many standard machine learning tasks such as classification, clustering, dimensionality reduction, etc.
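As a sketch of scikit-learn's fit-then-inspect style of API, here we cluster two made-up, well-separated blobs of points with k-means.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of points in the plane (made-up data)
rng = np.random.default_rng(3252)
blob_1 = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_2 = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
X = np.concatenate((blob_1, blob_2))

# Fit k-means with two clusters; labels_ gives each point's cluster
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])
```

Nearly every scikit-learn estimator follows this same pattern: instantiate with hyperparameters, call `.fit()` on data, then inspect fitted attributes or call `.predict()` on new data.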
TensorFlow
TensorFlow is an extensive library for computation in machine learning developed by Google. It is especially effective for deep learning. It has a Python API.
Keras
In practice, you might rarely use TensorFlow’s core functionality, but rather use Keras to build deep learning models. Keras has an intuitive API and allows you to rapidly get up and running with deep learning.
PyTorch
PyTorch is a deep learning library similar in scope to TensorFlow. It is very widely used in machine learning research.
Statistics
In addition to the scipy.stats package, there are many packages for statistical analysis in the Python ecosystem.
statsmodels
statsmodels has extensive functionality for computing hypothesis tests, kernel density estimation, regression, time series analysis, and much more.
PyMC
PyMC is a probabilistic programming package primarily used for performing Markov chain Monte Carlo.
Stan/PyStan/CmdStanPy
Stan is a probabilistic programming language that uses state-of-the-art algorithms for Markov chain Monte Carlo and Bayesian inference. It is its own language, and you can access Stan models through two Python interfaces, PyStan and CmdStanPy. I prefer to use the latter, which is a much more lightweight interface.
ArviZ
ArviZ is a wonderful package that converts the output of various Bayesian inference packages into a unified format based on xarray. Using ArviZ, you can use whatever MCMC package you like, and your downstream analysis will always use the same syntax.
Numba
Numba is a Python package for just-in-time compilation. The result is often greatly accelerated Python code, even beyond what Numpy can provide. It particularly excels when you have loops in your Python code. As an example, let’s consider taking a one-dimensional random walk. Here is a Python function to do that.
[2]:
def randwalk(n_steps):
    steps = np.array([1, -1])
    position = np.empty(n_steps + 1, dtype=np.int64)
    position[0] = 0

    for i in range(n_steps):
        position[i + 1] = position[i] + np.random.choice(steps)

    return position
We can use the %timeit magic function to see how long it takes to compute a random walk of 100,000 steps.
[3]:
%timeit randwalk(100000)
567 ms ± 42.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It took about half a second on my machine to take the walk. We will now decorate the function with the @numba.njit decorator, which tells Numba to compile the function.
[4]:
@numba.njit
def randwalk(n_steps):
    steps = np.array([1, -1])
    position = np.empty(n_steps + 1, dtype=np.int64)
    position[0] = 0

    for i in range(n_steps):
        position[i + 1] = position[i] + np.random.choice(steps)

    return position
Now, let’s time this one. Before we time it, though, we should run it once to do the compilation, so the compilation time is not included in the timing.
[5]:
randwalk(100000)
%timeit randwalk(100000)
155 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
This is a speedup of over 1000×, simply by adding the decorator!
Of course, there is a more clever way to do the random walk that will be even faster. (Inspect the function below to see how it does the same thing as the random walk in the above function. You might want to look up the documentation for the np.cumsum() function.)
[6]:
def randwalk(n_steps):
    return np.concatenate(
        ((0,), np.cumsum(np.random.choice([1, -1], size=n_steps)))
    )
%timeit randwalk(100000)
656 µs ± 5.23 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In this case, we found a clever way around looping and could greatly speed up the calculation. But in the event that you do not have such an option, Numba can really add speed!
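To see why the cumulative sum gives the walk, note that np.cumsum() of the ±1 steps is exactly the walker's position after each step:

```python
import numpy as np

# Each entry of the cumulative sum is the position after that step
steps = np.array([1, -1, -1, 1, 1])
print(np.cumsum(steps))  # [ 1  0 -1  0  1]
```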
Other languages
We now turn to a survey of other computing languages.
Compiled languages
When you write code in a compiled language, the code you write is first translated, or compiled, into a set of machine instructions that can be directly run by your machine's CPU. Compiled languages tend to be more verbose, requiring more direct instructions about how to allocate and free memory. They also tend to be lower level; you need more lines of code to do the same task compared to dynamic languages like Python, in which the code is interpreted (translated) into machine code as it runs, and the interpreter handles much of the memory allocation and deallocation automatically behind the scenes.
While it often takes longer to develop code in compiled languages, they typically have much better speed because they require less overhead, and you are in a sense closer to the CPU. Pretty much any major numerical calculation is done with compiled code. Numpy is in many ways a Python wrapper around highly optimized compiled C libraries.
Fortran
Fortran was one of the first compiled languages, first developed in the 1950s. It has been actively developed for decades and has very good performance. Furthermore, huge Fortran code bases exist that have been reliably used and tested for decades. For this reason, Fortran is still widely used, particularly in physics, astronomy, and atmospheric science.
C
C (along with C++) is probably the most widely used compiled language across the sciences and elsewhere. In fact, the Python interpreter that we have been using this whole bootcamp is written in C.
C++
C++ is very much like C, except it is more feature rich, enabling object-oriented programming. Many bioinformatics algorithms are written in C++, though many also provide high-level interfaces in interpreted languages like R or Python.
Java
Nearly as widely used as C and C++, Java is a more modern compiled language. Unlike Fortran and C/C++, Java is compiled into bytecode, which is like machine code, but more portable. The bytecode is just-in-time compiled into machine code at runtime. Java is used in many bioinformatics applications.
Dynamic languages
Python
I think we’ve covered this one.
Ruby
Ruby is a high-level interpreted language that has fairly widespread use, particularly in web applications. In particular, Jekyll allows for rapid design of beautiful websites.
JavaScript
JavaScript is a core language for the web. Importantly, it allows for dynamic features in browser-based applications. Because of its central importance in this regard, major companies have spent substantial resources in developing very effective just-in-time compilers for it. Browsers have become highly optimized for running JavaScript code, resulting in excellent performance. In recent times, JavaScript has been adopted as a programming language for more substantial computation, including in the sciences. Due to the ability to create rich interactive graphics, it has also been adapted for use in data science.
As we saw in the lesson on making stand-alone Bokeh apps with JavaScript, knowing a little JavaScript can greatly enhance the ways in which you can explore and share your research.
Domain-specific languages
Python is a general purpose language. It is used for all sorts of applications, in and outside of science, math, and engineering. There are several languages that are specifically designed for applications in science.
Matlab
Matlab was originally developed in the late 70’s by Caltech alumnus Cleve Moler as a way for his students to explore numerical linear algebra. It began as a convenient wrapper around Fortran routines in LINPACK.
As it did at its inception, Matlab excels in linear algebra applications. It has since expanded to include many other applications. It has widespread use in the biological sciences in image processing, and is also used to control instrumentation.
Matlab is proprietary and expensive (well, everything is expensive compared to the free software we’ve been using). This is a problem for research applications, because it sacrifices both access to the underlying algorithms and prices other researchers out of using it.
Mathematica
Mathematica is another proprietary scientific and numerical software package originally written by a Caltech alumnus, this time Stephen Wolfram. Technically, Mathematica is not a language; the language is called Wolfram. Its use is less widespread in biology, but it is widely used across the sciences. It is also not open source and is expensive.
R
R is a language designed for statistics, data science, and statistical graphics. It is highly extensible, and thousands of packages are available. Prominent among these are the packages in the tidyverse which allow for efficient and elegant manipulation of data frames and high level plotting via the excellent ggplot2 package. R has widespread use in bioinformatics and is a very effective language in these contexts.
Julia
Julia is a newer language specifically built for scientific computing. The developers of Julia put together a wish list for what they would want in a scientific programming language, and then built their language accordingly. Some of its features that I think are very valuable are:
It is free and open source.
It has a built-in package manager.
It has a large and rapidly growing set of well-developed packages; it is easily extendable.
You can call Python functions from Julia and vice versa.
Its syntax is intuitive and quite similar to Python's.
Everything is just-in-time compiled, so it is blazingly fast.
In terms of performance, Julia is really fast, which is a big bonus. In contrast to R, which is really focused on statistics, Julia is a more general language for scientific computing (though it is not designed for applications outside of science and numerics, like Python is). It is strong in statistics and visualization; it has data frames, random number generation, and all those goodies available in packages. If there is another language besides Python that I could see offering the bootcamp in, it would be Julia.
Language wars are counterproductive
I have chosen to offer this bootcamp in Python for several reasons.
1. Python has a shallow learning curve, making it good for beginners.
2. Despite the shallow learning curve, Python and the available packages are extremely powerful and widely used.
3. Python-based tools are often very good.
In considering point 3, it is important to note that Python-based tools are seldom the best for a particular task. If you are solving differential equations, Julia probably has a better tool. For many statistical analyses, R probably offers a better tool. But the Python-based tool for any of these applications both exists and is quite good. So, my hope is that the bootcamp has given you a Swiss Army knife in Python and its ecosystem. You have effective tools available to tackle most computational scientific problems you will encounter.
If you choose to explore other languages or packages, it is important to choose the one that is right for you and your application. As you bounce around the internet, especially on social media, you will hear a lot of noise from people saying that their language is the best and that some other language "sucks." I find these arguments counterproductive and not even worth reading. Rather, search for principled discussions of the various tools. Inform yourself with the most informative voices, not the loudest ones.
Computing environment
[7]:
%load_ext watermark
%watermark -v -p numpy,numba,jupyterlab
Python implementation: CPython
Python version : 3.11.4
IPython version : 8.12.2
numpy : 1.24.3
numba : 0.57.0
jupyterlab: 4.0.5