Lesson 0: Setting up computing resources

In this lesson you will set up a Python computing environment for scientific computing on your own computer and also learn a bit about Google Colab, a cloud service for running Jupyter notebooks.

It is advantageous to learn how to set up a Python distribution and manage packages on your own machine, as each person can have different needs. That said, Google Colab is a nice, free resource to run Jupyter Notebooks on Google’s computers without any local installations necessary.

Setting up Google Colab

In order to use Google Colab, you must have a Google account. You can launch a Colab notebook by simply navigating to https://colab.research.google.com/. Alternatively, you can click the “Launch in Colab” badge at the top right of this page, and you will launch this notebook in Colab. That badge will appear in the top right of all pages in the course content generated from notebooks.

Watchouts when using Colab

If you do run a notebook in Colab, you are doing your computing on one of Google’s computers via a virtual machine. You get two CPU cores and limited RAM (about 12 GB, though it varies). You can also get GPUs and TPUs (Google’s tensor processing units), but we will not use those in this course. These computing resources should be enough for all of our calculations this term (though you will need more computing power in the sequel to this course). However, there are some limitations you should be aware of.

  • If your notebook is idle for too long, you will get disconnected from your notebook. “Idle” means that cells are not being edited or executed. The idle timeout varies depending on the load on Google’s computers; I find that I almost always get disconnected if idle for an hour.

  • Your virtual machine will disconnect if it is being used for too long. It will typically only be available for 12 hours before disconnecting, though times can vary, again based on load.

These limitations are in place so that Google can offer Colab for free. If you want more cores, longer timeouts, etc., you might want to check out Colab Pro. However, the free tier should work well for you in the course. You can of course always run on your own machine, and in fact are encouraged to do so.

There are additional software-specific watchouts when using Colab.

  • Colab does not allow for full functionality of Bokeh apps.

  • Colab instances have specific software installed, so you will need to install anything else you need in your notebook. This is not a major burden, and is discussed in the next section.

I recommend reading the Colab FAQs for more information about Colab.

Software in Colab

When you launch a Google Colab notebook, much of the software we will use in class is already installed. It is not always the latest version of the software, however. In fact, as of July 2024, Colab is running Python 3.10, whereas you will run Python 3.12 on your machine through your Anaconda installation. Nonetheless, most (but not all) of the analyses we do for this class will work just fine in Colab.
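If you are unsure which Python a given session is running, a quick standard-library check works the same way locally and in Colab:

```python
import sys

# Report the interpreter version for the current session.
# In Colab this may lag behind the version in your local conda environment.
major, minor = sys.version_info[:2]
print(f"Running Python {major}.{minor}")
```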

Because Colab instances come with preinstalled software, and nothing more, you will often need to install software before you can run the rest of the code in a notebook. To enable this, when necessary, the first code cell of each notebook in this class will contain the following code (or a variant thereof, depending on what is needed or whether Colab’s default installations change). Running this code will not affect your notebook on your local machine; the same notebook will work both locally and on Colab.

# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars iqplot bebi103 watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

In addition to installing the necessary software on a Colab instance, this code sets the relative path to the data sets we will use in the course. When running in Colab, the data sets are fetched from cloud storage on AWS. When running on your local machine for homework, the path to the data is one directory up from where you are working.
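As a sketch of how `data_path` then gets used (the file name below is hypothetical):

```python
import sys

# Same check as the setup cell: pick the data location based on
# whether we are running inside Colab.
if "google.colab" in sys.modules:
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"

# Hypothetical file name for illustration. Because data_path ends in "/",
# simple concatenation gives either a valid local path or a valid URL.
fname = data_path + "example_data.csv"
print(fname)
```

Loaders such as `polars.read_csv()` can typically read either form, so the same call works in both environments.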

In most notebooks, the Colab and data path setup code cells are hidden in the HTML rendering to avoid clutter, but will be present when you download the notebooks.

Collaborating with Colab

If you want to collaborate with another student or with the course staff on a notebook, you can click “Share” on the top right corner of the Colab window and choose with whom and how (the defaults are fine) you want to share.

Installation on your own machine

We now proceed to discuss installation of the necessary software on your own machine. Before we get into that, there are some preliminaries for Windows users.

Windows users: Install Git and Chrome or Firefox

We will be using JupyterLab in this course. It is browser-based, and Chrome, Firefox, and Safari are supported. Microsoft Edge is not. Therefore, if you are a Windows user, you need to be sure you have either Chrome or Firefox installed.

Git is natively installed on Macs. For Windows users, you need to install Git by itself. You can do this by following the instructions here.

Downloading and installing Anaconda

If you already have Anaconda installed on your machine, you can skip this step.

Downloading and installing Anaconda is simple.

  1. Go to Anaconda’s download page and click the Download button.

  2. Follow the on-screen instructions for installation. While doing so, be sure that Anaconda is installed in your home directory (which is the default), not in root.

That’s it! After you do that, you will have a functioning Python distribution.

Install node.js

node.js is a platform that enables you to run JavaScript outside of the browser. We will not use it directly, but it needs to be installed for some of the more sophisticated JupyterLab functionality. Install node.js by downloading the appropriate installer for your machine here.

Setting up a conda environment

I have created a conda environment for use in this class. You can download the YML specification for the environment (right-click and download):

You can set up and activate the environment on the command line or by using the Anaconda Navigator, which should be installed with Anaconda. You can do either of the two options, (a) or (b), below. I much prefer option (a), using the command line.

a) Activating from the command line

To set up your conda environment from the command line, navigate to the directory where you saved the pol_stats.yml file. Then, on the command line, enter

conda env create -f pol_stats.yml

This should build the environment for you (it may take several minutes). To then activate the environment, enter

conda activate pol_stats

on the command line.

b) Activating using the Anaconda Navigator

If you are using macOS, Anaconda Navigator will be available in your Applications menu. If you are using Windows, you can launch Anaconda Navigator from the Start menu.

When the Navigator window opens, select Environments on the left menu pane. Upon selecting Environments, you will see a pane immediately to the right of the Home/Environments/Learning/Community pane with a Search Environments window at the top. At the bottom of that pane, click Import. In the window that pops up, click on the folder icon under Local drive. Find the pol_stats.yml file you just downloaded. Click Import. It may take some time for the environment to be imported and built.

Stan installation

If you are only taking part 1 of the course, you can proceed to the next section.

For part 2, we will be using Stan for much of our statistical modeling. Stan provides a probabilistic programming language. Programs written in this language, called Stan programs, are translated into C++ by the Stan parser, and the C++ code is then compiled. As you will see throughout the class, there are many advantages to this approach.

There are many interfaces for Stan, including the two most widely used, RStan and PyStan, which are R and Python interfaces, respectively. We will use a simple interface, CmdStanPy, which has several advantages that will become apparent when you start using it.

Whichever interface you use needs to have Stan installed and functional, which means you have to have an installed C++ toolchain. Installation and compilation can be tricky and varies from operating system to operating system. The instructions below are not guaranteed to work; you may have to do some troubleshooting on your own. Note that you can use Google Colab for computing as well, so you do not need to worry if you have trouble installing Stan locally. That said, it will definitely be to your advantage to have a local functioning Stan installation.

Configuring a C++ toolchain for macOS

On macOS, you can install the Xcode command line tools by running the following on the command line.

xcode-select --install

Configuring a C++ toolchain for Windows

According to the CmdStanPy documentation, you can skip this step, though I have previously verified that the steps below worked on a Windows machine.

You need to install a C++ toolchain for Windows. One possibility is to install a MinGW toolchain, and one way to do that is using conda.

conda install libpython m2w64-toolchain -c msys2

When you do this, make sure you are in the pol_stats environment.

Configuring a C++ toolchain for Linux

If you are using Linux, we assume you already have the C++ utilities installed.

Installing Stan with CmdStanPy

If you have a functioning C++ toolchain, you can use CmdStanPy to install Stan/CmdStan. You can do this by running the following at a Python prompt (in Python, IPython, or a Jupyter notebook), again making sure you are in the pol_stats environment.

import cmdstanpy; cmdstanpy.install_cmdstan()

This may take several minutes to run. (I did it on my Raspberry Pi, and it took hours.)

If you are using Windows and you skipped configuration of the C++ toolchain, instead run:

import cmdstanpy; cmdstanpy.install_cmdstan(compiler=True)
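Once the installation finishes, you can sanity-check it from Python. This is a minimal sketch using `cmdstanpy.cmdstan_path()`, which raises an exception when CmdStan cannot be found, so the check is wrapped to report that case rather than crash.

```python
# Quick sanity check of a CmdStan installation.
# cmdstanpy.cmdstan_path() raises if CmdStan cannot be found, so we
# catch that case (and a missing cmdstanpy) instead of crashing.
try:
    import cmdstanpy
    status = f"CmdStan installed at: {cmdstanpy.cmdstan_path()}"
except Exception as err:
    status = f"CmdStan not yet available ({err})"

print(status)
```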

Data sets

You should make sure you have all of the data sets we will use in the workshop downloaded to your machine. You can download all of the data sets from this link. To match the notes for the workshop, I advise the following directory structure.

pol-stats-workshop/
    data/
    lessons/
    exercises/

That way, when you are working on a lesson or exercise, the path to the data directory is always ../data/.
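Here is a quick sketch (with a hypothetical file name) of how that relative path resolves under the layout above:

```python
import posixpath
from pathlib import PurePosixPath

# Working directory when running a lesson notebook, per the layout above
notebook_dir = PurePosixPath("pol-stats-workshop/lessons")

# The relative path used throughout the notes: ../data/
data_file = notebook_dir / ".." / "data" / "example_data.csv"

# Normalizing shows the path resolves to the shared data/ directory
print(posixpath.normpath(str(data_file)))
```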

Launching JupyterLab

You can launch JupyterLab either via the Anaconda Navigator or via your operating system’s terminal program (Terminal on macOS and PowerShell on Windows). If you wish to launch using the latter (which I prefer), skip to the next section.

In the Anaconda Navigator, click Home on the left pane. To the right, you will have a pane from which you can launch JupyterLab. On the top of the right pane, you will see two pulldown menus separated by the word “on.” Be sure you select pol_stats on the right pulldown menu. This ensures that you are using the environment you just set up.

You need to make sure you are using the pol_stats environment whenever you launch JupyterLab during the workshop.

You should see a card for JupyterLab. Do not confuse this with Notebook; you want to launch JupyterLab. Click Launch on the JupyterLab card. This will launch JupyterLab in your default browser.

Launching JupyterLab from the command line

While launching JupyterLab from the Anaconda Navigator is fine, I generally prefer to launch it from the command line on my own machine. If you are on a Mac, open the Terminal program. You can do this by hitting Command + space bar and searching for “terminal.” On Windows, you should launch PowerShell. You can do this by hitting Windows + R and typing “powershell” in the text box.

Once you have a terminal or PowerShell window open, you will have a prompt. At the prompt, type

conda activate pol_stats

This will ensure you are using the pol_stats environment you just created.

You need to make sure you are using the pol_stats environment whenever you launch JupyterLab, so you should do conda activate pol_stats each time you open a terminal.

Now that you have activated the pol_stats environment, you can launch JupyterLab by typing

jupyter lab

on the command line. You will have an instance of JupyterLab running in your default browser. If you want to specify the browser, you can, for example, type

jupyter lab --browser=firefox

on the command line.

It is up to you if you want to launch JupyterLab from the Anaconda Navigator or command line.

Checking your Stan installation

We’ll now run a quick test to make sure things are working properly. We will make a quick plot that requires some of the scientific libraries we will use, including Stan. If you are only taking part 1 of the course, you may skip to the next section.

Use the JupyterLab launcher (you can get a new launcher by clicking on the + icon on the left pane of your JupyterLab window) to launch a notebook. In the first cell (the box next to the [ ]: prompt), paste the code below. To run the code, press Shift+Enter while the cursor is active inside the cell.

It will take several seconds for the model to compile and then sample. In the end, you should see a scatter plot of samples. You might not appreciate it yet, but this is a nifty demonstration of Stan’s power to sample hierarchical models, which is no trivial feat. You will see some warning text, and that is expected.

You should see a plot that looks like the one below. If you do, you have a functioning Python environment for both parts of the course!

You can also test this in Colab (it should work with no problems, though it will take a while because it goes through the Stan installation when run).

[2]:
# Colab setup ------------------
import os, shutil, sys, subprocess, urllib.request
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    from bebi103.stan import install_cmdstan_colab
    install_cmdstan_colab()
else:
    data_path = "../data/"
# ------------------------------

import numpy as np

import bebi103
import cmdstanpy
import arviz as az

import bokeh.plotting
import bokeh.io
bokeh.io.output_notebook()

schools_data = {
    "J": 8,
    "y": [28, 8, -3, 7, -1, 1, 18, 12],
    "sigma": [15, 10, 16, 11, 9, 11, 10, 18],
}

schools_code = """
data {
  int<lower=0> J; // number of schools
  vector[J] y; // estimated treatment effects
  vector<lower=0>[J] sigma; // s.e. of effect estimates
}

parameters {
  real mu;
  real<lower=0> tau;
  vector[J] eta;
}

transformed parameters {
  vector[J] theta = mu + tau * eta;
}

model {
  eta ~ normal(0, 1);
  y ~ normal(theta, sigma);
}
"""

with open("schools_code.stan", "w") as f:
    f.write(schools_code)

sm = cmdstanpy.CmdStanModel(stan_file="schools_code.stan")
samples = sm.sample(data=schools_data, output_dir="./", show_progress=False)
samples = az.from_cmdstanpy(samples)
bebi103.stan.clean_cmdstan()

# Make a plot of samples
p = bokeh.plotting.figure(
    frame_height=250, frame_width=250, x_axis_label="μ", y_axis_label="τ"
)
p.scatter(
    np.ravel(samples.posterior["mu"]),
    np.ravel(samples.posterior["tau"]),
    alpha=0.1
)

bokeh.io.show(p)
01:22:41 - cmdstanpy - INFO - compiling stan file /Users/bois/Dropbox/git/dd-pol-stats/2024/part_1/content/lessons/00/schools_code.stan to exe file /Users/bois/Dropbox/git/dd-pol-stats/2024/part_1/content/lessons/00/schools_code
01:22:55 - cmdstanpy - INFO - compiled model executable: /Users/bois/Dropbox/git/dd-pol-stats/2024/part_1/content/lessons/00/schools_code
01:22:56 - cmdstanpy - INFO - CmdStan start processing
01:22:56 - cmdstanpy - INFO - Chain [1] start processing
01:22:56 - cmdstanpy - INFO - Chain [2] start processing
01:22:56 - cmdstanpy - INFO - Chain [3] start processing
01:22:56 - cmdstanpy - INFO - Chain [4] start processing
01:22:56 - cmdstanpy - INFO - Chain [1] done processing
01:22:56 - cmdstanpy - INFO - Chain [2] done processing
01:22:56 - cmdstanpy - INFO - Chain [3] done processing
01:22:56 - cmdstanpy - INFO - Chain [4] done processing
01:22:56 - cmdstanpy - WARNING - Some chains may have failed to converge.
        Chain 1 had 1 divergent transitions (0.1%)
        Chain 3 had 1 divergent transitions (0.1%)
        Use the "diagnose()" method on the CmdStanMCMC object to see further information.

If you are only taking part 1

If you are only taking part 1 of the course, you will not need Stan. You can test your distribution on the code cell below. Hit Shift+Enter while the cursor is active inside the cell to run it.

[3]:
import numpy as np
import bokeh.io
import bokeh.models
import bokeh.plotting

bokeh.io.output_notebook()

# Generate plotting values
t = np.linspace(0, 2*np.pi, 200)
x = 16 * np.sin(t)**3
y = 13 * np.cos(t) - 5 * np.cos(2*t) - 2 * np.cos(3*t) - np.cos(4*t)

p = bokeh.plotting.figure(height=250, width=275)
p.line(x, y, color='red', line_width=3)
text = bokeh.models.Label(x=0, y=0, text='Physics of Life', text_align='center')
p.add_layout(text)

bokeh.io.show(p)

Computing environment

[4]:
%load_ext watermark
%watermark -v -p numpy,bokeh,cmdstanpy,arviz,bebi103,jupyterlab
print("CmdStan : {0:d}.{1:d}".format(*cmdstanpy.cmdstan_version()))
Python implementation: CPython
Python version       : 3.12.4
IPython version      : 8.25.0

numpy     : 1.26.4
bokeh     : 3.4.1
cmdstanpy : 1.2.4
arviz     : 0.18.0
bebi103   : 0.1.23
jupyterlab: 4.0.13

CmdStan : 2.35