Lesson 41: High-level plotting with Altair


[1]:
import pandas as pd

import altair as alt

In this lesson, we introduce another high-level plotting package, Altair. It uses the Vega/Vega-Lite visualization grammars, which are built on d3.js, a fantastic JavaScript library for interactive visualizations. Our major motivation for choosing Altair is that its grammar is clean and consistent. It also integrates well with JupyterLab. Together, these make it a good framework for building graphics.

We have a couple of nice data sets from the previous lessons: the data from the tongue strikes of frogs and the facial matching data from people with sleep deprivation. We will use those here.

[2]:
df = pd.read_csv('data/gfmt_sleep.csv', na_values='*')
df['insomnia'] = df['sci'] <= 16

Our first plot with Altair

Altair uses a declarative paradigm, meaning that you tell Altair what you want as opposed to telling it what to do. For example, let’s say I want to make a scatter plot of confidence when incorrect versus confidence when correct. We might expect some correlation here (people may just be confident in general, whether they are right or wrong), so this seems like something we would like to explore. So, let’s just make the first plot, and I will discuss the syntax. For now, just note that we imported Altair as alt.

[3]:
alt.Chart(df).mark_point().encode(x='confidence when correct',
                                  y='confidence when incorrect')

In looking at the above syntax, remember that what follows each dot (except the first one) is a method of the object created by the preceding step. In this way, the plot was built in steps.

  1. alt.Chart(df) created a Chart object whose underlying data are in the DataFrame df.

  2. .mark_point() specifies that the marks on the chart are points.

  3. .encode(x='confidence when correct', y='confidence when incorrect') says that the positions of the points on the chart are encoded according to the confidence when correct and confidence when incorrect.

This reads very much like English:

“Altair, give me a plot of data from my data frame where the data are plotted as points, the x-values are the confidence when correct, and the y-values are the confidence when incorrect.”

Altair took care of the rest, like specifying plot dimensions, labeling the axes, having their ranges go from 0 to 100, stylizing the grid, etc. These can be adjusted, but at its basic level, this is how Altair makes a plot.

The importance of tidy data frames

Given how Altair interpreted our commands, it may be clear to you now that Altair requires that the data frame you use to build the chart be tidy. The organization of tidy data is really what enables this high-level plotting functionality. Each row is a single observation and each column is a single variable, so an encoding only needs to name a column.
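If your data are not tidy to begin with, you can usually reshape them with Pandas before plotting. Here is a small sketch using the melt() method. The wide table, its column names (subject, control, treated), and its values are all hypothetical and only serve to show the reshaping.

```python
import pandas as pd

# Hypothetical wide table: one row per subject, one column per condition
wide = pd.DataFrame({
    'subject': [1, 2, 3],
    'control': [0.8, 1.1, 0.9],
    'treated': [1.4, 1.9, 1.6],
})

# Melt into tidy form: one row per (subject, condition) observation
tidy = wide.melt(
    id_vars='subject',
    var_name='condition',
    value_name='response')
```

In the tidy version, an encoding like x='condition:N' or y='response:Q' refers to a single column, which is exactly what Altair expects.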

Code style in plot specifications

Specifications of Altair plots involve lots of method chaining and can get unwieldy without a clear style. You can develop your own style, maybe reading Trey Hunner’s blog post again. I like to do the following.

  1. Put the alt.Chart(df on the first line.

  2. The closing parenthesis of the preceding call goes at the start of the next line, one indentation level in.

  3. Any arguments of subsequent functions are given as kwargs two indentation levels in.

Here’s an example for the chart we just created.

[4]:
chart = alt.Chart(df
    ).mark_point(
    ).encode(
        x='confidence when correct',
        y='confidence when incorrect')

chart

If you adhere to a style, it makes your code cleaner and easier to read.

Altair data types

When making a plot, we should specify the type of data we are encountering. For example, the confidence when incorrect consists of floats, so these are quantitative. By contrast, each entry in the gender column is one of only two strings. This column contains nominal data. The sci column consists of SCI scores, which can only take integer values. These data are not quantitative (in the sense of classifying data types), since they are discrete. Unlike the gender column, they do have an ordering, so the sci column contains ordinal data. Altair has a fourth data type, temporal, which is used to describe columns in a data frame that have information about time.

Each data type has a shorthand that can be used in the specification. Here is a summary of the data types and their shorthand, taken largely from Altair’s docs.

Data Type      Shorthand Code   Description
quantitative   Q                continuous, real
ordinal        O                discrete, ordered
nominal        N                discrete, unordered category
temporal       T                time

We can use the shorthand to specify data types by adding, e.g., :Q for quantitative data in an encoding. For example, a more complete specification for our plot is as follows. The output is the same, since Altair already inferred a quantitative data type, but in general you should not count on Altair’s inferences; specify the data type yourself.

[5]:
alt.Chart(df
    ).mark_point(
    ).encode(
        x='confidence when correct:Q',
        y='confidence when incorrect:Q')

Altair marks

To specify how the marks in an Altair chart should appear, we use methods like mark_point() or mark_line(). For example, if we wanted to make a strip plot of the confidence when incorrect values, we could use mark_tick().

[6]:
alt.Chart(df
    ).mark_tick(
    ).encode(x='confidence when incorrect:Q')

There are many marks, and they can be found in Altair’s docs.

Altair encodings

An encoding maps properties of the data to visual properties. In the first plot we made, we mapped the confidence when correct to the x-position of a mark. We call a mapping of data to a visual property an encoding channel. The channel I just described is an x channel. There are many more visual properties besides position of marks. You can find a complete list in the Altair docs. One very commonly used encoding is color.

[7]:
alt.Chart(df
    ).mark_point(
    ).encode(
        x='confidence when correct:Q',
        y='confidence when incorrect:Q',
        color='insomnia:N')

Notice that Altair automatically did the coloring and made a legend for you.

Interactive plotting with Altair

To make a plot interactive, which allows zooming in and out and other actions, you can simply add .interactive() to your chained methods. The interactivity works in JupyterLab, but does not currently work when the Jupyter notebook is exported to HTML, so if you are reading the HTML version of this document, you are only seeing static plots. If you save the plot as HTML (see below), however, the interactivity is retained.

[8]:
alt.Chart(df
    ).mark_point(
    ).encode(
        x='confidence when correct:Q',
        y='confidence when incorrect:Q',
        color='insomnia:N'
    ).interactive()

Of particular use are tooltips, which give pop-up information when you hover over a mark on a chart. For example, we might like to know the gender and percent correct for each data point. We can add this with a tooltip encoding.

[9]:
alt.Chart(df
    ).mark_point(
    ).encode(
        x='confidence when correct:Q',
        y='confidence when incorrect:Q',
        color='insomnia:N',
        tooltip=['gender', 'insomnia', 'percent correct']
    ).interactive()

Programmatically saving Altair charts

After you create your chart, you can save it in a variety of formats. Most commonly you would save it as PNG (for presentations), SVG (for publications in the paper of the past), or HTML (for the paper of the future). To do this, use the save() method of the Chart. It automatically infers the file format from the suffix of the file name you choose to save the chart to. Note that in order to save as SVG, you need to have performed the optional installations of lesson 0.

[10]:
chart = alt.Chart(df
    ).mark_point(
    ).encode(
        x='confidence when correct:Q',
        y='confidence when incorrect:Q',
        color='insomnia:N',
        tooltip=['gender', 'insomnia', 'percent correct']
    ).interactive()

chart.save('confidence_scatter.html')

Computing environment

[11]:
%load_ext watermark
%watermark -v -p pandas,altair,jupyterlab
Python implementation: CPython
Python version       : 3.9.12
IPython version      : 8.3.0

pandas    : 1.4.2
altair    : 4.1.0
jupyterlab: 3.3.2