{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 41: High level plotting with Altair\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "import altair as alt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "In this lesson, we introduce another high-level plotting package, [Altair](https://altair-viz.github.io). It uses the [Vega/Vega-Lite](https://vega.github.io) visualization grammars, which interface to [d3.js](https://d3js.org), a fantastic library for interactive visualizations based on JavaScript. Our major motivation for choosing Altair is that its grammar is clean and consistent. It also has good JupyterLab integration. This provides a good framework for building graphics.\n", "\n", "We have a couple nice data sets from the previous lessons, the data from the tongue strikes of frogs and facial matching data from people with sleep deprivation. We will use those here." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('data/gfmt_sleep.csv', na_values='*')\n", "df['insomnia'] = df['sci'] <= 16" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Our first plot with Altair\n", "\n", "Altair uses a [declarative paradigm](https://en.wikipedia.org/wiki/Declarative_programming), meaning that you tell Altair what *you want* as opposed to telling it what *to do*. For example, let's say I want to make a scatter plot of confidence when incorrect versus confidence when correct. We might expect some correlation here (people may just be confident in general, whether they are right or wrong), so this seems like something we would like to explore. So, let's just make the first plot, and I will discuss the syntax. For now, just note that we imported Altair as `alt`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df).mark_point().encode(x='confidence when correct',\n", " y='confidence when incorrect')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In looking at the above syntax, remember that after each dot (except the first one) is a method associated with the object that was created. In this way, the plot was built in steps.\n", "\n", "1. `alt.Chart(df)` created a `Chart` object whose underlying data are in the `DataFrame` `df`.\n", "2. `.mark_point()` specifies that the **marks** on the chart are points.\n", "3. `.encode(x='confidence when correct', y='confidence when incorrect')` says that the positions of the points on the chart are encoded according to the `confidence when correct` and `confidence when incorrect` columns.\n", "\n", "This is very much like English.\n", "> \"Altair, give me a plot of data from my data frame where the data are plotted as points, and the x-values are the confidence when correct and the y-values are the confidence when incorrect.\"\n", "\n", "Altair took care of the rest, like specifying plot dimensions, labeling the axes, having their ranges go from 0 to 100, stylizing the grid, etc. These can be adjusted, but at its basic level, this is how Altair makes a plot." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The importance of tidy data frames\n", "\n", "Given how Altair interpreted our commands, it might be clear to you now that Altair requires that the data frame you use to build the chart be [**tidy**](l17_split_apply_combine.html). The organization of tidy data is really what enables this high-level plotting functionality. Because each row of a tidy data frame is a single observation and each column is a variable, an encoding only needs to name a column." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Code style in plot specifications\n", "\n", "Specifications of Altair plots involve lots of **method chaining** and can get unwieldy without a clear style. You can develop your own style, maybe reading [Trey Hunner's blog post again](http://treyhunner.com/2017/07/craft-your-python-like-poetry/). I like to do the following.\n", "\n", "1. Put the `alt.Chart(df` on the first line.\n", "2. Put the closing parenthesis of each call at the start of the next line, one indentation level in, followed by the next method in the chain.\n", "3. Give any arguments of subsequent methods as keyword arguments, two indentation levels in.\n", "\n", "Here's an example for the chart we just created." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chart = alt.Chart(df\n", " ).mark_point(\n", " ).encode(\n", " x='confidence when correct',\n", " y='confidence when incorrect')\n", "\n", "chart" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you adhere to a style, it makes your code cleaner and easier to read." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Altair data types\n", "\n", "When making a plot, we should specify the type of data we are encountering. For example, the `confidence when incorrect` column consists of `float`s, so these data are **quantitative**. By contrast, each entry in the `gender` column is one of only two strings. This column contains **nominal** data. The `sci` column consists of SCI scores, which can only take integer values. These data are not quantitative (in the sense of classifying data types), since they are discrete. Unlike the `gender` column, they do have an ordering, so the `sci` column contains **ordinal** data. Altair has a fourth data type, **temporal**, which is used to describe columns in a data frame that have information about time.\n", "\n", "Each data type has a shorthand that can be used in the specification. Here is a summary of the data types and their shorthand, taken largely from [Altair's docs](https://altair-viz.github.io/user_guide/encoding.html#data-types).\n", "\n", "|Data Type|Shorthand Code|Description|\n", "|:-- |:-- |:-- |\n", "|quantitative|`Q`|continuous, real|\n", "|ordinal|`O`|discrete, ordered|\n", "|nominal|`N`|discrete, unordered category|\n", "|temporal|`T`|time|\n", "\n", "We can use the shorthand to specify data types by adding, e.g., `:Q` for quantitative data in an encoding. 
For example, a more complete specification for our plot is as follows. The output is the same, since Altair inferred a quantitative data type, but in general you should not count on Altair's inferences; specify the data type yourself." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df\n", " ).mark_point(\n", " ).encode(\n", " x='confidence when correct:Q',\n", " y='confidence when incorrect:Q')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Altair marks\n", "\n", "To specify how the marks in an Altair chart should appear, we use methods like `mark_point()` or `mark_line()`. For example, if we wanted to make a strip plot of the confidence when incorrect values, we could use `mark_tick()`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df\n", " ).mark_tick(\n", " ).encode(x='confidence when incorrect:Q')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many marks, and they can be found in [Altair's docs](https://altair-viz.github.io/user_guide/marks.html#marks)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Altair encodings\n", "\n", "An **encoding** maps properties of the data to visual properties. In the first plot we made, we mapped the confidence when correct to the x-position of a mark. We call a mapping of data to a visual property an **encoding channel**. The channel I just described is an `x` channel. There are many more visual properties besides position of marks. You can find a complete list in the [Altair docs](https://altair-viz.github.io/user_guide/encoding.html#encoding-channels). `color` is a very commonly used encoding." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df\n", " ).mark_point(\n", " ).encode(\n", " x='confidence when correct:Q',\n", " y='confidence when incorrect:Q',\n", " color='insomnia:N')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that Altair automatically did the coloring and made a legend for you." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Interactive plotting with Altair\n", "\n", "To make a plot interactive, which allows zooming in and out and other actions, you can simply add `.interactive()` to your chained methods. The interactivity works in JupyterLab, but does not currently work when the Jupyter notebook is exported to HTML, so if you are reading the HTML version of this document, you are only seeing static plots. If you save the plot itself as HTML (see below), however, the interactivity is retained." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df\n", " ).mark_point(\n", " ).encode(\n", " x='confidence when correct:Q',\n", " y='confidence when incorrect:Q',\n", " color='insomnia:N'\n", " ).interactive()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of particular use are **tooltips**, which give pop-up information when you hover over a mark on a chart. For example, we might like to know the gender, insomnia status, and percent correct for each data point. We can add this with a `tooltip` encoding." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df\n", " ).mark_point(\n", " ).encode(\n", " x='confidence when correct:Q',\n", " y='confidence when incorrect:Q',\n", " color='insomnia:N',\n", " tooltip=['gender', 'insomnia', 'percent correct']\n", " ).interactive()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Programmatically saving Altair charts\n", "\n", "After you create your chart, you can save it in a variety of formats. Most commonly you would save it as PNG (for presentations), SVG (for publications in the paper of the past), or HTML (for the paper of the future). To do this, you can use the `save()` method of the `Chart`. It will automatically infer the file format you want based on the suffix of the file name you choose to save the chart to. Note that in order to save as SVG, you need to have performed the [optional installations of lesson 0](l00_configuring_your_computer.html#Optional-installations)." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "chart = alt.Chart(df\n", " ).mark_point(\n", " ).encode(\n", " x='confidence when correct:Q',\n", " y='confidence when incorrect:Q',\n", " color='insomnia:N',\n", " tooltip=['gender', 'insomnia', 'percent correct']\n", " ).interactive()\n", "\n", "chart.save('confidence_scatter.html')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing environment" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python implementation: CPython\n", "Python version : 3.9.12\n", "IPython version : 8.3.0\n", "\n", "pandas : 1.4.2\n", "altair : 4.1.0\n", "jupyterlab: 3.3.2\n", "\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -v -p pandas,altair,jupyterlab" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }