{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 42: More plotting with Altair\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import altair as alt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "In this lesson, we will learn about some fine-tuning with Altair, and also how to make some important kinds of plots. For this lesson, we will use [the frog tongue data set from Kleinteich and Gorb](https://doi.org/10.1038/srep05225). Let's get the data frame loaded in so we can be on our way." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>date</th><th>ID</th><th>trial number</th><th>impact force (mN)</th><th>impact time (ms)</th><th>impact force / body weight</th><th>adhesive force (mN)</th><th>time frog pulls on target (ms)</th><th>adhesive force / body weight</th><th>adhesive impulse (N-s)</th><th>total contact area (mm2)</th><th>contact area without mucus (mm2)</th><th>contact area with mucus / contact area without mucus</th><th>contact pressure (Pa)</th><th>adhesive strength (Pa)</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>2013_02_26</td><td>I</td><td>3</td><td>1205</td><td>46</td><td>1.95</td><td>-785</td><td>884</td><td>1.27</td><td>-0.290</td><td>387</td><td>70</td><td>0.82</td><td>3117</td><td>-2030</td></tr>\n",
"<tr><th>1</th><td>2013_02_26</td><td>I</td><td>4</td><td>2527</td><td>44</td><td>4.08</td><td>-983</td><td>248</td><td>1.59</td><td>-0.181</td><td>101</td><td>94</td><td>0.07</td><td>24923</td><td>-9695</td></tr>\n",
"<tr><th>2</th><td>2013_03_01</td><td>I</td><td>1</td><td>1745</td><td>34</td><td>2.82</td><td>-850</td><td>211</td><td>1.37</td><td>-0.157</td><td>83</td><td>79</td><td>0.05</td><td>21020</td><td>-10239</td></tr>\n",
"<tr><th>3</th><td>2013_03_01</td><td>I</td><td>2</td><td>1556</td><td>41</td><td>2.51</td><td>-455</td><td>1025</td><td>0.74</td><td>-0.170</td><td>330</td><td>158</td><td>0.52</td><td>4718</td><td>-1381</td></tr>\n",
"<tr><th>4</th><td>2013_03_01</td><td>I</td><td>3</td><td>493</td><td>36</td><td>0.80</td><td>-974</td><td>499</td><td>1.57</td><td>-0.423</td><td>245</td><td>216</td><td>0.12</td><td>2012</td><td>-3975</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " date ID trial number impact force (mN) impact time (ms) \\\n", "0 2013_02_26 I 3 1205 46 \n", "1 2013_02_26 I 4 2527 44 \n", "2 2013_03_01 I 1 1745 34 \n", "3 2013_03_01 I 2 1556 41 \n", "4 2013_03_01 I 3 493 36 \n", "\n", " impact force / body weight adhesive force (mN) \\\n", "0 1.95 -785 \n", "1 4.08 -983 \n", "2 2.82 -850 \n", "3 2.51 -455 \n", "4 0.80 -974 \n", "\n", " time frog pulls on target (ms) adhesive force / body weight \\\n", "0 884 1.27 \n", "1 248 1.59 \n", "2 211 1.37 \n", "3 1025 0.74 \n", "4 499 1.57 \n", "\n", " adhesive impulse (N-s) total contact area (mm2) \\\n", "0 -0.290 387 \n", "1 -0.181 101 \n", "2 -0.157 83 \n", "3 -0.170 330 \n", "4 -0.423 245 \n", "\n", " contact area without mucus (mm2) \\\n", "0 70 \n", "1 94 \n", "2 79 \n", "3 158 \n", "4 216 \n", "\n", " contact area with mucus / contact area without mucus \\\n", "0 0.82 \n", "1 0.07 \n", "2 0.05 \n", "3 0.52 \n", "4 0.12 \n", "\n", " contact pressure (Pa) adhesive strength (Pa) \n", "0 3117 -2030 \n", "1 24923 -9695 \n", "2 21020 -10239 \n", "3 4718 -1381 \n", "4 2012 -3975 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('data/frog_tongue_adhesion.csv', comment='#')\n", "\n", "# Have a look so we remember\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More control without shortcuts\n", "\n", "We'll start by making a scatter plot of adhesive force versus impact force as we did in the previous lesson." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df\n", " ).mark_point(\n", " ).encode(\n", " x='impact force (mN):Q',\n", " y='adhesive force (mN):Q',\n", " color='ID:N',\n", " ).interactive()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we made the plot above, we used shortcuts for the encoding channels. That is, we passed the string `'impact force (mN):Q'` for the `x` channel. The shorthand is really convenient, but if we want more control over the plot, we should use the class associated with a given channel. You can refer again to the [Altair docs](https://altair-viz.github.io/user_guide/encoding.html#encoding-channels) for the classes associated with each encoding channel. The class for the `x` channel is `alt.X`. `alt.X()` can take many kwargs, and you can use these to specify properties about how data are mapped to visual features on the plot. Similarly, `alt.Color()` enables you to set properties about coloring.\n", "\n", "To see how these work, let's make the same plot, except with the x-axis on a logarithmic scale and with the title of the legend changed to \"Frog ID.\"" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df\n", " ).mark_point(\n", " ).encode(\n", " x=alt.X('impact force (mN)',\n", " type='quantitative',\n", " scale=alt.Scale(type='log')),\n", " y=alt.Y('adhesive force (mN)',\n", " type='quantitative'),\n", " color=alt.Color('ID:N',\n", " title='Frog ID')\n", " ).interactive()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A watch-out about column names\n", "\n", "Because of [Vega's specifications](https://vega.github.io/vega-lite/docs/field.html), Altair will not interpret brackets, dots, quotes, or percent signs in field names. That means that if you have a column in a data frame that has one of those characters, you will need to change the name of the column so that it does not have those characters. For example, you might have a column representing the concentration of a chemical, like `[IPTG (mM)]`. In this case, you can use\n", "```python\n", "df = df.rename(columns={'[IPTG (mM)]': 'iptg conc (mM)'})\n", "```\n", "to rename the appropriate column. You can then proceed to use the data frame in Altair, but you will need to be explicit about the axis label, encoding with something like\n", "```python\n", "alt.X('iptg conc (mM)', title='[IPTG (mM)]')\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Top-level properties \n", "\n", "We may also want to set some global, or top-level, properties of the plot, such as its height and width (which are specified in units of pixels), label font size and weight, etc. These things can be adjusted using the `configure_*()` methods (see [Altair docs](https://altair-viz.github.io/user_guide/configuration.html)). Again, let's learn by example." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df\n", " ).mark_point(\n", " ).encode(\n", " x='impact force (mN):Q',\n", " y='adhesive force (mN):Q',\n", " color=alt.Color('ID:N',\n", " title='Frog ID')\n", " ).configure_view(\n", " height=200,\n", " width=400\n", " ).configure_axis(\n", " titleFontSize=16,\n", " titleFontWeight='normal'\n", " ).interactive()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Altair documentation is very good, and you can usually find the configuration you need there." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plots with categorical variables\n", "\n", "We may be interested in a single type of measurement, say impact force, for each frog. Here, we have a quantitative axis, the impact force, and a **categorical axis**, the frog ID. We do not really make scatter plots with this kind of data. We will now explore a few plotting options for such data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Strip plots\n", "\n", "We could plot these using a **strip plot**. Here, each measurement is represented by a tick." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df\n", " ).mark_tick(\n", " ).encode(\n", " x='impact force (mN):Q',\n", " y=alt.Y('ID:N', title='Frog ID')\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a good plot to make since you are **plotting all of your data**, but it does have the problem that you cannot tell if multiple ticks overlap. Let's look at some alternatives." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Histograms\n", "\n", "Histograms are a popular way of displaying repeated measurements. We might want to make a histogram of the impact forces and stack the bars of the histogram so we can see which frogs contributed which portion of the counts of impact forces." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df\n", " ).mark_bar(\n", " ).encode(\n", " x=alt.X('impact force (mN):Q', bin=True, title='impact force (mN)'),\n", " y=alt.Y('count()', title='count'),\n", " color=alt.Color('ID', title='Frog ID')\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is informative, as we see that frog I contributes bigger impacts, and also that the adolescent frogs (frogs III and IV) do not strike with a force greater than one newton." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Better alternatives for histograms\n", "\n", "Histograms suffer from **binning bias**. By binning the data, you are not plotting all of them. In general, if you can, **plot all of your data**. For that reason, I prefer not to use histograms, but rather to use ECDFs or jitter plots, which enable plotting of all data. As I mentioned before, if you do want to plot summary statistics, box plots are a reasonable alternative." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Jitter plots\n", "\n", "In the strip plot, we plotted all of our data. We can make a similar plot with points instead of ticks. We can control the **opacity** of the marks to help us visualize overlap. This time, we will put the categorical variable (frog ID) on the x-axis." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df\n", " ).mark_point(\n", " opacity=0.3\n", " ).encode(\n", " x=alt.X('ID:N', title='Frog ID', axis=alt.Axis(labelAngle=0)),\n", " y='impact force (mN):Q'\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is nicer, but we would like to visualize the points more clearly. So, instead of having the points all in one line for each frog, we can **jitter** the points in the x-direction by adding some random noise to their x-positions. In the grammar of Altair (which is the grammar of Vega/Vega-Lite), this jittering effect is a transform on the data, since we are still plotting against a categorical axis. That is, it is against the grammar to specify x and y positions of each point and then plot them while labeling the axis with a categorical variable like the frog ID. Rather, the jitter is a **transform**, which is part of the specification of the map of the data to its visual representation.\n", "\n", "Implementing jittered strip plots in Altair is a bit hacky. We specify that the x-values are actually quantitative and then use faceting to group the categories together. We then place the facets close to each other. We have to explicitly apply the jitter transform; here we use the Box-Muller transform, which converts uniform random numbers into normally distributed ones and is commonly applied for this purpose." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df, width=40\n", " ).mark_point(\n", " ).encode(\n", " x=alt.X(\n", " 'jitter:Q', \n", " title=None, \n", " axis=alt.Axis(values=[0], ticks=True, grid=False, labels=False), \n", " scale=alt.Scale()\n", " ),\n", " y='impact force (mN):Q',\n", " color=alt.Color('ID:N', legend=None),\n", " column=alt.Column(\n", " 'ID:N',\n", " header=alt.Header(\n", " titleOrient='top',\n", " labelOrient='bottom',\n", " labelPadding=3,\n", " )\n", " )\n", " ).transform_calculate(\n", " # Jitter with Box-Muller transform\n", " jitter='sqrt(-2*log(random()))*cos(2*PI*random())'\n", " ).configure_facet(\n", " spacing=0\n", " ).configure_view(\n", " stroke=None\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Box plots\n", "\n", "Altair allows for construction of box plots as well." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df\n", " ).mark_boxplot(\n", " ).encode(\n", " x='impact force (mN):Q',\n", " y=alt.Y('ID:N', title=None),\n", " color=alt.Color('ID:N', legend=None),\n", " stroke=alt.Color('ID:N', legend=None),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Box plots provide a summary of the data. The line in the middle of the box is the median, and the edges of the box are at the 25th and 75th percentiles. The distance between the 25th and 75th percentiles is called the **interquartile range**, or IQR. The whiskers of the box plot extend a distance equal to 1.5 times the interquartile range beyond the box, or to the extent of the data, whichever is less extreme. If data points are more extreme, they are shown individually, and are often referred to as outliers.\n", "\n", "A box plot can be a useful visualization if you have many data points and it is difficult to plot them all. I rarely find that there are situations where all data cannot be plotted, either with jitter plots or ECDFs. Nonetheless, I do not find box plots too objectionable, as they effectively display important nonparametric summary statistics of your data set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bar charts\n", "\n", "At this point, you may be asking if you can make bar graphs. This is a common type of plot in the biological literature. We can make one with Altair. But before I even begin, I will give you the following piece of advice: *Don't make bar graphs.* More on that in a moment. For now, here is a bar graph." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df\n", " ).mark_bar(\n", " ).encode(\n", " x='mean(impact force (mN)):Q',\n", " y=alt.Y('ID:N', title='Frog ID'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that we were able to cleverly put a `mean` function into the string specifying the `x` encoding channel. There are several functions we can use there. However, I do not advise doing this. Rather, use Pandas to get yourself a `DataFrame` with whatever summary statistics you want to use, and then pass that to Altair. This enables you to have more explicit control over any statistical modeling you do. You should do this in general, not just for bar graphs (which you shouldn't be making anyway). Here is how you can do that, this time including a standard error of the mean calculation." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Make data frame with means and standard error of mean\n", "df_summary = (df.groupby('ID')['impact force (mN)']\n", " .agg(['mean', 'sem'])\n", " .reset_index())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a quick look at this summary DataFrame." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>ID</th><th>mean</th><th>sem</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>I</td><td>1530.20</td><td>140.918782</td></tr>\n",
"<tr><th>1</th><td>II</td><td>707.35</td><td>94.937466</td></tr>\n",
"<tr><th>2</th><td>III</td><td>550.10</td><td>27.788477</td></tr>\n",
"<tr><th>3</th><td>IV</td><td>419.10</td><td>52.517260</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " ID mean sem\n", "0 I 1530.20 140.918782\n", "1 II 707.35 94.937466\n", "2 III 550.10 27.788477\n", "3 IV 419.10 52.517260" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the `sem` column to add error bars." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>ID</th><th>mean</th><th>sem</th><th>error_low</th><th>error_high</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>I</td><td>1530.20</td><td>140.918782</td><td>1253.999187</td><td>1806.400813</td></tr>\n",
"<tr><th>1</th><td>II</td><td>707.35</td><td>94.937466</td><td>521.272566</td><td>893.427434</td></tr>\n",
"<tr><th>2</th><td>III</td><td>550.10</td><td>27.788477</td><td>495.634584</td><td>604.565416</td></tr>\n",
"<tr><th>3</th><td>IV</td><td>419.10</td><td>52.517260</td><td>316.166170</td><td>522.033830</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " ID mean sem error_low error_high\n", "0 I 1530.20 140.918782 1253.999187 1806.400813\n", "1 II 707.35 94.937466 521.272566 893.427434\n", "2 III 550.10 27.788477 495.634584 604.565416\n", "3 IV 419.10 52.517260 316.166170 522.033830" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Add error bar bounds to df_summary (mean +/- 1.96 SEM)\n", "df_summary['error_low'] = df_summary['mean'] - 1.96*df_summary['sem']\n", "df_summary['error_high'] = df_summary['mean'] + 1.96*df_summary['sem']\n", "\n", "# Take another look\n", "df_summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tidy `DataFrame` can now be used to make the bar graph and to overlay the error bars." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Make a bar graph\n", "bars = alt.Chart(df_summary\n", " ).mark_bar(\n", " ).encode(\n", " x=alt.X('mean:Q', title='impact force (mN)'),\n", " y=alt.Y('ID:N', title='Frog ID'))\n", "\n", "# Make the error bars\n", "error_bars = alt.Chart(df_summary\n", " ).mark_rule(\n", " ).encode(\n", " x='error_low:Q',\n", " x2='error_high:Q',\n", " y=alt.Y('ID:N', title='Frog ID'))\n", "\n", "# Overlay\n", "chart = bars + error_bars\n", "\n", "# Thin the bars a bit\n", "chart.configure_scale(bandPaddingInner=0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Did you see that? To overlay plots, we just use the `+` operator in Altair! **So** convenient!\n", "\n", "So, this exercise in bar graphs with error bars just allowed you to learn about overlays in Altair. But I cannot stress this enough: Do not ever make a plot like this. There are so many reasons why. You are not plotting all of your data, and overlaying error bars computed from the standard error of the mean implicitly assumes a statistical model (which is almost always not a good one). **Please**, I implore you, do not make bar graphs with error bars, and, in general...." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Don't make bar graphs\n", "\n", "Bar graphs, especially with error bars, are typically awful. They are pervasive in biology papers. I have yet to find a single example where a bar graph is the best choice. Jitter plots, or even box plots, are more informative and almost always preferred. In fact, ECDFs (those wonderful things I keep mentioning that we will soon get to in upcoming lessons) are often better even than these. Whether you use jitter plots or ECDFs, here is a simple message:\n", "\n", "
\n", "
Don't make bar graphs.
\n", "
\n", "\n", "What should I do instead, you ask? The answer is simple: plot all of your data when you can. If you can't, box plots are always better than bar graphs." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }