Lesson 42: More plotting with Vega-Altair
[1]:
import pandas as pd
import altair as alt
In this lesson, we will learn about some fine-tuning with Vega-Altair, and also how to make some important kinds of plots. For this lesson, we will use the frog tongue data set from Kleinteich and Gorb. Let’s get the data frame loaded in so we can be on our way.
[2]:
df = pd.read_csv('data/frog_tongue_adhesion.csv', comment='#')
# Have a look so we remember
df.head()
[2]:
| | date | ID | trial number | impact force (mN) | impact time (ms) | impact force / body weight | adhesive force (mN) | time frog pulls on target (ms) | adhesive force / body weight | adhesive impulse (N-s) | total contact area (mm2) | contact area without mucus (mm2) | contact area with mucus / contact area without mucus | contact pressure (Pa) | adhesive strength (Pa) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013_02_26 | I | 3 | 1205 | 46 | 1.95 | -785 | 884 | 1.27 | -0.290 | 387 | 70 | 0.82 | 3117 | -2030 |
1 | 2013_02_26 | I | 4 | 2527 | 44 | 4.08 | -983 | 248 | 1.59 | -0.181 | 101 | 94 | 0.07 | 24923 | -9695 |
2 | 2013_03_01 | I | 1 | 1745 | 34 | 2.82 | -850 | 211 | 1.37 | -0.157 | 83 | 79 | 0.05 | 21020 | -10239 |
3 | 2013_03_01 | I | 2 | 1556 | 41 | 2.51 | -455 | 1025 | 0.74 | -0.170 | 330 | 158 | 0.52 | 4718 | -1381 |
4 | 2013_03_01 | I | 3 | 493 | 36 | 0.80 | -974 | 499 | 1.57 | -0.423 | 245 | 216 | 0.12 | 2012 | -3975 |
More control without shortcuts
We’ll start by making a scatter plot of adhesive force versus impact force as we did in a previous lesson.
[3]:
alt.Chart(df
).mark_point(
).encode(
x='impact force (mN):Q',
y='adhesive force (mN):Q',
color='ID:N',
).interactive()
[3]:
When we made the plot above, we used shortcuts for the encoding channels. That is, we passed the string 'impact force (mN):Q' for the x channel. The shorthand is really convenient, but if we want more control over the plot, we should use the class associated with a given channel. You can refer again to the Vega-Altair docs for the classes associated with each encoding channel. The class for the x channel is alt.X.
alt.X() can take many kwargs, and you can use these to specify properties about how data are mapped to visual features on the plot. In Altair 5, the same properties can also be set by chaining methods of the same name, such as .scale() and .title(), onto the channel object. Similarly, alt.Color() enables you to set properties about coloring.
To see how these work, let’s make the same plot, except with the x-axis on a logarithmic scale and with the title of the legend being changed to “Frog ID.”
[4]:
alt.Chart(df
).mark_point(
).encode(
x=alt.X('impact force (mN):Q').scale(alt.Scale(type='log')),
y='adhesive force (mN):Q',
color=alt.Color('ID:N').title('Frog ID')
).interactive()
[4]:
A watch-out about column names
Because of Vega’s specifications, Vega-Altair will not correctly interpret brackets, dots, quotes, or percent signs in field names. That means that if a column in your data frame has one of those characters in its name, you will need to rename the column so that it does not. For example, you might have a column representing the concentration of a chemical, like [IPTG (mM)]. In this case, you can use
df = df.rename(columns={'[IPTG (mM)]': 'iptg conc (mM)'})
to rename the appropriate column. You can then proceed to use the data frame in Altair, but you will need to be explicit about the axis label, encoding with something like
alt.X('iptg conc (mM)').title('[IPTG (mM)]')
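If you have many columns, renaming them one at a time gets tedious. Below is a minimal sketch of one way to automate it. The helper clean_column_names() is hypothetical (it is not part of Pandas or Altair), the data frame is made up, and the set of characters it strips is simply the list given above.

```python
import re

import pandas as pd

# Characters listed above as problematic in field names:
# brackets, dots, quotes, and percent signs
PROBLEM_CHARS = re.compile(r"[\[\].'\"%]")


def clean_column_names(df):
    """Hypothetical helper: return a renamed copy of df and a dict
    mapping each new column name to its original name."""
    rename = {
        col: PROBLEM_CHARS.sub("", col)
        for col in df.columns
        if PROBLEM_CHARS.search(col)
    }
    titles = {new: old for old, new in rename.items()}
    return df.rename(columns=rename), titles


# Made-up data frame with two problematic column names
df_demo = pd.DataFrame(
    {
        "[IPTG (mM)]": [0.0, 0.1, 1.0],
        "frac. induced": [0.01, 0.40, 0.95],
        "OD600": [0.30, 0.31, 0.29],
    }
)

df_clean, titles = clean_column_names(df_demo)
print(list(df_clean.columns))
print(titles)
```

The titles dictionary keeps the original names around, so you can still pass them to .title() when you build the chart.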
Top-level properties
We may also want to set some global, or top-level, properties of the plot, such as its height and width (which are specified in units of pixels), label font size and weight, etc. These things can be adjusted using the configure_*() methods (see Vega-Altair docs). Again, let’s learn by example.
[5]:
alt.Chart(df
).mark_point(
).encode(
x='impact force (mN):Q',
y='adhesive force (mN):Q',
color=alt.Color('ID:N').title('Frog ID')
).configure_view(
continuousHeight=200,
continuousWidth=400
).configure_axis(
titleFontSize=16,
titleFontWeight='normal'
).interactive()
[5]:
The Vega-Altair documentation is very good, and you can usually find what configuration you would need in there.
Plots with categorical variables
We may be interested in a single type of measurement, say impact force, for each frog. Here, we have a quantitative axis, the impact force, and a categorical axis, the frog ID. We do not really make scatter plots with this kind of data. We will explore a few plotting options for this kind of data now.
Strip plots
We could plot these using a strip plot. Here, each measurement is represented by a tick.
[6]:
alt.Chart(df
).mark_tick(
).encode(
x='impact force (mN):Q',
y=alt.Y('ID:N').title('Frog ID')
)
[6]:
This is a good plot to make since you are plotting all of your data, but it does have the problem that you cannot tell if multiple ticks overlap. Let’s look at some alternatives.
Histograms
Histograms are a popular way of displaying repeated measurements. We might want to make a histogram of the impact forces and stack the bars of the histogram so we can see which frogs contributed which portion of the counts of impact forces.
[7]:
alt.Chart(df
).mark_bar(
).encode(
x=alt.X('impact force (mN):Q').bin(True).title('impact force (mN)'),
y=alt.Y('count()').title('count'),
color=alt.Color('ID').title('Frog ID')
)
[7]:
This is informative, as we see that frog I contributes bigger impacts, and also that the adolescent frogs (frogs III and IV) do not strike with a force greater than one newton.
Better alternatives for histograms
Histograms suffer from binning bias. By binning the data, you are not plotting all of them, and the shape of the histogram depends on where you put the bins. In general, if you can plot all of your data, you should. For that reason, I prefer not to use histograms, but rather to use ECDFs or jitter plots, which enable plotting of all data. If you do want to plot summary statistics, box plots are a reasonable alternative.
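Here is a small illustration of binning bias using NumPy rather than Altair. The six data points are made up; the only thing that changes between the two histograms is where the bin edges sit.

```python
import numpy as np

# Six made-up measurements
data = np.array([0.9, 1.1, 1.9, 2.1, 2.9, 3.1])

# Same data, same bin width, but the bin edges are shifted by half a bin
counts_a, _ = np.histogram(data, bins=[0, 1, 2, 3, 4])
counts_b, _ = np.histogram(data, bins=[0.5, 1.5, 2.5, 3.5])

print(counts_a)  # [1 2 2 1]
print(counts_b)  # [2 2 2]
```

The first choice of bins suggests a peak in the middle and the second suggests a flat distribution, even though the underlying data are identical.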
Jitter plots
In the strip plot, we plotted all of our data. We can make a similar plot with points instead of ticks. We can control the opacity of the marks to help us visualize overlap. As in the strip plot, we put the categorical variable (frog ID) on the y-axis.
[8]:
alt.Chart(df
).mark_point(
opacity=0.3
).encode(
y=alt.Y('ID:N').title('Frog ID'),
x='impact force (mN):Q'
)
[8]:
This is nicer, but we would like to visualize the points more clearly. So, instead of having the points all in one line for each frog, we can jitter the points along the categorical axis (here, the y-direction) by adding some random noise to their positions. In the grammar of Vega-Altair (which is the grammar of Vega/Vega-Lite), this jittering effect is a transform on the data, since we are still plotting against a categorical axis. That is, it is against the grammar to specify x and y positions of each point and then plot them while labeling the axis with a categorical variable like the frog ID. Rather, the jitter is a transform, which is part of the specification of the map of the data to its visual representation.
We have to explicitly apply the jitter transform. Here, we generate the jitter with a Box-Muller transform, a commonly used method for converting uniformly distributed random numbers into Normally distributed ones, and use it as an offset (the yOffset channel) within each frog's band.
[9]:
alt.Chart(df, width=300
).mark_point(
).encode(
x="impact force (mN):Q",
y="ID:N",
yOffset="jitter:Q",
color=alt.Color('ID:N').legend(None),
).transform_calculate(
# Jitter with Box-Muller transform
jitter='sqrt(-2*log(random()))*cos(2*PI*random())'
)
[9]:
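To convince ourselves that the expression in the jitter transform really gives Normally distributed offsets, here is a quick NumPy check of the same formula. This is only a sketch: it uses NumPy's random number generator with an arbitrary seed rather than Vega's random() function.

```python
import numpy as np

rng = np.random.default_rng(3252)
n = 100_000

# Two independent sets of uniform draws; 1 - random() lies in (0, 1],
# which avoids taking the log of zero
u1 = 1.0 - rng.random(n)
u2 = rng.random(n)

# Same formula as the Vega expression string used for the jitter
jitter = np.sqrt(-2 * np.log(u1)) * np.cos(2 * np.pi * u2)

# Should be close to 0 and 1, respectively, for a standard Normal
print(jitter.mean(), jitter.std())
```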
Box plots
Altair allows for construction of box plots as well.
[10]:
alt.Chart(df
).mark_boxplot(
).encode(
x='impact force (mN):Q',
y=alt.Y('ID:N').title(None),
color=alt.Color('ID:N').legend(None),
stroke=alt.Stroke('ID:N').legend(None),
)
[10]:
Box plots provide a summary of the data. The line in the middle of the box is the median, and the edges of the box are at the 25th and 75th percentiles. The distance between the 25th and 75th percentiles is called the interquartile range, or IQR. Each whisker of the box plot extends a distance equal to 1.5 times the interquartile range from the edge of the box, or to the extent of the data, whichever is less extreme. If data points are more extreme, they are shown individually, and are often referred to as outliers.
A box plot can be a useful visualization if you have many data points and it is difficult to plot them all. I rarely find situations where all of the data cannot be plotted, either with jitter plots or ECDFs. Nonetheless, I do not find box plots too objectionable, as they effectively display important nonparametric summary statistics of your data set.
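The quantities that go into a box plot can be computed by hand. Here is a minimal sketch with NumPy on a made-up data set. Note that it uses NumPy's default (linear interpolation) definition of percentiles; I am not certain Vega-Lite uses exactly the same definition, so values for real data may differ slightly from what the chart shows.

```python
import numpy as np

# Made-up data with one extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 30])

# Box: 25th percentile, median, 75th percentile
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach at most 1.5 IQR beyond the box...
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# ...but stop at the most extreme data point inside that range
inside = data[(data >= low_fence) & (data <= high_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn individually
outliers = data[(data < low_fence) | (data > high_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high)
print(outliers)
```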
Bar charts
At this point, you may be asking if you can make bar graphs. This is a common type of plot in the biological literature. We can make this with Altair. But before I even begin this, I will give you the following piece of advice: Don’t make bar graphs. More on that in a moment. For now, here is a bar graph.
[11]:
alt.Chart(df
).mark_bar(
).encode(
x='mean(impact force (mN)):Q',
y=alt.Y('ID:N').title('Frog ID')
)
[11]:
Notice that we were able to cleverly put a mean function into the string specifying the x encoding channel. There are several functions we can use there. However, I do not advise doing this. Rather, use Pandas to get yourself a DataFrame with whatever summary statistics you want to use, and then pass that to Altair. This enables you to have more explicit control over any statistical modeling you do. You should do this in general, not just for bar graphs (which you shouldn’t be making anyway). Here is how you can do that, this time including a standard error of the mean calculation.
[12]:
# Make data frame with means and standard error of mean
df_summary = (df.groupby('ID')['impact force (mN)']
.agg(['mean', 'sem'])
.reset_index())
Let’s take a quick look at this summary DataFrame.
[13]:
df_summary
[13]:
| | ID | mean | sem |
---|---|---|---|
0 | I | 1530.20 | 140.918782 |
1 | II | 707.35 | 94.937466 |
2 | III | 550.10 | 27.788477 |
3 | IV | 419.10 | 52.517260 |
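It is worth being clear about what the 'sem' aggregation computes: the sample standard deviation (with ddof=1) divided by the square root of the number of measurements. Here is a quick check on a made-up data frame, with hypothetical column names ID and force.

```python
import numpy as np
import pandas as pd

# Made-up measurements for two groups
df_demo = pd.DataFrame(
    {
        "ID": ["A", "A", "A", "B", "B", "B", "B"],
        "force": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0, 8.0],
    }
)

# Same pattern as in the lesson: mean and SEM per group
summary = df_demo.groupby("ID")["force"].agg(["mean", "sem"]).reset_index()

# Standard error of the mean computed by hand
grouped = df_demo.groupby("ID")["force"]
manual_sem = grouped.std(ddof=1) / np.sqrt(grouped.count())

print(summary)
print(manual_sem)
```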
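We can use the sem column to add error bars. Here, the error bars extend 1.96 standard errors on either side of the mean, which corresponds to an approximate 95% confidence interval if the mean is Normally distributed.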
[14]:
# Add error bars to df_summary
df_summary['error_low'] = df_summary['mean'] - 1.96*df_summary['sem']
df_summary['error_high'] = df_summary['mean'] + 1.96*df_summary['sem']
# Take another look
df_summary
[14]:
| | ID | mean | sem | error_low | error_high |
---|---|---|---|---|---|
0 | I | 1530.20 | 140.918782 | 1253.999187 | 1806.400813 |
1 | II | 707.35 | 94.937466 | 521.272566 | 893.427434 |
2 | III | 550.10 | 27.788477 | 495.634584 | 604.565416 |
3 | IV | 419.10 | 52.517260 | 316.166170 | 522.033830 |
This tidy DataFrame can now be used to make the bar graph and to overlay the error bars on it.
[15]:
# Make a bar graph
bars = alt.Chart(df_summary
).mark_bar(
).encode(
x=alt.X('mean:Q').title('impact force (mN)'),
y=alt.Y('ID:N').title('Frog ID')
)
# Make the error bars
error_bars = alt.Chart(df_summary
).mark_rule(
).encode(
x='error_low:Q',
x2='error_high:Q',
y=alt.Y('ID:N').title('Frog ID')
)
# Overlay
chart = bars + error_bars
# Thin the bars a bit
chart.configure_scale(bandPaddingInner=0.5)
[15]:
Did you see that? To overlay plots, we just use the + operator in Vega-Altair! So convenient!
So, this exercise in bar graphs with error bars just allowed you to learn about overlays in Vega-Altair. But I cannot stress this enough: Do not ever make a plot like this. There are so many reasons why. You are not plotting all of your data, and overlaying error bars computed from standard error of the mean implicitly assumes a statistical model (which is almost always not a good one). Please, I implore you, do not make bar graphs with error bars, and, in general….
Don’t make bar graphs
Bar graphs, especially with error bars, are typically awful. They are pervasive in biology papers. I have yet to find a single example where a bar graph is the best choice. Jitter plots, or even box plots, are more informative and almost always preferred. In fact, ECDFs (those wonderful things I keep mentioning that we will soon get to in upcoming lessons) are often better even than these. Whether you use jitter plots or ECDFs, here is a simple message:
Don’t make bar graphs.
What should I do instead, you ask? The answer is simple: plot all of your data when you can. If you can’t, box plots are always better than bar graphs.
ECDFs
And, of course, you can make an ECDF using Vega-Altair! Here, we compute the ECDF value for each measurement, separately for each frog, with Pandas before plotting.
[16]:
df["ECDF"] = df.groupby("ID")["impact force (mN)"].transform(
lambda x: x.rank(method="first") / len(x)
)
alt.Chart(df, height=200, width=300
).mark_point(
).encode(
x=alt.X("impact force (mN):Q").scale(alt.Scale(zero=False)),
y="ECDF:Q",
color="ID:N",
)
[16]:
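To see what the rank-based calculation does, here is the same computation on a tiny made-up data frame (with hypothetical column names ID and force). I also include a variant with method="max", which matters only when there are ties: with method="first", tied values get different heights, whereas with method="max", each value gets the fraction of measurements less than or equal to it, which is the formal definition of the ECDF.

```python
import pandas as pd

# Made-up data; group B contains a tie
df_demo = pd.DataFrame(
    {
        "ID": ["A", "A", "A", "B", "B"],
        "force": [3.0, 1.0, 2.0, 10.0, 10.0],
    }
)

# Same calculation as in the lesson: rank within each group over group size
df_demo["ECDF"] = df_demo.groupby("ID")["force"].transform(
    lambda x: x.rank(method="first") / len(x)
)

# Variant where tied values share the highest rank
df_demo["ECDF (ties at max)"] = df_demo.groupby("ID")["force"].transform(
    lambda x: x.rank(method="max") / len(x)
)

print(df_demo)
```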
Computing environment
[17]:
%load_ext watermark
%watermark -v -p pandas,altair,jupyterlab
Python implementation: CPython
Python version : 3.11.3
IPython version : 8.12.0
pandas : 1.5.3
altair : 5.0.1
jupyterlab: 3.6.3