(c) 2019 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This lesson was generated from a Jupyter notebook. You can download the notebook here.
import numpy as np
import pandas as pd
import bokeh_catplot
import bokeh.plotting
import bokeh.io
bokeh.io.output_notebook()
In this lesson, do some plotting with a high-level package Bokeh-catplot. If you haven't already, you should install it.
pip install --upgrade bokeh-catplot
For this lesson, we will use the frog tongue data set from Kleinteich and Gorb that we used in lesson 21. Let's get the data frame loaded in so we can be on our way.
df = pd.read_csv('data/frog_tongue_adhesion.csv', comment='#')
# Have a look so we remember
df.head()
HoloViews is an excellent high-level plotting package that can use Bokeh to render plots. In my view, it is one of the best high-level plotting languaes in the Python plotting landscape. We will work with HoloViews in a future lesson, but in this lesson, we will do our high-level plotting using a package I wrote called Bokeh-catplot.
Why are we using this package and not HoloViews? In next year's bootcamp, I am almost certain we will use HoloViews exclusively for high-level plotting because I suspect that most of the functionality in Bokeh-catplot will be incorporated into HoloViews. Bokeh-catplot exists because HoloViews lacks some important functionality. (More on these very important plot types in a moment; don't worry if you don't know what they are just yet.)
There are all relatively minor fixes for HoloViews, which will likely have this functionality in the near future.
Nonetheless, ECDFs and horizontal strip plots are important visualizations and I advocate using them often, and Bokeh-catplot provides a high-level way to render them using Bokeh.
Let us first consider the different kinds of data we may encounter as we think about constructing a plot.
In practice, ordinal data can be cast as quantitative or treated as categorical with an ordering enforced on the categories (e.g., categorical data [1, 2, 3]
becomes ['1', '2', '3']
.). Temporal data can also be cast as quantitative, (e.g., second from the start time). We will therefore focus out attention on quantitative and categorical data.
When we made scatter plots in the previous lesson, both types of data were quantitative. We did actually incorporate categorical information in the form of colors of the glyph (insomniacs and normal sleepers being colored differently) and in tooltips.
But what if we wanted a single type of measurement, say impact force, for each frog. Here, we have the quantitative impact force data and the categorical frog ID data. One of our axes is now categorical.
To demonstrate how to set up a categorical axis with Bokeh, I will make a bar graph of the mean impact force for each of the four frogs. But before I even begin this, I will give you the following piece of advice: Don't make bar graphs. More on that in a moment.
Before we do that, we need to compute the means from the inputted data frame.
df_mean = df.groupby('ID')['impact force (mN)'].mean().reset_index()
# Take a look
df_mean
To set up a categorical axis, you need to specify the x_range
(or y_range
if you want the y-axis to be categorical) as a list with the categories you want on the axis when you instantiate the figure. I will make a horizontal bar graph, so I will specify y_range
. Also, when I instantiate this figure, because it is not very tall and I do not want the reset tool cut off, I will also explicitly set the tools I want in the toolbar.
p = bokeh.plotting.figure(
height=200,
width=400,
x_axis_label='impact force (mN)',
y_range=df_mean['ID'].unique()[::-1],
tools='pan,wheel_zoom,save,reset'
)
Now that we have the figure, we can put the bars on. The p.hbar()
method populates the figure with horizontal bar glyphs. The right
kwarg says what column of the data source dictates how far to the right to show the bar, while the height
kwarg says how think the bars are.
I will also ensure the quantitative axis starts at zero and turn off the grid lines on the categorical axis, which is commonly done.
p.hbar(
source=df_mean,
y='ID',
right='impact force (mN)',
height=0.6
)
# Turn off gridlines on categorical axis
p.ygrid.grid_line_color = None
# Start axes at origin on quantitative axis
p.x_range.start = 0
bokeh.io.show(p)
We similarly make vertical bar graphs specifying x_range
and using p.vbar()
.
p = bokeh.plotting.figure(
height=250,
width=250,
y_axis_label='impact force (mN)',
x_range=df_mean['ID'].unique(),
)
p.vbar(
source=df_mean,
x='ID',
top='impact force (mN)',
width=0.6
)
p.xgrid.grid_line_color = None
p.y_range.start = 0
bokeh.io.show(p)
Generating the bar graphs was not too painful, even tough we used Bokeh, a low-level plotting library. We would like to make plots more directly. We do not want to have to explicitly pre-process the data, set up the categorical axis, etc. We would like to just provide a data set, say which column(s) is/are categorical and which is quantitative, and then just get our plot. As I mentioned before, Holoviews is excellent for this purpose, but we will use Bokeh-catplot as our high-level package for making plots.
Bokeh-catplot generates plots from tidy data frames where one or more columns contains categorical data and the column of interest in the plot is quantitative.
There are four types of plots that Bokeh-catplot can generate. As you will see, all four of these modes of plotting are meant to give a picture about how the quantitative measurements are distributed for each category.
bokeh_catplot.box()
bokeh_catplot.strip()
bokeh_catplot.histogram()
bokeh_catplot.ecdf()
The first three arguments of each of these functions are necessary to build the plot. They are:
data
: A tidy data framecats
: A list of columns in the data frame that are to be considered as categorical variables in the plot. If None
, a single box, strip, histogram, or ECDF is plotted.val
: The column of the data frame to be treated as the quantitative variable. With this in mind, we will put Bokeh-catplot to use on the frog data set.
As I discuss below, bar graphs are almost never a good choice for visualization. You distill all of the information in the data set down to one or two summary statistics, and then use giant glyphs to show them. As a start for improvement, you could distill the data set down to five or so summary statistics and show those graphically, as opposed to just one or two.
Box plots provide such a summary. I will first make one using bokeh_catplot.box()
and then describe how a box plot is interpreted.
p = bokeh_catplot.box(
data=df,
cats='ID',
val='impact force (mN)'
)
bokeh.io.show(p)
The line in the middle of each box is the median and the top and bottom of the box at the 75th and 25th percentiles, respectively. The distance between the 25th and 75th percentiles is called the interquartile range, or IQR. The whiskers of the box plot extend to the most extreme data point within 1.5 times the interquartile range. If any data points are more extreme than the end of the whisker, they are shown individually, and are often referred to as outliers.
A box plot can use a useful visualization if you have many data points and it is difficult to plot them all. I rarely find that there are situations where all data cannot be plotted, either with strip plots of ECDFs, which we will cover in a moment, so I generally do not use box plots. Nonetheless, I do not find them too objectionable, as they effectively display important nonparametric summary statistics of your data set.
Box plots summarize a data set with summary statistics, but what not plot all your data? You work hard to acquire them. You should show them all. This is a mantra to live by.
Let's do that now.
A strip plot is like a scatter plot; it puts a glyph for every measured data point. The only different is that one of the axes is categorical. In this case, you are plotting all of your data.
p = bokeh_catplot.strip(
data=df,
cats='ID',
val='impact force (mN)'
)
bokeh.io.show(p)
This is a good plot to make since you are plotting all of your data, but it does have the problem that you cannot tell if multiple ticks overlap. Let's look at some alternatives.
In plotting all of our data in a strip plot, we can roughly see how the data are distributed. There are more measurements where there are more glyphs.
p = bokeh_catplot.histogram(
data=df,
cats='ID',
val='impact force (mN)'
)
bokeh.io.show(p)
Histograms are typically used to display how data are distributed. As an example I will generate Normally distributed data and plot the histogram. (We will learn how to generate data like this when we study random number generation with NumPy in a future lesson. For not, this is for purposes of discussing plotting options.)
# Generate normally distributed data
np.random.seed(353926)
df_norm = pd.DataFrame(data={'x': np.random.normal(size=500)})
# Plot the histogram
p = bokeh_catplot.histogram(
data=df_norm,
cats=None,
val='x'
)
bokeh.io.show(p)
This looks similar to the standard Normal curve we are used to seeing and is a useful comparison to a probability density function (PDF). However, Histograms suffer from binning bias. By binning the data, you are not plotting all of them. In general, if you can plot all of your data, you should. For that reason, I prefer not to use histograms for studying how data are distributed, but rather prefer to use ECDFs, which enable plotting of all data.
The ECDF evaluated at x for a set of measurements is defined as
\begin{align} \text{ECDF}(x) = \text{fraction of measurements } \le x. \end{align}While the histogram is an attempt to visualize a probability density function (PDF) of a distribution, the ECDF visualizes the cumulative density function (CDF). The CDF, $F(x)$, and PDF, $f(x)$, both completely define a univariate distribution and are related by
\begin{align} f(x) = \frac{\mathrm{d}F}{\mathrm{d}x}. \end{align}The definition of the ECDF is all that you need for interpretation. Once you get used to looking at CDFs, they will become as familiar as PDFs. A peak in a PDF corresponds to an inflection point in a CDF.
To make this more clear, let us look at plot of a PDF and ECDF for familiar distributions, the Gaussian and Binomial.
Now that we know how to interpret ECDFs, lets plot the ECDF for our dummy Normally-distributed data.
p = bokeh_catplot.ecdf(
data=df_norm,
cats=None,
val='x'
)
bokeh.io.show(p)
Now that we understand what an ECDF is and how to plot it, let's make a set of ECDFs for our frog data.
p = bokeh_catplot.ecdf(
data=df,
cats='ID',
val='impact force (mN)'
)
bokeh.io.show(p)
Each dot in the ECDF is a single data point that we measured. Given the above definition of the ECDF, it is defined for all real $x$. So, formally, the ECDF is a continuous function (with discontinuous derivatives at each data point). So, it should be plotted like a staircase according to the formal definition. We can plot it that way using the formal
keyword argument.
p = bokeh_catplot.ecdf(
data=df,
cats='ID',
val='impact force (mN)',
formal=True
)
bokeh.io.show(p)
This is still plotting all of your data. The concave corners of the staircase correspond to the measured data. This can be seen by overlaying the "dot" version of the ECDFs.
p = bokeh_catplot.ecdf(
data=df,
cats='ID',
val='impact force (mN)',
p=p
)
bokeh.io.show(p)
You may have noticed in the discussion of ECDFs that I introduced some new keyword arguments, formal
and p
. In fact, each of the four plotting functions also has the following additional optional keyword arguments.
palette
: A list of hex colors to use for coloring the markers for each category. By default, it uses the default color scheme of Vega-Lite.order
: If specified, the ordering of the categories to use on the categorical axis and legend (if applicable). Otherwise, the order of the inputted data frame is used.p
: If specified, the bokeh.plotting.Figure
object to use for the plot. If not specified, a new figure is created.The respective plotting functions also have kwargs that are specific to each (such as formal
for bokeh_catplot.ecdf()
. Examples highlighting some, but not all, customizations follow. You can find out what kwargs are available for each function by reading their doc strings, e.g., with
bokeh_catplot.box?
Any kwargs not in the function call signature are passed to bokeh.plotting.figure()
when the figure is instantiated.
We can also have horizontal box plots.
p = bokeh_catplot.box(
data=df,
cats='ID',
val='impact force (mN)',
horizontal=True
)
bokeh.io.show(p)
We can independently specify properties of the marks using box_kwargs
, whisker_kwargs
, median_kwargs
, and outlier_kwargs
. For example, say we wanted our colors to be Betancourt red, and that we wanted the outliers to also be that color and use diamond glyphs.
p = bokeh_catplot.box(
data=df,
cats='ID',
val='impact force (mN)',
whisker_caps=True,
outlier_marker='diamond',
box_kwargs=dict(fill_color='#7C0000'),
whisker_kwargs=dict(line_color='#7C0000', line_width=2),
)
bokeh.io.show(p)
To help alleviate the overlap problem, we can make a strip plot with dash markers and add some transparency.
p = bokeh_catplot.strip(
data=df,
cats='ID',
val='impact force (mN)',
marker='dash',
marker_kwargs=dict(alpha=0.5)
)
bokeh.io.show(p)
The problem with strip plots is that they can have trouble with overlapping data points. A common approach to deal with this is to "jitter," or place the glyphs with small random displacements along the categorical axis. I do that here, allowing for hover tools that give more information about the respective data points.
p = bokeh_catplot.strip(
data=df,
cats='ID',
val='impact force (mN)',
jitter=True,
tooltips=[
('trial', '@{trial number}'),
('adh force', '@{adhesive force (mN)}')
]
)
bokeh.io.show(p)
With any of the plots, you can have more than one categorical column, and the categorical axes are nicely spaced and formatted. Here, we'll categorize by frog ID and by trial number.
p = bokeh_catplot.strip(
data=df,
cats=['ID', 'trial number'],
val='impact force (mN)',
color_column='trial number',
width=550
)
bokeh.io.show(p)
Even while plotting all of the data, we sometimes want to graphically display summary statistics, in which case overlaying a box plot and a jitter plot is useful. To populate an existing Bokeh figure with new glyphs from another catplot, pass in the p
kwarg. You should be careful, though, because you need to make sure the cats
, val
, and horizontal
arguments exactly match.
p = bokeh_catplot.strip(
data=df,
cats='ID',
val='impact force (mN)',
horizontal=True,
jitter=True,
height=250
)
p = bokeh_catplot.box(
data=df,
cats='ID',
val='impact force (mN)',
horizontal=True,
whisker_caps=True,
display_points=False,
box_kwargs=dict(fill_color=None, line_color='gray'),
median_kwargs=dict(line_color='gray'),
whisker_kwargs=dict(line_color='gray'),
p=p,
)
bokeh.io.show(p)
We could plot normalized histograms using the density kwarg, and we'll make the plot a little wider to support the legend.
# Plot the histogram
p = bokeh_catplot.histogram(
data=df_norm,
cats=None,
val='x',
density=True,
)
bokeh.io.show(p)
Instead of plotting a separate ECDF for each category, we can put all of the categories together on one ECDF and color the points by the categorical variable by using the kind='colored'
kwarg. Note that if we do this, we can only have the "dot" style ECDF, not the formal staircase.
p = bokeh_catplot.ecdf(
data=df,
cats='ID',
val='impact force (mN)',
kind='colored',
)
bokeh.io.show(p)
Bar graphs, especially with error bars (in which case they are called dynamite plots), are typically awful. They are pervasive in biology papers. I have yet to find a single example where a bar graph is the best choice. Strip plots (with jitter) or even box plots, are more informative and almost always preferred. In fact, ECDFs are often better even than these. Here is a simple message:
What should I do instead you ask? The answer is simple: plot all of your data when you can. If you can't, box plots are always better than bar graphs.
%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,bokeh_catplot,jupyterlab