(c) 2019 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This lesson was generated from a Jupyter notebook. You can download the notebook here.
import numpy as np
import scipy.special
import pandas as pd
import bootcamp_utils.hv_defaults
import bokeh.io
import bokeh.palettes
import holoviews as hv
import holoviews.operation.datashader as hvds
bokeh.io.output_notebook()
hv.extension('bokeh')
HoloViews is a high-level plotting library that is part of the PyViz ecosystem. It allows specification of plots, and is agnostic about what is used to render them. We will use Bokeh as our renderer. Importantly, HoloViews provides convenient access to Datashader.
To set this up, we import HoloViews (as hv) and then set the HoloViews extension to be Bokeh using hv.extension('bokeh') at the top of the notebook. We also need to import holoviews.operation.datashader (here, as hvds) to use Datashader.
Imagine you have a tidy data set (and HoloViews really only works with tidy data sets). It is already logically organized; each row is an observation and each column a variable. Let us think for a moment conceptually (that is, not in terms of steps of coding) about how we might make a scatter plot from a tidy data frame. We need to (obviously) first decide that we want to make a scatter plot, i.e., we specify what kind of graphic element we want to convert our data set into. Then, we need to annotate the columns of the data frame. That is, we need to annotate which column will determine the x-coordinate of the glyphs in the scatter plot and which will determine the y-coordinate of the glyphs. After we have made these decisions, that is, what kind of graphic element we want to produce and what columns give the x-coordinates and what gives the y-coordinates, the fundamental plot is complete. Everything else is visual styling.
The philosophy of HoloViews, right on the front of the webpage, is "Stop plotting your data - annotate your data and let it visualize itself." With HoloViews, you add minimal annotations to your (tidy; must be tidy!) data to enable visualization. You can then later stylize the visualization, but the annotation is sufficient to describe the plot. Specifically, the annotations you need are the kind of graphical element you want to build from the data and the dimensions, that is, which columns of the data set determine where the data appear in the graphic.
Once you make those annotations, HoloViews can take care of the rendering, using either Matplotlib, Bokeh, or Plotly. The main idea is that HoloViews objects are conceptual, agnostic to the particulars of rendering. You can stylize the rendering if you like, but the fundamentals of the plotting object are already set by the annotation.
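Because these annotations assume tidy data, it can help to see what tidying looks like. The following is a minimal sketch, using a small made-up wide-format table (the column names and values are hypothetical and are not from the finch data set), of reshaping to tidy format with the melt() method of a pandas data frame, so that each row is a single observation and each column is a variable.

```python
import pandas as pd

# Hypothetical wide-format table: one column per species (made-up values)
df_wide = pd.DataFrame(
    {
        "replicate": [1, 2, 3],
        "fortis": [9.4, 9.2, 9.5],
        "scandens": [8.8, 9.1, 8.6],
    }
)

# Tidy it: each row is one observation, each column is one variable
df_tidy = df_wide.melt(
    id_vars="replicate", var_name="species", value_name="beak depth (mm)"
)

print(df_tidy)
```

After melting, the species and beak depth columns can be annotated directly as dimensions.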
HoloViews is imported as hv, which we have done in the cell at the top of this notebook. Because HoloViews is agnostic to the ultimate renderer, we need to specify an extension, which we did above by executing hv.extension('bokeh'). Our plots will now be rendered using Bokeh.
Note that you must install the appropriate JupyterLab extension to view HoloViews plots. You did this in Lesson 0 with
jupyter labextension install @pyviz/jupyterlab_pyviz
As an example of use of HoloViews, we will again visit the Grant and Grant finch beak data. We will load it in and take a look.
df = pd.read_csv('data/grant_complete.csv')
df.head()
We will now make a plot and explain how the syntax relates to the ideas behind annotating data sets. We will make a simple scatter plot of the beak length vs. beak depth for all birds measured in 2012.
# Copy the slice so that columns can safely be added to it later
df_2012 = df.loc[df['year']==2012, :].copy()
hv.Points(
data=df_2012,
kdims=['beak length (mm)', 'beak depth (mm)'],
vdims=['species'],
)
We used hv.Points to invoke an element of visualization. An element is just a way of converting the tabular nature of the data to a graphical representation, in this case a scatter plot of points. That is, we want to make a plot where each glyph lies in a two-dimensional plot and the values of both the x- and y-axes are independent. (This is contrasted with hv.Scatter, in which the x-coordinate is the independent variable and the y-coordinate is dependent on x; hv.Points is more appropriate here.)
The available element types may be found in the HoloViews reference gallery.
There are two types of dimensions, key dimensions and value dimensions, specified with the kdims and vdims arguments, respectively. You can think of key and value dimensions like the keys and values of a dictionary (where you can have multidimensional keys). Key dimensions are indexing dimensions, which say where on the graphic the data in a row will reside. The value dimensions give information about each data point. In the simple plot above, the key dimensions are the beak length and beak depth. Those columns determined where the glyphs were placed.
We additionally had a value dimension, specified by vdims, which has additional information associated with each data point. This information was not used in the above plot, but we will put it to use momentarily.
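To make the dictionary analogy concrete, here is a minimal sketch in plain Python. The rows are made-up values, and the dictionary is only an illustration of the idea, not how HoloViews stores data (a real dictionary, for instance, could not hold two birds with identical measurements).

```python
# Hypothetical rows of a tidy table: (beak length, beak depth, species)
rows = [
    (10.5, 9.1, "fortis"),
    (14.3, 9.0, "scandens"),
    (11.2, 8.7, "fortis"),
]

# Key dimensions form a (multidimensional) key saying where a glyph goes;
# value dimensions are the extra information attached to that location.
points = {(length, depth): species for length, depth, species in rows}

for (length, depth), species in points.items():
    print(f"glyph at x={length}, y={depth} carries species={species}")
```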
After a plotting Element is specified, we can stylize it using the hv.opts functionality. To investigate what styling options are available for each kind of plotting Element, you can enter, for example
hv.help(hv.Points)
and you will get detailed information on what options are available for stylizing hv.Points elements.
I find the HoloViews defaults not very pleasing. If you agree and want to define defaults for an entire document, you may do so using hv.opts.defaults(). I have made some defaults that I find more pleasing that are available in the bootcamp_utils.hv_defaults.set_defaults() function. Let's set those defaults (which will be active for the rest of the notebook), and see how our plot looks.
bootcamp_utils.hv_defaults.set_defaults()
hv.Points(
data=df_2012,
kdims=['beak length (mm)', 'beak depth (mm)'],
vdims=['species'],
)
Recall that we have an unused value dimension in the element we created. We would naturally like to separate out the glyphs by species. To do this, we can do a groupby operation on the Element. That's right, we can do groupby operations on graphical elements!
hv.Points(
data=df_2012,
kdims=['beak length (mm)', 'beak depth (mm)'],
vdims=['species'],
).groupby(
'species'
)
We now have a pull-down menu to the right of the plot where we can select the species we want, and the glyphs on the plot will adjust accordingly. By default, after applying the groupby operation, HoloViews gives us a HoloMap object. The column we used to group by is now selectable through a graphical interface (a pull-down menu).
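Conceptually, grouping an Element is much like grouping the underlying data frame: the rows are split into one subset per value of the grouping column, and each subset becomes one selectable frame. The following sketch shows that split using pandas alone, with a small made-up data frame standing in for the finch data; the dictionary of sub-frames is only an analogy for the HoloMap, not its actual implementation.

```python
import pandas as pd

# Hypothetical stand-in for the 2012 finch data (made-up values)
df_small = pd.DataFrame(
    {
        "beak length (mm)": [10.5, 14.3, 11.2, 13.9],
        "beak depth (mm)": [9.1, 9.0, 8.7, 9.3],
        "species": ["fortis", "scandens", "fortis", "scandens"],
    }
)

# Each group plays the role of one frame selectable from the pull-down menu
frames = {species: sub_df for species, sub_df in df_small.groupby("species")}

for species, sub_df in frames.items():
    print(species, len(sub_df))
```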
We may instead wish to group by species and lay the plots out next to each other, creating a layout. We can use the layout() method to do this.
hv.Points(
data=df_2012,
kdims=['beak length (mm)', 'beak depth (mm)'],
vdims=['species'],
).groupby(
'species'
).layout(
)
Finally, we may wish to group by species and overlay the resulting plots on a single set of axes.
hv.Points(
data=df_2012,
kdims=['beak length (mm)', 'beak depth (mm)'],
vdims=['species'],
).groupby(
'species'
).overlay(
)
HoloViews was kind enough to automatically provide us with a legend!
As an example of how to use the .opts() method to stylize a plot, we can use .opts() to add tooltips where we can hover and get additional information from the vdims.
hv.Points(
data=df_2012,
kdims=['beak length (mm)', 'beak depth (mm)'],
vdims=['species'],
).groupby(
'species'
).opts(
tools=['hover']
).overlay(
)
As a final example of constructing this plot, let's consider the entire data set and allow the year to be selected via a HoloMap, but color by species for each year. (We have to set show_legend=False because of a bug in laying out HoloMaps with legends.)
hv.Points(
data=df,
kdims=['beak length (mm)', 'beak depth (mm)'],
vdims=['species', 'year'],
).groupby(
['species', 'year'],
).opts(
tools=['hover'],
show_legend=False,
).overlay(
'species',
)
After making and displaying a HoloViews plot, we might want to get the Bokeh figure. We can extract that using hv.render().
hv_fig = hv.Points(
data=df,
kdims=['beak length (mm)', 'beak depth (mm)'],
vdims=['species', 'year'],
).groupby(
['species', 'year'],
).opts(
tools=['hover'],
show_legend=False,
).overlay(
'species',
)
# Take out the Bokeh object
p = hv.render(hv_fig)
# Display using Bokeh
bokeh.io.show(p)
Note that we got the plot for 1973, which was the first year offered by the interactive HoloMap. If we wanted another year, we would have to make a plot specifically for that year.
We have seen the basics of how HoloViews works for a scatter plot specified by hv.Points. We now show some of the other kinds of plots we have encountered so far.
HoloViews can plot a smooth function using hv.Curve. For a Curve, there is one key dimension, which is the independent variable, and one value dimension, which is the dependent variable. This is to be contrasted with hv.Path, which has two key dimensions, meaning that neither of the variables is strictly dependent on the other.
Here is a HoloViews plot of the cross section of the Airy disk. We can either provide a data frame with columns, or we can provide a 2-tuple of NumPy arrays that serve as the independent and dependent variable, respectively.
# The x-values we want
x = np.linspace(-15, 15, 400)
# The normalized intensity
norm_I = 4 * (scipy.special.j1(x) / x)**2
hv.Curve(
data=(x, norm_I),
kdims='x',
vdims='normalized intensity'
)
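The normalized intensity plotted here is (2 J1(x) / x)**2, which is of the form 0/0 at x = 0. The grid above sidesteps this because an even number of evenly spaced points on a symmetric interval does not include zero. If a grid might contain x = 0, one option is to fill in the limiting value of 1 by hand. Below is a minimal sketch of that idea; the function name airy_intensity is hypothetical and it expects an array as input.

```python
import numpy as np
import scipy.special


def airy_intensity(x):
    """Normalized Airy intensity, (2 J1(x) / x)**2, using the x = 0 limit."""
    x = np.asarray(x, dtype=float)

    # The limit of (2 J1(x) / x)**2 as x -> 0 is 1
    out = np.ones_like(x)

    # Evaluate the formula only where it is well defined
    nonzero = x != 0
    out[nonzero] = (2 * scipy.special.j1(x[nonzero]) / x[nonzero]) ** 2

    return out


# A grid whose midpoint is at (or within rounding error of) x = 0
x = np.linspace(-15, 15, 401)
norm_I = airy_intensity(x)
```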
Box plots are made using hv.BoxWhisker elements. If multiple key dimensions are specified, nested categorical axes are automatically set up.
Note: HoloViews's box plots currently use a non-canonical definition of the whisker length that will be adjusted in a future version.
hv.BoxWhisker(
data=df,
kdims=['species', 'year'],
vdims=['beak depth (mm)'],
).opts(
box_color='species'
)
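For reference, one widely used convention (due to Tukey) draws each whisker out to the most extreme data point lying within 1.5 times the interquartile range of the box. The sketch below computes box plot statistics under that convention. It assumes this is the "canonical" definition meant in the note above, it is not a description of what HoloViews computes, and the function name tukey_box_stats and the data are made up.

```python
import numpy as np


def tukey_box_stats(data):
    """Quartiles and whisker ends under Tukey's 1.5 x IQR convention."""
    data = np.asarray(data, dtype=float)
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1

    # Whiskers end at the most extreme points within 1.5 IQR of the box
    lower = data[data >= q1 - 1.5 * iqr].min()
    upper = data[data <= q3 + 1.5 * iqr].max()

    return {"q1": q1, "median": median, "q3": q3, "lower": lower, "upper": upper}


# Made-up measurements with one outlier (14.0)
stats = tukey_box_stats([8.0, 8.5, 9.0, 9.2, 9.5, 10.0, 14.0])
print(stats)
```

With these values, the upper whisker stops at 10.0 and the point at 14.0 would be drawn as an outlier.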
We use hv.Scatter() to generate strip plots. When we specify the jitter kwarg, we specify the width of the jitter.
Note that nested categorical axes are currently (as of July 19, 2019) only supported for box, violin, and bar plots, as per the docs, but will eventually be supported for many more plot types, including Scatter, which is used to generate strip plots.
# Make the year column a string so we can use it as a categorical variable
df['year_str'] = df['year'].astype(str)
hv.Scatter(
data=df,
kdims=[('year_str', 'year')],
vdims=['beak depth (mm)', 'species'],
).groupby(
'species'
).opts(
color='species',
cmap=bootcamp_utils.hv_defaults.default_cmap,
jitter=0.4,
show_legend=False,
width=400,
height=250,
).layout(
)
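Conceptually, jitter adds a small random offset along the categorical axis so that glyphs sharing a category do not sit on top of one another. The sketch below illustrates this with NumPy under the assumption that the offsets are uniformly distributed over a total width equal to the jitter value, centered on the category position. That matches my understanding of how uniform jitter is applied by the renderer, but treat it as an assumption; the variable names and positions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3252)

# Hypothetical positions of three categories on the x-axis, 50 points each
category_pos = np.repeat([0, 1, 2], 50)

# Assumed model: uniform offsets spanning a total width of `jitter`
jitter = 0.4
offsets = rng.uniform(-jitter / 2, jitter / 2, size=len(category_pos))
x_jittered = category_pos + offsets
```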
When making a histogram, the values of the bin edges and counts must be computed beforehand using np.histogram(). Note that np.histogram() returns the counts first and the bin edges second.
counts, edges = np.histogram(df_2012['beak depth (mm)'], bins=int(np.sqrt(len(df_2012))))
We can then pass the bin edges and counts into hv.Histogram().
bootcamp_utils.hv_defaults.set_defaults()
hv.Histogram(
data=(edges, counts),
kdims='beak depth (mm)'
)
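As a quick check of what np.histogram() gives back, here is a small sketch with made-up data standing in for the beak depths. It uses the same square-root rule for the number of bins and shows that the counts come first in the returned tuple and that there is one more bin edge than there are bins.

```python
import numpy as np

# Made-up data standing in for the beak depths
rng = np.random.default_rng(3252)
data = rng.normal(9.0, 0.7, size=100)

# Square-root rule for the number of bins
n_bins = int(np.sqrt(len(data)))

# np.histogram() returns the counts first, then the bin edges
counts, edges = np.histogram(data, bins=n_bins)

print(n_bins, len(counts), len(edges))
```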
HoloViews does not have native support for ECDFs, but we can create ECDFs in a data frame and use hv.Scatter to make a plot of an ECDF.
def ecdf_transform(data):
    return data.rank(method="first") / len(data)
df_2012.loc[:, "beak depth ECDF"] = df_2012.groupby("species")[
"beak depth (mm)"
].transform(ecdf_transform)
After supplying the y-values for the ECDF, we plot with hv.Scatter.
hv.Scatter(
data=df_2012,
kdims='beak depth (mm)',
vdims=[('beak depth ECDF', 'ECDF'), 'species'],
).groupby(
'species'
).overlay(
)
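To see what the rank-based transform produces, here is a minimal sketch on a tiny made-up data frame (the group labels and values are hypothetical). Within each group, the smallest value gets an ECDF value of 1/n and the largest gets 1.

```python
import pandas as pd


def ecdf_transform(data):
    """ECDF value for each entry: its rank divided by the number of points."""
    return data.rank(method="first") / len(data)


# Made-up measurements for two hypothetical groups
df_toy = pd.DataFrame(
    {
        "species": ["a", "a", "a", "a", "b", "b"],
        "depth": [9.1, 8.4, 9.8, 8.9, 10.2, 9.9],
    }
)

# Compute the ECDF separately within each group
df_toy["ECDF"] = df_toy.groupby("species")["depth"].transform(ecdf_transform)

print(df_toy)
```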
We are often faced with data sets that have too many points to plot, and when overlaid, the glyphs obscure each other. One strategy to deal with this is to specify transparency of the glyphs so we can visualize where they are dense and where they are sparse.
As an example, we will consider a flow cytometry data set consisting of 100,000 data points. The data appeared in Razo-Mejia, et al., Cell Systems, 2018. First, we'll read in the data and take a look.
df = pd.read_csv('data/20160804_wt_O2_HG104_0uMIPTG.csv', comment='#', index_col=0)
df.head()
We can make a scatter plot of the forward versus side scattering (FSC-A vs. SSC-A) of each measurement, as is often done. We will only use 5,000 data points so as not to choke the browser with all 100,000 (though we of course want to plot all 100,000). The difficulty with large data sets will become clear.
hv.Points(
data=df.iloc[::20, :],
kdims=['SSC-A', 'FSC-A'],
).opts(
logx=True,
logy=True,
fill_alpha=0.05,
line_alpha=0,
fill_color='dodgerblue',
padding=0.05
)
The transparency helps us see where the density is, but it washes out all of the detail for points away from dense regions. There is the added problem that we cannot populate the plot with too many glyphs, so we can't plot all of our data.
HoloViews's integration with Datashader allows us to plot all points for data sets with millions to billions of points (so 100,000 is a piece of cake!). It works like Google Maps: it displays raster images on the plot that show the level of detail of the data points appropriate for the level of zoom. It adjusts the image as you interact with the plot, so the browser never gets hit with a large number of individual glyphs to render. Furthermore, it shades the color of the data points according to the local density.
Let's make a datashaded version of this plot. Note, though, that HoloViews currently cannot display Datashaded plots with a log axis, so we have to manually compute the logarithms for the data set.
We start by making an hv.Points element, but do not render it.
# Compute log of scattering
df[['log SSC-A', 'log FSC-A']] = df[['SSC-A', 'FSC-A']].apply(np.log10)
# Generate HoloViews Points Element
points = hv.Points(
data=df,
kdims=['log SSC-A', 'log FSC-A'],
)
After we make the points element, we can datashade it using holoviews.operation.datashader.datashade(), which I have imported to use as hvds.datashade(). We should also apply dynamic spreading, which makes the size of each point bigger than a single pixel, which might be too small to see on the screen.
# Datashade with spreading of points
hvds.dynspread(
hvds.datashade(
points,
cmap=bootcamp_utils.hv_defaults.datashader_blues,
)
).opts(
width=350,
height=300,
padding=0.05,
show_grid=True,
)
We can zoom in and out of the image, and the raster gets re-rendered (but only in a live notebook; this does not work in the static HTML).
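The core idea behind datashading, binning points onto a pixel grid and coloring each pixel by how many points landed in it, can be sketched with NumPy alone. This is only a conceptual illustration with a made-up point cloud; it is not Datashader's actual implementation, which does considerably more (for example, re-binning on zoom and mapping counts to colors).

```python
import numpy as np

rng = np.random.default_rng(3252)

# Made-up point cloud standing in for the log-transformed scattering data
x = rng.normal(4.0, 0.3, size=100_000)
y = rng.normal(5.0, 0.2, size=100_000)

# Bin the points onto a grid with one bin per pixel of the image
width, height = 350, 300
counts, x_edges, y_edges = np.histogram2d(x, y, bins=(width, height))

# The per-pixel counts are what get mapped to color; the browser only
# receives an image of this fixed size, however many points there are
print(counts.shape, int(counts.sum()))
```

The image size depends only on the number of pixels, which is why the approach scales to very large data sets.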
An alternative to datashading is decimating. When we decimate, HoloViews automatically resamples the data to only display a maximum number of data points, given by the max_samples kwarg below. This puts a cap on the number of data points the browser has to render.
I generally prefer datashading because when you decimate, you do not plot all of your data, and which data points get omitted is random.
hv.operation.decimate(
points,
max_samples=1000
)
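The idea of decimation can be sketched as random subsampling without replacement. The following NumPy snippet is a conceptual illustration on made-up data, with a variable named max_samples to mirror the kwarg; the actual HoloViews operation may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(3252)

# Made-up data set with many points (two columns: x and y)
n_points = 100_000
data = rng.normal(size=(n_points, 2))

# Keep at most max_samples rows, chosen at random without replacement
max_samples = 1000
if n_points > max_samples:
    inds = rng.choice(n_points, size=max_samples, replace=False)
    decimated = data[inds]
else:
    decimated = data

print(decimated.shape)
```

Which 1,000 of the 100,000 points survive depends on the random draw, which is the drawback noted above.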
Datashader also works on lines and paths. As an example, here is a visualization of the scale invariance of random walks using Datashader.
# Make a random walk of 1 million steps
n_steps = 1000000
theta = np.random.uniform(low=0, high=2*np.pi, size=n_steps)
x = np.cos(theta).cumsum()
y = np.sin(theta).cumsum()
hvds.dynspread(
hvds.datashade(
hv.Path(
data=(x, y),
kdims=['x', 'y'],
),
cmap=bootcamp_utils.hv_defaults.datashader_blues,
).opts(
height=300,
width=350,
show_grid=True,
)
)
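One way to quantify the scale invariance is through the mean squared displacement: for a walk of unit-length steps in independent, uniformly random directions, the expected squared distance from the origin after n steps is n. Lengths therefore scale like the square root of the number of steps, so the walk looks statistically similar at different levels of zoom. The sketch below checks this numerically by averaging over many shorter walks; the numbers of walks and steps are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3252)

# Many independent walks of unit-length steps in random directions
n_walks, n_steps = 4000, 500
theta = rng.uniform(low=0, high=2 * np.pi, size=(n_walks, n_steps))
x = np.cos(theta).cumsum(axis=1)
y = np.sin(theta).cumsum(axis=1)

# Mean squared displacement from the origin after each step
msd = (x**2 + y**2).mean(axis=0)

# For this walk, the MSD after n steps should be close to n
print(msd[49], msd[499])
```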
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,bokeh,holoviews,datashader,jupyterlab