Lesson 34: High-level plotting with HoloViews

(c) 2019 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This lesson was generated from a Jupyter notebook. You can download the notebook here.


In [1]:
import numpy as np
import scipy.special
import pandas as pd

import bootcamp_utils.hv_defaults

import bokeh.io
import bokeh.palettes

import holoviews as hv
import holoviews.operation.datashader as hvds

bokeh.io.output_notebook()
hv.extension('bokeh')

Introduction to HoloViews

HoloViews is a high-level plotting library that is part of the PyViz ecosystem. It allows specification of plots, and is agnostic about what is used to render them. We will use Bokeh as our renderer. Importantly, HoloViews provides convenient access to Datashader.

To set this up, we import HoloViews (as hv) and then set the HoloViews extension to be Bokeh using hv.extension('bokeh') at the top of the notebook. We also need to import

Main ideas behind HoloViews

Imagine you have a tidy data set (and HoloViews really only works with tidy data sets). It is already logically organized; each row is an observation and each column a variable. Let us think for a moment conceptually (that is, not in terms of steps of coding) about how we might make a scatter plot from a tidy data frame. We need to (obviously) first decide that we want to make a scatter plot, i.e., we specify what kind of graphic element we want to convert our data set into. Then, we need to annotate the columns of the data frame. That is, we need to annotate which column will determine the x-coordinate of the glyphs in the scatter plot and which will determine the y-coordinate of the glyphs. After we have made these decisions, that is, what kind of graphic element we want to produce and what columns give the x-coordinates and what gives the y-coordinates, the fundamental plot is complete. Everything else is visual styling.
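
As a minimal sketch of what "tidy" means here, the snippet below reshapes a small wide-format table (one column per species) into tidy form with pandas. The numbers and the name df_wide are made up for illustration and are not taken from the finch data set.

```python
import pandas as pd

# Made-up wide-format data: one column of beak depths (mm) per species
df_wide = pd.DataFrame(
    {
        "fortis": [8.0, 10.4, 9.5],
        "scandens": [8.4, 8.8, 9.0],
    }
)

# Tidy version: each row is one observation, each column one variable
df_tidy = df_wide.melt(var_name="species", value_name="beak depth (mm)")

print(df_tidy)
```

In the tidy version, the species is a variable in its own column rather than being encoded in the column names, which is what lets a column be annotated as a dimension of a plot.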

The philosophy of HoloViews, right on the front of the webpage, is "Stop plotting your data—annotate your data and let it visualize itself." With HoloViews, you add minimal annotations to your (tidy; must be tidy!) data to enable visualization. You can then later stylize the visualization, but the annotation is sufficient to describe the plot. Specifically, the annotations you need are:

  1. What kind of plotting element are you making (e.g., scatter, box-and-whisker, heat map, etc.).
  2. What columns specify the dimensions of the data, needed to set up axes.

Once you make those annotations, HoloViews can take care of the rendering, using either Matplotlib, Bokeh, or Plotly. The main idea is that HoloViews objects are conceptual, agnostic to the particulars of rendering. You can stylize the rendering if you like, but the fundamentals of the plotting object are already set by the annotation.

Importing HoloViews and choosing a renderer

HoloViews is imported as hv, which we have done in the cell at the top of this notebook. Because HoloViews is agnostic to the ultimate renderer, we need to specify an extension, which we did above by executing hv.extension('bokeh'). Our plots will now be rendered using Bokeh.

Note that you must install the appropriate JupyterLab extension to view HoloViews plots. You did this in Lesson 0 with

jupyter labextension install @pyviz/jupyterlab_pyviz

An example: A scatter plot of finch beak lengths and depths

As an example of use of HoloViews, we will again visit the Grant and Grant finch beak data. We will load it in and take a look.

In [2]:
df = pd.read_csv('data/grant_complete.csv')
df.head()
Out[2]:
    band  beak depth (mm)  beak length (mm)  species  year
0  20123             8.05              9.25   fortis  1973
1  20126            10.45             11.35   fortis  1973
2  20128             9.55             10.15   fortis  1973
3  20129             8.75              9.95   fortis  1973
4  20133            10.15             11.55   fortis  1973

We will now make a plot and explain how the syntax relates to the ideas behind annotating data sets. We will make a simple scatter plot of the beak length vs. beak depth for all birds measured in 2012.

In [3]:
df_2012 = df.loc[df['year']==2012, :]

hv.Points(
    data=df_2012,
    kdims=['beak length (mm)', 'beak depth (mm)'],
    vdims=['species'],
)
Out[3]:

Specification of the element type

We used hv.Points to invoke an element of visualization. An element is just a way of converting the tabular nature of the data to a graphical representation, in this case a scatter plot of points. That is, we want to make a plot where each glyph lies in a two-dimensional plot and the values of both the x- and y-axes are independent. (This is contrasted with hv.Scatter in which the x-coordinate is the independent variable and the y-coordinate is dependent on x; hv.Points is more appropriate here.)

The available element types may be found in the HoloViews reference gallery.

Specification of dimensions

There are two types of dimensions, key dimensions and value dimensions, specified with the kdims and vdims arguments, respectively. You can think of them like the keys and values of a dictionary (where the keys may be multidimensional). Key dimensions are indexing dimensions, which say where on the graphic the data in a row will reside. The value dimensions give information about each data point. In the simple plot above, the key dimensions are the beak length and beak depth. Those columns determined where the glyphs were placed.

We additionally had a value dimension, specified by vdims, which has additional information associated with each data point. This information was not used in the above plot, but we will put it to use momentarily.
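
The dictionary analogy can be made concrete with a plain Python dictionary. This is only an illustration of the idea, not how HoloViews stores data; the rows and the name points are made up.

```python
# Made-up rows: (beak length, beak depth, species)
rows = [
    (9.25, 8.05, "fortis"),
    (11.35, 10.45, "fortis"),
    (14.3, 9.4, "scandens"),
]

# Key dimensions act like a (two-dimensional) dictionary key;
# value dimensions act like the value stored under that key
points = {
    (length, depth): {"species": species} for length, depth, species in rows
}

# The key says where the glyph goes; the value is extra information about it
print(points[(9.25, 8.05)]["species"])
```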

Stylizing plots

After a plotting Element is specified, we can stylize it using the hv.opts functionality. To investigate what styling options are available for each kind of plotting Element, you can enter, for example

hv.help(hv.Points)

and you will get detailed information on what options are available for stylizing hv.Points elements.

I find the HoloViews defaults not very pleasing. If you agree and want to define defaults for an entire document, you may do so using hv.opts.defaults(). I have made some defaults that I find more pleasing that are available in the bootcamp_utils.hv_defaults.set_defaults() function. Let's set those defaults (which will be active for the rest of the notebook), and see how our plot looks.

In [4]:
bootcamp_utils.hv_defaults.set_defaults()

hv.Points(
    data=df_2012,
    kdims=['beak length (mm)', 'beak depth (mm)'],
    vdims=['species'],
)
Out[4]:

Grouping by value dimensions

Recall that we have an unused value dimension in the element we created. We would naturally like to separate out the glyphs by species. To do this, we can do a groupby operation on the Element. That's right, we can do groupby operations on graphical elements!

In [5]:
hv.Points(
    data=df_2012,
    kdims=['beak length (mm)', 'beak depth (mm)'],
    vdims=['species'],
).groupby(
    'species'
)
Out[5]:

We now have a pull-down menu to the right of the plot where we can select the species we want, and the glyphs on the plot will adjust accordingly. By default, after applying the groupby operation, HoloViews gives us a HoloMap object. The column we used to group by is now selectable through a graphical interface (a pull-down menu).
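
Conceptually, grouping splits the rows into one subset per species, and the pull-down menu chooses which subset is drawn. The sketch below shows the analogous split in pandas on a made-up data frame (df_demo is hypothetical); it is an analogy, not a description of how HoloViews implements its groupby.

```python
import pandas as pd

# Made-up measurements for two species
df_demo = pd.DataFrame(
    {
        "beak length (mm)": [10.5, 12.1, 13.9, 14.4],
        "beak depth (mm)": [8.3, 9.0, 9.1, 9.6],
        "species": ["fortis", "fortis", "scandens", "scandens"],
    }
)

# One sub-frame per species, keyed by the species name
subsets = {species: sub_df for species, sub_df in df_demo.groupby("species")}

for species, sub_df in subsets.items():
    print(species, len(sub_df))
```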

We may instead wish to group by species and lay the plots out next to each other, creating a layout. We can use the layout() method to do this.

In [6]:
hv.Points(
    data=df_2012,
    kdims=['beak length (mm)', 'beak depth (mm)'],
    vdims=['species'],
).groupby(
    'species'
).layout(
)
Out[6]:

Finally, we may wish to overlay the plots for each species on a single set of axes.

In [7]:
hv.Points(
    data=df_2012,
    kdims=['beak length (mm)', 'beak depth (mm)'],
    vdims=['species'],
).groupby(
    'species'
).overlay(
)
Out[7]:

HoloViews was kind enough to automatically provide us with a legend!

Further stylizing

As an example of how to use the .opts() method to stylize a plot, we can use .opts() to add tooltips where we can hover and get additional information from the vdims.

In [8]:
hv.Points(
    data=df_2012,
    kdims=['beak length (mm)', 'beak depth (mm)'],
    vdims=['species'],
).groupby(
    'species'
).opts(
    tools=['hover']
).overlay(
)
Out[8]:

As a final example of constructing this plot, let's consider the entire data set and allow the year to be selected via a HoloMap, but color by species for each year. (We have to select show_legend=False because of a bug in laying out HoloMaps with legends.)

In [11]:
hv.Points(
    data=df,
    kdims=['beak length (mm)', 'beak depth (mm)'],
    vdims=['species', 'year'],
).groupby(
    ['species', 'year'],
).opts(
    tools=['hover'],
    show_legend=False,
).overlay(
    'species',
)
Out[11]:

Extracting the Bokeh plotting object

After making and displaying a HoloViews plot, we might want to get the Bokeh figure. We can extract that using hv.render().

In [12]:
hv_fig = hv.Points(
    data=df,
    kdims=['beak length (mm)', 'beak depth (mm)'],
    vdims=['species', 'year'],
).groupby(
    ['species', 'year'],
).opts(
    tools=['hover'],
    show_legend=False,
).overlay(
    'species',
)

# Take out the Bokeh object
p = hv.render(hv_fig)

# Display using Bokeh
bokeh.io.show(p)

Note that we got the plot for 1973, which was the first year offered by the interactive HoloMap. If we wanted another year, we would have to make a plot specifically for that year.

Other kinds of plots

We have seen the basics of how HoloViews works for a scatter plot specified by hv.Points. We now show some other kinds of plots we have encountered so far.

Smooth function

HoloViews can plot a smooth function using the hv.Curve. For a Curve, there is one key dimension, which is the independent variable, and one value dimension, which is the dependent variable. This is to be contrasted with hv.Path, which has two key dimensions, meaning that neither of the variables is strictly dependent on the other.

Here is a HoloViews plot of a cross section of the Airy disk. We can either provide a data frame with columns, or we can provide a 2-tuple of NumPy arrays that serve as the independent and dependent variables, respectively.

In [13]:
# The x-values we want
x = np.linspace(-15, 15, 400)

# The normalized intensity
norm_I = 4 * (scipy.special.j1(x) / x)**2

hv.Curve(
    data=(x, norm_I),
    kdims='x',
    vdims='normalized intensity'
)
Out[13]:

Box plot

Box plots are made using hv.BoxWhisker elements. If multiple key dimensions are specified, nested categorical axes are automatically set up.

Note: HoloViews's box plots currently use a non-canonical definition of the whisker length that will be adjusted in a future version.
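
For reference, one common convention for whisker length is Tukey's: each whisker extends to the most extreme data point within 1.5 times the interquartile range of the box. That this is the "canonical" definition meant above is an assumption. The function name tukey_whiskers is hypothetical, and the last data value is a made-up outlier added for illustration.

```python
import numpy as np


def tukey_whiskers(data):
    """Return (lower, upper) whisker ends under the 1.5 * IQR convention."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1

    # Whiskers reach the most extreme data points within 1.5 IQR of the box
    lower = data[data >= q1 - 1.5 * iqr].min()
    upper = data[data <= q3 + 1.5 * iqr].max()

    return lower, upper


# Five beak depths plus one made-up outlier at 15.0 mm
depths = [8.05, 10.45, 9.55, 8.75, 10.15, 15.0]
lower, upper = tukey_whiskers(depths)

print(lower, upper)
```

Points beyond the whiskers, like the 15.0 mm value here, are the ones conventionally drawn as individual outliers.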

In [14]:
hv.BoxWhisker(
    data=df,
    kdims=['species', 'year'],
    vdims=['beak depth (mm)'],
).opts(
    box_color='species'
)
Out[14]:

Strip plots

We use hv.Scatter() to generate strip plots. The jitter kwarg specifies the width of the jitter.

Note that nested categorical axes are currently (as of July 19, 2019) only supported for box, violin, and bar plots, as per the docs. They will eventually be supported for many more plot types, including Scatter, which is used to generate strip plots.

In [15]:
# Make the year column a string so we can use it as a categorical variable
df['year_str'] = df['year'].astype(str)

hv.Scatter(
    data=df,
    kdims=[('year_str', 'year')],
    vdims=['beak depth (mm)', 'species'],
).groupby(
    'species'
).opts(
    color='species',
    cmap=bootcamp_utils.hv_defaults.default_cmap,
    jitter=0.4,
    show_legend=False,
    width=400,
    height=250,
).layout(
)
Out[15]:

Histograms

When making a histogram, the values of the bin edges and counts must be computed beforehand using np.histogram().

In [16]:
# np.histogram() returns the counts first, then the bin edges
counts, edges = np.histogram(df_2012['beak depth (mm)'], bins=int(np.sqrt(len(df_2012))))
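
As a quick check of what np.histogram() hands back, here is a small sketch on made-up, normally distributed data (the seed and parameters are arbitrary). The function returns the counts first and the bin edges second, and there is always one more edge than there are bins.

```python
import numpy as np

# Made-up measurements standing in for beak depths
rng = np.random.RandomState(3252)
data = rng.normal(loc=9.0, scale=1.0, size=100)

# Square-root rule for the number of bins, as in the cell above
counts, edges = np.histogram(data, bins=int(np.sqrt(len(data))))

# 10 bins have 11 edges, and every data point lands in some bin
print(len(counts), len(edges), counts.sum())
```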

We then can pass the bin edges and counts into hv.Histogram().

In [17]:
bootcamp_utils.hv_defaults.set_defaults()
hv.Histogram(
    data=(edges, counts),
    kdims='beak depth (mm)'
)
Out[17]:

ECDFs

HoloViews does not have native support for ECDFs, but we can create ECDFs in a data frame and use hv.Scatter to make a plot of an ECDF.

In [18]:
def ecdf_transform(data):
    return data.rank(method="first") / len(data)

# Work on an explicit copy of the 2012 slice to avoid a SettingWithCopyWarning
df_2012 = df_2012.copy()

df_2012.loc[:, "beak depth ECDF"] = df_2012.groupby("species")[
    "beak depth (mm)"
].transform(ecdf_transform)
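
To see what the rank-based transform does, here is a sketch that applies the same function to a tiny made-up data frame (df_demo and its values are hypothetical). Within each species, the smallest value gets ECDF value 1/n and the largest gets 1.

```python
import pandas as pd


def ecdf_transform(data):
    """ECDF value of each entry: its rank divided by the number of entries."""
    return data.rank(method="first") / len(data)


# Made-up beak depths for two species
df_demo = pd.DataFrame(
    {
        "species": ["fortis", "fortis", "fortis", "scandens", "scandens"],
        "beak depth (mm)": [9.5, 8.0, 10.5, 9.0, 8.5],
    }
)

# Compute the ECDF separately within each species
df_demo["beak depth ECDF"] = df_demo.groupby("species")[
    "beak depth (mm)"
].transform(ecdf_transform)

print(df_demo)
```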

After supplying the y-values for the ECDF, we plot with hv.Scatter.

In [19]:
hv.Scatter(
    data=df_2012,
    kdims='beak depth (mm)',
    vdims=[('beak depth ECDF', 'ECDF'), 'species'],
).groupby(
    'species'
).overlay(
)
Out[19]:

Large data sets

We are often faced with data sets that have too many points to plot, and when overlaid, the glyphs obscure each other. One strategy to deal with this is to specify transparency of the glyphs so we can visualize where they are dense and where they are sparse.

As an example, we will consider a flow cytometry data set consisting of 100,000 data points. The data appeared in Razo-Mejia, et al., Cell Systems, 2018. First, we'll read in the data and take a look.

In [20]:
df = pd.read_csv('data/20160804_wt_O2_HG104_0uMIPTG.csv', comment='#', index_col=0)

df.head()
Out[20]:
HDR-T FSC-A FSC-H FSC-W SSC-A SSC-H SSC-W FITC-A FITC-H FITC-W APC-Cy7-A APC-Cy7-H APC-Cy7-W
0 0.418674 6537.148438 6417.625000 133513.125000 24118.714844 22670.142578 139447.218750 11319.865234 6816.254883 217673.406250 55.798954 255.540833 28620.398438
1 2.563462 6402.215820 5969.625000 140570.171875 23689.554688 22014.142578 141047.390625 1464.151367 5320.254883 36071.437500 74.539612 247.540833 39468.460938
2 4.921260 5871.125000 5518.852539 139438.421875 16957.433594 17344.511719 128146.859375 5013.330078 7328.779785 89661.203125 -31.788519 229.903214 -18123.212891
3 5.450112 6928.865723 8729.474609 104036.078125 13665.240234 11657.869141 153641.312500 879.165771 6997.653320 16467.523438 118.226028 362.191162 42784.375000
4 9.570750 11081.580078 6218.314453 233581.765625 43528.683594 22722.318359 251091.968750 2271.960693 9731.527344 30600.585938 20.693352 210.486893 12885.928711

We can make a scatter plot of the forward versus side scattering (FSC-A vs. SSC-A) of each measurement, as is often done. We will only use 5,000 data points so as not to choke the browser with all 100,000 (though we of course want to plot all 100,000). The difficulty with large data sets will become clear.

In [21]:
hv.Points(
    data=df.iloc[::20, :],
    kdims=['SSC-A', 'FSC-A'],
).opts(
    logx=True,
    logy=True,
    fill_alpha=0.05,
    line_alpha=0,
    fill_color='dodgerblue',
    padding=0.05
)
Out[21]:

The transparency helps us see where the density is, but it washes out all of the detail for points away from dense regions. There is the added problem that we cannot populate the plot with too many glyphs, so we can't plot all of our data.

HoloViews's integration with Datashader allows us to plot all points for millions to billions of points (so 100,000 is a piece of cake!). It works like Google Maps: it displays raster images on the plot that show the level of detail of the data points appropriate for the level of zoom. It adjusts the image as you interact with the plot, so the browser never gets hit with a large number of individual glyphs to render. Furthermore, it shades the color of the data points according to the local density.
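
The core idea of rasterizing can be sketched with a two-dimensional histogram: the points are aggregated onto a fixed grid of pixels, and only the grid is drawn. This is a rough NumPy illustration of the concept on made-up data, not Datashader's actual implementation.

```python
import numpy as np

# Made-up point cloud standing in for a large data set
rng = np.random.RandomState(3252)
x = rng.normal(size=100000)
y = rng.normal(size=100000)

# Aggregate the points onto a fixed 350-by-300 grid of bins; each entry
# counts the points falling in that bin, however large the data set is
counts, x_edges, y_edges = np.histogram2d(x, y, bins=(350, 300))

# Only the raster would need to be drawn, not the individual glyphs
print(counts.shape, int(counts.sum()))
```

The size of the raster is set by the plot dimensions rather than by the number of data points, which is why the browser's workload stays bounded. The counts per bin are the kind of local density information that can be mapped to color.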

Let's make a datashaded version of this plot. Note, though, that HoloViews currently cannot display Datashaded plots with a log axis, so we have to manually compute the logarithms for the data set.

We start by making an hv.Points element, but do not render it.

In [22]:
# Compute log of scattering
df[['log SSC-A', 'log FSC-A']] = df[['SSC-A', 'FSC-A']].apply(np.log10)

# Generate HoloViews Points Element
points = hv.Points(
    data=df,
    kdims=['log SSC-A', 'log FSC-A'],
)
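
One caveat when computing logarithms by hand: np.log10 gives NaN for negative inputs and -inf for zero. Whether the SSC-A and FSC-A columns of this data set contain non-positive values is not stated here, so the following is only a sketch, on made-up numbers, of one way to guard against them by keeping only rows where both values are positive.

```python
import numpy as np
import pandas as pd

# Made-up scattering values, including a zero and a negative entry
df_demo = pd.DataFrame(
    {
        "SSC-A": [24118.7, 0.0, 16957.4],
        "FSC-A": [6537.1, 6402.2, -12.5],
    }
)

# Keep only rows where both scattering values are strictly positive
positive = (df_demo[["SSC-A", "FSC-A"]] > 0).all(axis=1)
df_pos = df_demo.loc[positive, :].copy()

# Logarithms are now well defined for every remaining row
df_pos["log SSC-A"] = np.log10(df_pos["SSC-A"])
df_pos["log FSC-A"] = np.log10(df_pos["FSC-A"])

print(df_pos)
```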

After we make the points element, we can datashade it using holoviews.operation.datashader.datashade(), which I have imported to use as hvds.datashade(). We should also apply dynamic spreading, which makes the size of each point bigger than a single pixel, which might be too small to see on the screen.

In [23]:
# Datashade with spreading of points
hvds.dynspread(
    hvds.datashade(
        points, 
        cmap=bootcamp_utils.hv_defaults.datashader_blues,
    )
).opts(
    width=350, 
    height=300, 
    padding=0.05,
    show_grid=True,
)
Out[23]:

We can zoom in and out of the image, and the raster gets re-rendered (but only in a live notebook; this does not work in the static HTML).

Decimating

An alternative to datashading is decimating. When we decimate, HoloViews automatically resamples the data to only display a maximum number of data points, given by the max_samples kwarg below. This puts a cap on the number of data points the browser has to render.

I generally prefer datashading because when you decimate, you do not plot all of your data, and which data points get omitted is random.
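
The idea of capping the number of plotted points by random subsampling can be sketched in NumPy as follows. This is an illustration of the concept only; the function random_subsample is hypothetical and is not how HoloViews implements decimation.

```python
import numpy as np


def random_subsample(x, y, max_samples, seed=None):
    """Return at most max_samples randomly chosen (x, y) pairs."""
    x, y = np.asarray(x), np.asarray(y)

    # Nothing to do if the data set is already small enough
    if len(x) <= max_samples:
        return x, y

    # Choose distinct indices at random so no point is repeated
    rng = np.random.RandomState(seed)
    inds = rng.choice(len(x), size=max_samples, replace=False)

    return x[inds], y[inds]


# Made-up data: 100,000 points on a line
x = np.arange(100000)
y = 2 * x

x_sub, y_sub = random_subsample(x, y, max_samples=1000, seed=3252)
print(len(x_sub))
```

Which 1000 points survive depends on the random draw, so sparse features of the data may or may not appear in any given subsample.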

In [24]:
hv.operation.decimate(
    points,
    max_samples=1000
)
Out[24]:

Datashaded random walk

Datashader also works on lines and paths. As an example, here is a visualization of the scale invariance of random walks using Datashader.

In [25]:
# Make a random walk of 1 million steps
n_steps = 1000000
theta = np.random.uniform(low=0, high=2*np.pi, size=n_steps)
x = np.cos(theta).cumsum()
y = np.sin(theta).cumsum()

hvds.dynspread(
    hvds.datashade(
        hv.Path(
            data=(x, y),
            kdims=['x', 'y'],
        ),
        cmap=bootcamp_utils.hv_defaults.datashader_blues,
    ).opts(
        height=300,
        width=350,
        show_grid=True,
    )
)
Out[25]:
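
A related property of this kind of walk, with unit-length steps in independent, uniformly random directions, is that the mean squared displacement after n steps is expected to equal n. The sketch below checks this numerically using many short walks rather than one long one; the number of walks, the walk length, and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.RandomState(3252)
n_walks, n_steps = 2000, 400

# Unit steps in uniformly random directions, one walk per row
theta = rng.uniform(low=0, high=2 * np.pi, size=(n_walks, n_steps))
x = np.cos(theta).cumsum(axis=1)
y = np.sin(theta).cumsum(axis=1)

# Mean squared displacement after each step, averaged over the walks
msd = (x**2 + y**2).mean(axis=0)

# Compare with the step counts 100 and 400
print(msd[99], msd[399])
```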

Computing environment

In [27]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,bokeh,holoviews,datashader,jupyterlab
CPython 3.7.3
IPython 7.1.1

numpy 1.16.4
scipy 1.2.1
pandas 0.24.2
bokeh 1.2.0
holoviews 1.12.3
datashader 0.7.0
jupyterlab 0.35.5