Lesson 43: Dealing with overplotting
[1]:
import numpy as np
import pandas as pd
import bokeh.io
import bokeh.plotting
import bokeh.util.hex
# HoloViews is a high-level plotting package
import holoviews as hv
import holoviews.operation.datashader
hv.extension('bokeh')
bokeh.io.output_notebook()
We are often faced with data sets that have too many points to plot and when overlayed, the glyphs obscure each other. This is called overplotting. In this lesson, we will explore strategies for dealing with this issue.
The data set
As an example, we will consider a flow cytometry data set consisting of 100,000 data points. The data appeared in Razo-Mejia, et al., Cell Systems, 2018, and the authors provided the data here. We will consider one CSV file from one flow cytometry experiment extracted from this much larger data set, available in the ~/git/bootcamp/data/
folder.
Let’s read in the data and take a look.
[2]:
df = pd.read_csv('data/20160804_wt_O2_HG104_0uMIPTG.csv', comment='#', index_col=0)
df.head()
[2]:
HDR-T | FSC-A | FSC-H | FSC-W | SSC-A | SSC-H | SSC-W | FITC-A | FITC-H | FITC-W | APC-Cy7-A | APC-Cy7-H | APC-Cy7-W | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.418674 | 6537.148438 | 6417.625000 | 133513.125000 | 24118.714844 | 22670.142578 | 139447.218750 | 11319.865234 | 6816.254883 | 217673.406250 | 55.798954 | 255.540833 | 28620.398438 |
1 | 2.563462 | 6402.215820 | 5969.625000 | 140570.171875 | 23689.554688 | 22014.142578 | 141047.390625 | 1464.151367 | 5320.254883 | 36071.437500 | 74.539612 | 247.540833 | 39468.460938 |
2 | 4.921260 | 5871.125000 | 5518.852539 | 139438.421875 | 16957.433594 | 17344.511719 | 128146.859375 | 5013.330078 | 7328.779785 | 89661.203125 | -31.788519 | 229.903214 | -18123.212891 |
3 | 5.450112 | 6928.865723 | 8729.474609 | 104036.078125 | 13665.240234 | 11657.869141 | 153641.312500 | 879.165771 | 6997.653320 | 16467.523438 | 118.226028 | 362.191162 | 42784.375000 |
4 | 9.570750 | 11081.580078 | 6218.314453 | 233581.765625 | 43528.683594 | 22722.318359 | 251091.968750 | 2271.960693 | 9731.527344 | 30600.585938 | 20.693352 | 210.486893 | 12885.928711 |
Each row is a single object (putatively a single cell) that passed through the flow cytometer. Each column corresponds to a variable in the measurement. As a first step in analyzing flow cytometry data, we perform gating in which properly separated cells (as opposed to doublets or whatever other debris goes through the cytometer) are chosen based on their front and side scattering. Typically, we make a plot of front scattering (FSC-A) versus side scattering (SSC-A) on a log-log scale. This is the plot we want to make in this part of the lesson.
Let’s plot
Let’s explore this data using the tools we seen so far. This is just a simple call to HoloViews. We will only use 5,000 data points so as not to choke the browser with all 100,000 (though we of course want to plot all 100,000). The difficulty with large data sets will become clear.
[3]:
p = bokeh.plotting.figure(
frame_height=300,
frame_width=300,
x_axis_label="SSC-A",
y_axis_label="FSC-A",
x_axis_type="log",
y_axis_type="log",
)
source = bokeh.models.ColumnDataSource(df.iloc[::20, :])
p.scatter(source=source, x="SSC-A", y="FSC-A", marker="circle")
bokeh.io.show(p)
Yikes. We see that many of the glyphs are obscuring each other, preventing us from truly visualizing how the data are distributed. We need to explore options for dealing with this overplotting.
Applying transparency
One strategy to deal with overplotting is to specify transparency of the glyphs so we can visualize where they are dense and where they are sparse. We specify transparency with the fill_alpha
and line_alpha
options for the glyphs. An alpha value of zero is completely transparent, and one is completely opaque.
[4]:
p = bokeh.plotting.figure(
frame_height=300,
frame_width=300,
x_axis_label="SSC-A",
y_axis_label="FSC-A",
x_axis_type="log",
y_axis_type="log",
)
source = bokeh.models.ColumnDataSource(df.iloc[::20, :])
p.scatter(source=source, x="SSC-A", y="FSC-A", fill_alpha=0.05, line_alpha=0)
bokeh.io.show(p)
The transparency helps us see where the density is, but it washes out all of the detail for points away from dense regions. There is the added problem that we cannot populate the plot with too many glyphs, so we can’t plot all of our data. We should see alternatives.
Plotting the density of points
We could instead make a hex-tile plot, which is like a two-dimensional histogram. The space in the x-y plane is divided up into hexagons, which are then colored by the number of points that we in each bin.
Bokeh’s implementation of hex tiles does not cleanly allow for logarithmic axes, so we need to compute the logarithms by hand.
[5]:
df[['log₁₀ SSC-A', 'log₁₀ FSC-A']] = df[['SSC-A', 'FSC-A']].apply(np.log10)
Bokeh provides a useful function, bokeh.util.hex.hexbin()
, that takes a set of (x, y) data points, divides the space of the plot up into hexagonal segments, and counts the number of points that lie within each hexagon. The result is a data frame with columns 'q'
and 'r'
that give the centers of the hex tiles as coordinates on an integer-based hexagonal lattice. Column 'counts'
gives the count of points lying within each hexagon. The size of the hexagons are set by the size
argument.
When computing the hexbins, it is important to remove any NaNs.
[6]:
# Side of hexagons, center to vertex
hexsize = 0.05
# Data with NaN's dropped
df_nonan = df[['log₁₀ SSC-A', 'log₁₀ FSC-A']].dropna()
hexbins = bokeh.util.hex.hexbin(df_nonan['log₁₀ SSC-A'], df_nonan['log₁₀ FSC-A'], size=hexsize)
# Convert to column data source
source = bokeh.models.ColumnDataSource(hexbins)
Now, we can make a plot and populate it with the hextile glyphs.
[7]:
p = bokeh.plotting.figure(
frame_height=300,
frame_width=300,
x_axis_label="log₁₀ SSC-A",
y_axis_label="log₁₀ FSC-A",
match_aspect=True,
)
coloring = bokeh.transform.linear_cmap(
"counts", "Viridis256", 0, max(hexbins.counts)
)
p.hex_tile(
source=source,
q="q",
r="r",
size=hexsize,
line_color=coloring,
fill_color=coloring,
)
bokeh.io.show(p)
This plot is useful for seeing the distribution of points, but is not really a plot of all the data. We should strive to do better. Enter DataShader.
Datashader
Datashader allows us to plot all points for millions to billions of points (so 100,000 is a piece of cake!). It works like Google Maps: it displays raster images on the plot that show the level of detail of the data points appropriate for the level of zoom. It adjusts the image as you interact with the plot, so the browser never gets hit with a large number of individual glyphs to render. Furthermore, it shades the color of the data points according to the local density.
To make a datashaded version of this plot directly using Bokeh, we have to write a fair amount of boilerplate code to get interactivity when we zoom in and out of the plot. We will skip this, and instead use the built-in Datashader support of HoloViews. HoloViews is a quite powerful high-level plotting package (we discuss it in an auxiliary lesson). For now, we will treat it more or less as a black box for generating plots of points and lines that can then be datashaded.
Note, though, that as was the case with hex tiles using Bokeh, HoloViews currently cannot display datashaded plots with a log axis, so we have to manually compute the logarithms for the data set.
We start by making an hv.Points
element, which is a scatter plot. It requires two keyword arguments, data
, which provides the data frame with the data to be plotted, and kdims
, which is a list containing two entries corresponding to the columns of the data frame that respectively give the x- and y-coordinates of the points.
[8]:
# Generate HoloViews Points Element
points = hv.Points(
data=df,
kdims=['log₁₀ SSC-A', 'log₁₀ FSC-A'],
)
After we make the points element, we can datashade it using holoviews.operation.datashader.datashade()
. Note that we need to explicitly import holoviews.operation.datashader
(which we did in the first code cell of this notebook) before using it.
[9]:
# Datashade
hv.operation.datashader.datashade(
points,
).opts(
frame_width=250,
frame_height=250,
padding=0.05,
show_grid=True,
)
[9]:
Note that the opts()
method specifies some of the properties of the plot.
In the HTML-rendered view of this notebook, the data set is rendered as a rastered image. In a running notebook, zooming in and out of the image results in re-display of the data as an image that is appropriate for the level of zoom. It is important to remember that actively using Datashader requires a running Python engine.
Note that, unlike with transparency, we can see each data point. This is the power of Datashader. Before we delve into the details, you may have a few questions:
First, what exactly is Datashader?
Datashader is not a plotting package, but a Python library that aggregates data (more on that in a bit) in a way that can then be rendered by your favorite plotting package: Holoviews, Bokeh, and others. This graphic from the Datashader documentation illustrates how all these packages interact with each other:
We see that the data analyst (i.e. you!) can harness the power of Datashader by using HoloViews as a convenient high-level wrapper. While you have the option to interact with Datashader (and Bokeh) directly, this is very cumbersome (HoloViews takes care of this for you) and is generally not necessary for our purposes.
Second, how does Datashader work?
Datashader is extremely powerful with a lot happening under the hood. It’s important to understand how Datashader works if you want to work beyond the default settings. The Datashader pipeline is as follows:
The first two steps are no different from what we’ve seen so far.
We start with a (tidy!) dataset that is well annotated for easy plotting.
We select the scene which we want to ultimately render. We are not yet plotting anything: we are just specifying what we would like to plot. That’s what we were doing when we called
points = hv.Points(
data=df,
kdims=['log₁₀ SSC-A', 'log₁₀ FSC-A'],
)
The next step is where the Datashader magic comes in.
Aggregation is the method by which the data are binned. In the example above, we binned by count, although there are other options. The beauty of Datashader is the this aggregation is re-computed quickly whenever you zoom in on your plot. This allows for rapid exploration of how your data are distributed at any scale.
Color mapping is the process by which the quantitative information of the aggregation is visualized. The default color map used above was varying shades of blue, but we could easily use a different color map by using the
cmap
keyword argument of the datashade operation.
[10]:
# Datashade
hv.operation.datashader.datashade(
points,
cmap=list(bokeh.palettes.Magma10),
).opts(
frame_width=250,
frame_height=250,
padding=0.05,
show_grid=True,
)
[10]:
And lastly comes the plotting.
As mentioned previously, Datashader is not a plotting package, but it computes rasterized images that can then be rendered nearly effortlessly with HoloViews or other plotting packages.
Now that we have a better handle on how Datashader work, let’s play around with more some more options.
Dynamic spreading
On some retina displays, a pixel is very small and seeing an individual point is difficult. To account for this, we can also apply dynamic spreading which makes the size of each point bigger than a single pixel.
[11]:
# Datashade with spreading of points
hv.operation.datashader.dynspread(
hv.operation.datashader.datashade(
points,
)
).opts(
frame_width=250,
frame_height=250,
padding=0.05,
show_grid=True,
)
[11]:
Datashading paths random walk
Datashader also works on lines and paths. In one of the rare cases where we do not use real data, I will use Datashader to visualize the scale invariance of random walks. First, I will generate a random walk of 10 million steps. For each step, the walker take a unit step in a random direction in a 2D plane.
[12]:
n_steps = 10000000
theta = np.random.uniform(low=0, high=2*np.pi, size=n_steps)
x = np.cos(theta).cumsum()
y = np.sin(theta).cumsum()
To plot lines with HoloViews, the syntax is the same as plotting points; we just use hv.Path()
instead of hv.Points()
. If we do not have the data in data frames, we can instead specify the data as a 2-tuple of Numpy arrays and the kdims
keyword argument them becomes the axis labels.
[13]:
path = hv.Path(
data=(x, y),
kdims=['x', 'y']
)
Now, we can datashade it!
[14]:
hv.operation.datashader.datashade(
path
).opts(
frame_height=300,
frame_width=300,
padding=0.05,
show_grid=True,
)
[14]:
Computing environment
[15]:
%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,holoviews,datashader,jupyterlab
Python implementation: CPython
Python version : 3.11.9
IPython version : 8.20.0
numpy : 1.26.4
pandas : 2.2.1
bokeh : 3.4.1
holoviews : 1.18.3
datashader: 0.16.2
jupyterlab: 4.0.13