Exercise 4.4: Understanding and building ECDFs


[1]:
import numpy as np
import pandas as pd

import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()

As a reminder, the empirical cumulative distribution function for a set of data points, evaluated at x, is

ECDF(x) = fraction of data points ≤ x.
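
As a minimal sketch of this definition, the ECDF can be evaluated at any single x by counting. The helper name `ecdf_at` and the four-point array are made up for illustration and are not part of the exercise.

```python
import numpy as np


def ecdf_at(data, x):
    """Fraction of entries in `data` that are less than or equal to `x`.

    Hypothetical helper, used only to illustrate the definition.
    """
    # The comparison gives a Boolean array; its mean is the fraction of True
    return np.mean(np.asarray(data) <= x)


example = np.array([2.0, 0.5, 1.5, 3.0])

# Two of the four points (0.5 and 1.5) are <= 1.5, so the ECDF there is 0.5
value = ecdf_at(example, 1.5)
```

Evaluating below the smallest data point gives 0, and evaluating at or above the largest gives 1.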

In a previous lesson, we saw that if we have a Pandas Series stored in a variable named data, we can compute the values for the ECDF as data.rank(method='first') / len(data). With the data in a Pandas Series, we can use this method to make an ECDF plot as follows, which is more or less what happens under the hood in bokeh_catplot.ecdf().

[2]:
# Dummy data set as a Pandas Series
rg = np.random.default_rng()
data = pd.Series(rg.normal(0, 1, size=100))

# Compute y-values for ECDF
ecdf_y = data.rank(method='first') / len(data)

# Make the plot
p = bokeh.plotting.figure(
    frame_height=200,
    frame_width=300,
    x_axis_label='x',
    y_axis_label='ECDF',
)

p.circle(data, ecdf_y)

bokeh.io.show(p)
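
To see what rank(method='first') is doing in the cell above, it can help to try it on a tiny Series by hand. The four values below are an assumed toy example, chosen to include a tie.

```python
import pandas as pd

# Toy data with a repeated value (1.0 appears twice)
small = pd.Series([3.0, 1.0, 2.0, 1.0])

# method='first' breaks ties by order of appearance, so the two 1.0 entries
# get ranks 1 and 2 instead of sharing an averaged rank
ranks = small.rank(method='first')

# Dividing by the number of points converts ranks to ECDF heights
ecdf_small = ranks / len(small)
```

The result keeps the original ordering of the data: each entry of ecdf_small is the ECDF height of the dot for the corresponding entry of small.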

The rank() method is not available for NumPy arrays. What we would like is a function we can call with a NumPy array as input that returns the x and y values for plotting, without having to convert the array to a Pandas Series first. Plus, doing this exercise helps you understand what an ECDF is, and will therefore help you interpret one.

Write a function with call signature

ecdf_vals(data)

which takes a one-dimensional NumPy array of data and returns the x and y values for plotting a “dot-style” ECDF. That is, each dot has a y value given by the ECDF evaluated at x.

When writing a function, you should generally include a detailed docstring and check the inputs. However, here the focus is on your understanding of ECDFs and developing skills with NumPy, so you do not need a very descriptive docstring or much input checking.

Hint: The functions np.sort() and np.arange() may be useful.

Solution

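To construct the ECDF, we sort the data to get the x-values. Given that they are sorted, the y-value of the ECDF at the nth data point in the sorted x-array (where n starts at 1) is n/N, where N is the total number of data points, since, absent ties, exactly n of the N data points are less than or equal to it.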

[3]:
def ecdf_vals(data):
    """Return x and y values for an ECDF."""
    return np.sort(data), np.arange(1, len(data) + 1) / len(data)
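
As a quick sanity check, here is a sketch that calls the function on dummy data and compares the result with the Pandas rank() approach from earlier. The seed and the variable names `y_rank`, `order`, and `agree` are arbitrary choices for illustration; `ecdf_vals` is repeated so the cell runs on its own.

```python
import numpy as np
import pandas as pd


def ecdf_vals(data):
    """Return x and y values for an ECDF."""
    return np.sort(data), np.arange(1, len(data) + 1) / len(data)


# Dummy data as a NumPy array (seed chosen arbitrarily for reproducibility)
rg = np.random.default_rng(3252)
data = rg.normal(0, 1, size=100)

x, y = ecdf_vals(data)

# ECDF heights from the Pandas approach, in the original data order
y_rank = (pd.Series(data).rank(method='first') / len(data)).to_numpy()

# Reorder them to match the sorted x-values and compare
order = np.argsort(data, kind='stable')
agree = np.allclose(y, y_rank[order])
```

If agree is True, the two approaches give the same dots, and x and y can be passed to p.circle() in place of data and ecdf_y in the plotting cell above.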

Computing environment

[4]:
%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,jupyterlab
Python implementation: CPython
Python version       : 3.11.3
IPython version      : 8.12.0

numpy     : 1.24.3
pandas    : 1.5.3
bokeh     : 3.1.1
jupyterlab: 3.6.3