Exercise 4.4: Understanding and building ECDFs
[1]:
import numpy as np
import pandas as pd
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
As a reminder, the empirical cumulative distribution function (ECDF) for a set of data points, evaluated at x, is
ECDF(x) = fraction of data points ≤ x.
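The definition above can be applied directly to a small array. The following is a minimal sketch; the helper name ecdf_at and the five-point data set are hypothetical and used only for illustration.

```python
import numpy as np

# Small hypothetical data set, chosen only for illustration
data = np.array([2.0, 1.0, 4.0, 3.0, 2.0])


def ecdf_at(data, x):
    """Fraction of data points less than or equal to x (hypothetical helper)."""
    # data <= x is a Boolean array; its mean is the fraction of True entries
    return np.mean(data <= x)


# Three of the five points (1.0, 2.0, 2.0) are <= 2.0
print(ecdf_at(data, 2.0))  # 0.6
```

Evaluating the function at each data point gives the y-values of a dot-style ECDF, though this direct approach compares every point against every other and is slower than a sort-based one for large data sets.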
In a previous lesson, we saw that if we have a Pandas Series stored in a variable named data, we can compute the values for the ECDF as data.rank(method='first') / len(data). If we had our data as a Pandas Series, we could use this method to make an ECDF plot as follows, which is more or less what happens under the hood in bokeh_catplot.ecdf().
[2]:
# Dummy data set as a Pandas Series
rg = np.random.default_rng()
data = pd.Series(rg.normal(0, 1, size=100))
# Compute y-values for ECDF
ecdf_y = data.rank(method='first') / len(data)
# Make the plot
p = bokeh.plotting.figure(
    frame_height=200,
    frame_width=300,
    x_axis_label='x',
    y_axis_label='ECDF',
)
p.scatter(data, ecdf_y)
bokeh.io.show(p)
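One detail of the rank-based approach is worth seeing on a concrete case: how tied values are handled. The sketch below uses a hypothetical four-point Series with a repeated value. With method='first', tied points get distinct ranks, so each dot sits at its own height; with method='max', tied points share the highest rank in their group, which matches the "fraction of data points ≤ x" definition exactly.

```python
import pandas as pd

# Hypothetical data with a repeated value, for illustration only
data = pd.Series([1.0, 2.0, 2.0, 3.0])

# method='first' breaks ties by order of appearance: ranks 1, 2, 3, 4
y_first = data.rank(method='first') / len(data)

# method='max' gives every tied point the highest rank in its group: ranks 1, 3, 3, 4
y_max = data.rank(method='max') / len(data)

print(y_first.to_list())  # [0.25, 0.5, 0.75, 1.0]
print(y_max.to_list())    # [0.25, 0.75, 0.75, 1.0]
```

For continuous data such as the normal draws above, ties are very unlikely and the two methods give the same plot.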
The rank() method is not available for NumPy arrays. What we would like is a function that we can call with a NumPy array as input and that returns the x and y values for plotting, without first converting the array to a Pandas Series. Plus, doing this exercise helps you understand what an ECDF is, and will therefore help you interpret one.
Write a function with call signature
ecdf_vals(data)
which takes a one-dimensional NumPy array of data and returns the x and y values for plotting a “dot-style” ECDF. That is, each dot has a y value given by the ECDF evaluated at x.
When writing a function, you should generally include a detailed docstring and check the inputs. Here, however, the focus is on your understanding of ECDFs and on developing skills with NumPy, so you do not need a very descriptive docstring or extensive input checking.
Hint: The functions np.sort() and np.arange() may be useful.
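Following the hint, one possible approach is sketched below. It is not necessarily the intended solution, and you may want to attempt the exercise before reading it. The idea is that once the data are sorted, the k-th smallest point (counting from 1) has k points less than or equal to it, so its y-value is k / n. The sample call at the end uses made-up normal draws as in the cell above.

```python
import numpy as np


def ecdf_vals(data):
    """Return x and y values for plotting a dot-style ECDF."""
    # Sorted data are the x-values
    x = np.sort(data)

    # The k-th sorted point gets y = k / n, for k = 1, ..., n
    y = np.arange(1, len(data) + 1) / len(data)

    return x, y


# Example call with dummy data
rg = np.random.default_rng()
x, y = ecdf_vals(rg.normal(0, 1, size=100))
```

Like data.rank(method='first'), this sketch gives tied values distinct y-values rather than a shared one.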