# Exercise 8.3: Understanding and building ECDFs

```
import numpy as np
import pandas as pd
import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()
```

As a reminder, the empirical cumulative distribution function (ECDF) for a set of data points, evaluated at x, is

ECDF(x) = fraction of data points ≤ x.
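As a minimal sketch of this definition, the hypothetical helper `ecdf_at` below (it is not part of the lesson's code) evaluates the ECDF at a single point by directly counting the fraction of data points that are less than or equal to `x`. The data set is made up for illustration.

```python
import numpy as np


def ecdf_at(data, x):
    """Fraction of entries in `data` that are <= x (hypothetical helper)."""
    data = np.asarray(data)
    # The mean of a Boolean array is the fraction of True entries
    return np.mean(data <= x)


# Small made-up data set, chosen only for illustration
example = np.array([2.0, 5.0, 1.0, 4.0])

print(ecdf_at(example, 0.0))   # 0.0, no data points are <= 0
print(ecdf_at(example, 2.0))   # 0.5, two of the four points are <= 2
print(ecdf_at(example, 10.0))  # 1.0, all data points are <= 10
```

Evaluating the definition one point at a time like this is fine for building intuition, but it rescans the whole array for every `x`, which is why sorting the data once is the more practical route for plotting.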

In a previous lesson, we saw that if we have a Pandas `Series` stored in a variable named `data`, we can compute the values for the ECDF as `data.rank(method='first') / len(data)`. With the data in a Pandas `Series`, we can use this method to make an ECDF plot as follows (which is more or less what happens under the hood in `bokeh_catplot.ecdf()`).

```
# Dummy data set as a Pandas Series
rg = np.random.default_rng()
data = pd.Series(rg.normal(0, 1, size=100))
# Compute y-values for ECDF
ecdf_y = data.rank(method='first') / len(data)
# Make the plot
p = bokeh.plotting.figure(
    frame_height=200,
    frame_width=300,
    x_axis_label='x',
    y_axis_label='ECDF',
)
p.circle(data, ecdf_y)
bokeh.io.show(p)
```

The `rank()` method is not available for NumPy arrays. What we would like is a function we can call with a NumPy array as input that returns the x and y values for plotting, without first having to convert the array to a Pandas `Series`. Plus, doing this exercise helps you understand what an ECDF is, and will therefore help you interpret one.

Write a function with call signature

```
ecdf_vals(data)
```

which takes a one-dimensional NumPy array of data and returns the `x` and `y` values for plotting a “dot-style” ECDF. That is, each dot has a `y` value given by the ECDF evaluated at `x`.

When writing a function, you should generally include a detailed docstring and check the inputs. Here, however, the focus is on your understanding of ECDFs and on developing skills with NumPy, so you do not need a very descriptive docstring or much input checking.

*Hint*: The functions `np.sort()` and `np.arange()` may be useful.

## Solution

To construct the ECDF, we can sort the x-values. Given that they are sorted, the y-value of the ECDF at the nth data point in the sorted x-array (where n starts at 1) is n/N, where N is the total number of data points.

```
def ecdf_vals(data):
    """Return x and y values for an ECDF."""
    # Sorted data are the x-values; the nth sorted point has y = n / N
    return np.sort(data), np.arange(1, len(data) + 1) / len(data)
```
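As a quick sanity check, the sketch below applies the function to a small made-up data set and compares the result with the definition of the ECDF, that is, the fraction of data points less than or equal to each x-value. The function is repeated so the block runs on its own; the array `example` and the name `y_from_definition` are illustrative assumptions, not part of the exercise.

```python
import numpy as np


def ecdf_vals(data):
    """Return x and y values for an ECDF."""
    return np.sort(data), np.arange(1, len(data) + 1) / len(data)


# Small made-up data set with no repeated values
example = np.array([2.0, 5.0, 1.0, 4.0])
x, y = ecdf_vals(example)

# Cross-check against the definition: fraction of data points <= x
y_from_definition = np.array([np.mean(example <= xi) for xi in x])

print(x)  # the data in sorted order: 1, 2, 4, 5
print(y)  # the corresponding ECDF values: 0.25, 0.5, 0.75, 1
print(np.allclose(y, y_from_definition))  # True
```

Note that the agreement is exact only when all values are distinct. If a value is repeated, the dot-style ECDF gives each copy its own y-value, and only the topmost dot at that x equals the fraction of data points ≤ x.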

## Computing environment

```
%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,jupyterlab
```

```
CPython 3.7.7
IPython 7.16.1
numpy 1.18.5
pandas 0.24.2
bokeh 2.1.1
jupyterlab 2.1.5
```