Lesson 23: Random number generation


[1]:
import numpy as np
import pandas as pd
import scipy.stats

import iqplot

import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()
Loading BokehJS ...

Random number generation (RNG) is the process by which a string of random numbers may be drawn. Of course, the numbers are not completely random for several reasons.

  1. They are drawn from a probability distribution. The most common one is the uniform distribution on the domain \(0 \le x < 1\), i.e., random numbers between zero and one. (“Completely random” does not make sense because of the infinite magnitude of numbers.)

  2. In most computer applications, including the ones we’ll use in bootcamp, the random numbers are actually pseudorandom. They depend entirely on an input seed and are then generated by a deterministic algorithm from that seed.

This is a bit academic. Let’s jump right in generating random numbers. Much of the random number generation functionality you will need is in the np.random module. Let’s start by generating random numbers from a Uniform distribution.

[2]:
np.random.uniform(low=0, high=1, size=10)
[2]:
array([0.23709656, 0.95933636, 0.71928912, 0.99192717, 0.86348479,
       0.35528646, 0.14223835, 0.9889348 , 0.98321084, 0.26443901])

The function uniform() in the np.random module generates random numbers on the interval [low, high) from a Uniform distribution. The size kwarg is how many random numbers you wish to generate, and is a kwarg in all of Numpy’s random number generators. The random numbers are returned as a NumPy array.

We can check to make sure it is appropriately drawing random numbers out of the uniform distribution by plotting the cumulative distribution function, just like we did last time. We’ll generate 1,000 random numbers and plot them along with the CDF of a Uniform distribution.

[3]:
# Generate random numbers
x = np.random.uniform(low=0, high=1, size=1000)

# Plot the ECDF of randomly generated numbers
p = iqplot.ecdf(x)

# Overlay the theoretical CDF
p.line(
    x=[0, 1],
    y=[0, 1],
    line_width=2,
    line_color="orange",
)

bokeh.io.show(p)

So, it looks like our random number generator is doing a good job.

Generating random numbers on the uniform interval is one of the most commonly used RNG techniques. In fact, many of the other contexts of RNG are derived from draws from the uniform distribution. For example, you can simulate flipping a biased (unfair) coin.

[4]:
# Generate 20 random numbers on uniform interval
x = np.random.uniform(low=0, high=1, size=20)

# Make the coin flips (< 0.7 means we have a 70% chance of heads)
heads = x < 0.7

# Show which were heads, and count the number of heads
print(heads)
print("\nThere were", np.sum(heads), "heads.")
[ True  True False  True  True  True  True  True False  True  True  True
  True False False False  True False False  True]

There were 13 heads.

Choice of generator

As of version 1.17.3 of Numpy, the algorithm under the hood of calls to functions like np.random.uniform() is the Mersenne Twister Algorithm for generating random numbers. It is a very widely used and reliable method for generating random numbers. However, starting with version 1.17.4, the numpy.random module offers random number generators with better statistical performance, including the PCG64 generator. Going forward, the preferred approach to doing random number generation is to first instantiate a generator of your choice, and then use its methods to generate numbers out of probability distributions.

Let’s set up a PCG64 generator, which is Numpy’s default.

[5]:
rg = np.random.default_rng()

Now that we have the generator, we can use it to draw numbers out of distributions. The syntax is the same as before, except rg replaces np.random.

[6]:
rg.uniform(low=0, high=1, size=20)
[6]:
array([0.15611741, 0.55145422, 0.62460066, 0.66880145, 0.76751957,
       0.6096622 , 0.62798359, 0.06204101, 0.68672077, 0.77944418,
       0.97294838, 0.50955991, 0.78487589, 0.16892321, 0.34470479,
       0.64785548, 0.01584066, 0.58388327, 0.52010477, 0.16652411])

Seeding random number generators

Now, just to demonstrate that random number generation is deterministic, we will explicitly seed the random number generator (which is usually seeded with a number representing the date/time to avoid repeats) to show that we get the same random numbers.

[7]:
# Instantiate generator with a seed
rg = np.random.default_rng(seed=3252)

# Generate random numbers
rg.uniform(size=10)
[7]:
array([0.18866535, 0.04418857, 0.02961285, 0.22083971, 0.43341773,
       0.13166813, 0.42112164, 0.43507845, 0.61380912, 0.30627603])
[8]:
# Re-seed the RNG
rg = np.random.default_rng(seed=3252)

# Generate random numbers
rg.uniform(size=10)
[8]:
array([0.18866535, 0.04418857, 0.02961285, 0.22083971, 0.43341773,
       0.13166813, 0.42112164, 0.43507845, 0.61380912, 0.30627603])

The random number sequence is exactly the same. If we choose a different seed, we get totally different random numbers.

[9]:
# Fit one more fan in the 3252
rg = np.random.default_rng(seed=3253)
rg.uniform(size=10)
[9]:
array([0.31390226, 0.73012457, 0.05800998, 0.01557021, 0.29825701,
       0.10106784, 0.06329107, 0.58614237, 0.52023168, 0.52779988])

If you are writing tests, it is often useful to seed the random number generator to get reproducible results.

Drawing random numbers out of other distributions

We can also draw random numbers from other probability distributions. For example, say we wanted to draw random samples from a Normal distribution with mean μ and standard deviation σ. (We already saw this example when we were looking at histograms, but we repeat it here.)

[10]:
# Set parameters
mu = 10
sigma = 1

# Draw 100000 random samples
x = rg.normal(mu, sigma, size=100000)

# Plot the histogram
p = iqplot.histogram(
    x,
    density=True,
    rug=False,
    y_axis_label="approximate PDF",
)

bokeh.io.show(p)

It looks Normal, but, again, comparing the resulting ECDF is a better way to look at this. We’ll check out the ECDF with 1000 samples so as not to choke the browser. I will also make use of the theoretical CDF for the Normal distribution available from the scipy.stats module.

[11]:
# Compute theoretical CDF
x_theor = np.linspace(6, 14, 400)
y_theor = scipy.stats.norm.cdf(x_theor, mu, sigma)

# Plot the ECDF of randomly generated numbers
p = iqplot.ecdf(x[:1000])

p.line(
    x=x_theor,
    y=y_theor,
    line_width=2,
    line_color="orange",
)

bokeh.io.show(p)

Yup, right on!

Selections from discrete distributions

The random numbers we have generated so far from from continuous probability distributions. We can also draw random numbers from discrete distributions. We already showed how we can do this for “coin flips,” but we can do it for other distributions as well. Say we wanted to draw from a Binomial distribution. We can use rg.binomial().

[12]:
# Draw how many coin flips land heads in 10 files
rg.binomial(10, 0.5)
[12]:
7

There are other discrete distributions we can draw from, such as Binomial, Geometric, Poisson, etc., and the documentation describes how to use them.

Choosing elements from an array

It is often useful to randomly choose elements from an existing array. The rg.choice() function does this. You equivalently could do this using rg.integers(), where the integers represent indices in the array, except rg.choice() has a great keyword argument, replace, which allows random draws with or without replacement. For example, say you had 52 samples that you wanted to send to a facility for analysis, but you can only afford to send 20. If we used rg.integers(), we might have a problem.

[13]:
rg = np.random.default_rng(seed=3252)
np.sort(rg.integers(0, 52, size=20))
[13]:
array([ 1,  2,  2,  6,  7,  9, 11, 11, 11, 13, 15, 18, 21, 22, 22, 24, 31,
       34, 37, 39])

Samples 2, 11, and 22 we each selected twice!

[14]:
rg.choice(np.arange(52), size=20, replace=False)
[14]:
array([ 5, 38, 44, 12, 49, 10, 47, 46, 40, 23, 18,  1, 13, 21, 42, 28, 29,
       37, 26, 32])

Now, because we chose replace=False, we do not get any repeats.

Generating random sequences

Because it works with selecting characters as well as numbers, we can use the rg.choice() function to generate random DNA sequences.

[15]:
''.join(rg.choice(list('ATGC'), replace=True, size=70))
[15]:
'CAGGAGTCGCTGGGATAACAATTGTGACCTATGTAACTCAGCGAAGAGACTCGGGCCCGACCCACTAAAG'

Shuffling an array

Similarly, the rg.permutation() function is useful. It takes the entries in an array and shuffles them! Let’s shuffle a deck of cards.

[16]:
rg.permutation(np.arange(52))
[16]:
array([51, 36, 30, 35,  2,  9, 19, 40, 42, 32, 21, 45, 43, 26, 16, 27,  0,
       25, 41, 33, 24, 47, 10, 11,  5, 14, 23, 17, 29, 12, 13, 28, 48, 22,
        8, 46,  7, 49, 44, 31,  6, 34, 50,  1, 18, 20, 15, 39,  4, 38,  3,
       37])

When do we need RNG?

Answer: VERY OFTEN! We will see many examples in the next lessons and in the exercises.

In many ways, probability is the language of biology. Molecular processes have energetics that are comparable to the thermal energy, which means they are always influenced by random thermal forces. The processes of the central dogma, including DNA replication, are no exceptions. This gives rise to random mutations, which are central to understanding how evolution works. If we want to understand them, it is often useful to use random number generators to model the processes.

RNG also comes up A LOT in data analysis, which we will see in the lessons on hacker stats.

Computing environment

[17]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,bokeh,iqplot,jupyterlab
Python implementation: CPython
Python version       : 3.9.12
IPython version      : 8.3.0

numpy     : 1.21.5
scipy     : 1.7.3
pandas    : 1.4.2
bokeh     : 2.4.2
iqplot    : 0.2.4
jupyterlab: 3.3.2