{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 4.1: Writing functions for bootstrap replicates\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It will be useful to have some functions in your arsenal to do statistical inference with bootstrapping. In this exercise, you will write some handy functions. You should test these functions out on real data. \n", "\n", "**a)** In the lessons, we wrote a function, `draw_bs_rep()` to draw a single bootstrap replicate out of a single set of repeated measurements. Update this function to have a `size` keyword argument so that you can draw many bootstrap replicates and return a Numpy array of the replicates. Here are step-by-step instructions.\n", "\n", "1. Define a function with call signature `draw_bs_reps(data, func, rg, size=1, args=())`, where `func` is a function that takes in an array and returns a statistic; it has call signature `func(data, *args)`. Examples that could be passed in as `func` are `np.mean`, `np.std`, `np.median`, or a user-defined function. `rg` is an instance of a Numpy random number generator. `size` is the number of replicates to generate.\n", "2. Write a good doc string.\n", "3. Define `n` to be the length of the input `data` array.\n", "4. Use a list comprehension to compute a list of bootstrap replicates.\n", "5. Return the replicates as a Numpy array.\n", "\n", "**b)** Write a function analogous to the one in part (a) except for pairs bootstrap. The call signature should be `draw_bs_pairs(data1, data2, func, rg, size=1, args=())`, where `func` has call signature `func(data1, data2, *args)`.\n", "\n", "You will want to include these in a module so you can use it over and over again. I will not be providing this functionality in the `bootcamp_utils` module; I want you to write this yourself. (Or, you can install the [dc_stat_think module](https://github.com/justinbois/dc_stat_think) that I wrote using `pip`, which has this and many other useful functions for bootstrapping.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**a)** I show the function below." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def draw_bs_reps(data, func, rg, size=1, args=()):\n", " \"\"\"\n", " Generate bootstrap replicates out of `data` using `func`.\n", "\n", " Parameters\n", " ----------\n", " data : array_like\n", " One-dimensional array of data.\n", " func : function\n", " Function, with call signature `func(data, *args)` to compute\n", " replicate statistic from resampled `data`.\n", " rg : random number generator instance\n", " Either `np.random` or the result of `np.random.default_rng()`.\n", " size : int, default 1\n", " Number of bootstrap replicates to generate.\n", " args : tuple, default ()\n", " Arguments to be passed to `func`.\n", "\n", " Returns\n", " -------\n", " output : ndarray\n", " Bootstrap replicates computed from `data` using `func`.\n", " \"\"\"\n", " return np.array(\n", " [\n", " func(rg.choice(data, replace=True, size=len(data)), *args)\n", " for _ in range(size)\n", " ]\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try this function out on the beak depth data from 1975 to get bootstrap replicates of the mean." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[8.8478046 9.07747701]\n" ] } ], "source": [ "rg = np.random.default_rng()\n", "\n", "df = pd.read_csv('data/grant_complete.csv', comment='#')\n", "inds = (df['year'] == 1975) & (df['species'] == 'scandens')\n", "bd_1975 = df.loc[inds, 'beak depth (mm)'].values\n", "\n", "# Compute replicates\n", "bs_reps = draw_bs_reps(bd_1975, np.mean, rg, size=2000)\n", "\n", "# 95% confidence interval\n", "print(np.percentile(bs_reps, [2.5, 97.5]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nice!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**b)** I show the function below." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def draw_bs_pairs(data1, data2, func, rg, size=1, args=()):\n", " \"\"\"\n", " Perform pairs bootstrap for single statistic.\n", "\n", " Parameters\n", " ----------\n", " data1 : array_like\n", " First data set. Must be same length as `data2`.\n", " data2 : array_like\n", " Second data set. Must be same length as `data1`.\n", " func : function\n", " Function, with call signature `func(data1, data2, *args)` to \n", " compute replicate statistic from pairs bootstrap sample. It \n", " must return a single, scalar value.\n", " rg : random number generator instance\n", " Either `np.random` or the result of `np.random.default_rng()`.\n", " size : int, default 1\n", " Number of pairs bootstrap replicates to draw.\n", " args : tuple, default ()\n", " Arguments to be passed to `func`.\n", "\n", " Returns\n", " -------\n", " output : ndarray\n", " Bootstrap replicates.\n", " \"\"\"\n", " n = len(data1)\n", "\n", " # Set up array of indices to sample from\n", " inds = np.arange(n)\n", "\n", " # Initialize replicates\n", " bs_replicates = np.empty(size)\n", "\n", " # Generate replicates\n", " for i in range(size):\n", " bs_inds = rg.choice(inds, n)\n", " bs_1, bs_2 = data1[bs_inds], data2[bs_inds]\n", " bs_replicates[i] = func(bs_1, bs_2, *args)\n", "\n", " return bs_replicates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take it for a spin with the correlation coefficient between the length and depth of beaks in 1975." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.45686897 0.75179156]\n" ] } ], "source": [ "bl_1975 = df.loc[inds, 'beak length (mm)'].values\n", "\n", "def corrcoef(data1, data2):\n", " cov = np.cov(data1, data2)\n", " return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])\n", "\n", "\n", "# Compute replicates\n", "bs_reps = draw_bs_pairs(bd_1975, bl_1975, corrcoef, rg, size=2000)\n", "\n", "# 95% confidence interval\n", "print(np.percentile(bs_reps, [2.5, 97.5]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks good!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing environment" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python implementation: CPython\n", "Python version : 3.9.12\n", "IPython version : 8.3.0\n", "\n", "numpy : 1.21.5\n", "pandas : 1.4.2\n", "jupyterlab: 3.3.2\n", "\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -v -p numpy,pandas,jupyterlab" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }