Exercise 4.2: Hacker stats with bee sperm data


Neonicotinoid pesticides are thought to have inadvertent effects on service-providing insects such as bees. A study of this was featured in the New York Times in 2016. The original paper is Straub, et al., Proc. Royal Soc. B 283(1835): 20160506. Straub and coworkers put their data in the Dryad repository, which means we can work with it!

(Do you see a trend here? If you want people to think deeply about your results, explore them, learn from them, further science with them, make your data publicly available. Strongly encourage the members of your lab to do the same.)

We will look at the weight of drones (male bees) using the data set stored in ~/git/bootcamp/data/bee_weight.csv and the sperm quality of drone bees using the data set stored in ~/git/bootcamp/data/bee_sperm.csv.

a) Load the drone weight data in as a Pandas DataFrame. Note that the unit of the weight is milligrams (mg).

b) Plot ECDFs of the drone weight for control and also for those exposed to pesticide. Do you think there is a clear difference?

c) Compute the mean drone weight for control and those exposed to pesticide. Compute 95% bootstrap confidence intervals on the mean.

d) Repeat parts (a)-(c) for drone sperm. Use the 'Quality' column as your measure. This is defined as the percent of sperm that are alive in a 500 µL sample.

e) As you have seen in your analysis in part (d), both the control and pesticide treatments have some outliers with very low sperm quality. This can tug heavily on the mean. So, get 95% bootstrap confidence intervals for the median sperm quality of the two treatments.

Solution

[1]:
import numpy as np
import pandas as pd

import iqplot

import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()
Loading BokehJS ...

a) After inspecting the data set, we see that the comments are given by #, and this is a standard CSV file.

[2]:
df_weight = pd.read_csv('data/bee_weight.csv', comment='#')

b) We will plot the ECDFs with confidence intervals to help visualize the difference between them.

[3]:
p = iqplot.ecdf(
    data=df_weight,
    q='Weight',
    cats='Treatment',
    conf_int=True,
    x_axis_label='weight (mg)',
)

bokeh.io.show(p)

There is strong overlap of the ECDFs, which suggests there is no difference between pesticide and control. Now, let’s compute confidence intervals on the mean weight of the drones.

c) First, we’ll get point estimates for the mean weight under control and pesticide conditions.

[4]:
mean_control = np.mean(df_weight.loc[df_weight['Treatment']=='Control', 'Weight'])
mean_pest = np.mean(df_weight.loc[df_weight['Treatment']=='Pesticide', 'Weight'])

print('Mean control:  ', mean_control, 'mg')
print('Mean pesticide:', mean_pest, 'mg')
Mean control:   277.0563 mg
Mean pesticide: 278.27333333333326 mg

The means are really close. Let’s now compute the confidence intervals. We’ll use the bootstrap replicate generating function we wrote in Exercise 4.1.

[5]:
def draw_bs_reps(data, func, rg, size=1, args=()):
    return np.array(
        [
            func(rg.choice(data, replace=True, size=len(data)), *args)
            for _ in range(size)
        ]
    )


rg = np.random.default_rng()

# Draw 10,000 bootstrap reps for both.
bs_reps_control = draw_bs_reps(
    df_weight.loc[df_weight["Treatment"] == "Control", "Weight"].values,
    np.mean,
    rg,
    size=10000,
)
bs_reps_pest = draw_bs_reps(
    df_weight.loc[df_weight["Treatment"] == "Pesticide", "Weight"].values,
    np.mean,
    rg,
    size=10000,
)

Now, we can use np.percentile() to compute the 95% confidence interval.

[6]:
conf_int_control = np.percentile(bs_reps_control, [2.5, 97.5])
conf_int_pest = np.percentile(bs_reps_pest, [2.5, 97.5])

print('Confidence interval for control:', conf_int_control)
print('Confidence interval for pesticide:', conf_int_pest)
Confidence interval for control: [274.69325875 279.37617   ]
Confidence interval for pesticide: [275.06833333 281.51764583]

They have nearly the same confidence interval, as we would expect from the ECDFs.

d) We just go through the same steps as before. First, the ECDF.

[7]:
# Load data set
df_sperm = pd.read_csv('data/bee_sperm.csv', comment='#')

# Make ECDF
p = iqplot.ecdf(
    data=df_sperm,
    q='Quality',
    cats='Treatment',
    conf_int=True,
    x_axis_label='quality',
)

p.legend.location = 'top_left'

bokeh.io.show(p)

We have some very low quality samples from both, but it is pretty clear that on a whole the pesticide samples have much lower sperm quality. Let’s compute the confidence interval on the mean. We have to be careful, though, because there are some NaNs in the data set, so we have to use dropna().

[8]:
# Draw 10,000 bootstrap reps for both.
bs_reps_control = draw_bs_reps(
    df_sperm.loc[df_sperm["Treatment"] == "Control", "Quality"].dropna().values,
    np.mean,
    rg,
    size=10000,
)
bs_reps_pest = draw_bs_reps(
    df_sperm.loc[df_sperm["Treatment"] == "Pesticide", "Quality"].dropna().values,
    np.mean,
    rg,
    size=10000,
)

# Compute and print confidence interval
conf_int_control = np.percentile(bs_reps_control, [2.5, 97.5])
conf_int_pest = np.percentile(bs_reps_pest, [2.5, 97.5])

print("Confidence interval for control:", conf_int_control)
print("Confidence interval for pesticide:", conf_int_pest)
Confidence interval for control: [84.26541319 89.46672476]
Confidence interval for pesticide: [73.98761529 82.00900716]

The confidence intervals of the mean do not overlap, further confirming that the pesticide-tested drones have lower sperm quality.

e) Now, let’s try bootstrapping the median. This is the same procedure as before, except we just put np.median for our function where we have np.median.

[9]:
# Draw 10,000 bootstrap reps for both.
bs_reps_control = draw_bs_reps(
    df_sperm.loc[df_sperm["Treatment"] == "Control", "Quality"].dropna().values,
    np.median,
    rg,
    size=10000,
)
bs_reps_pest = draw_bs_reps(
    df_sperm.loc[df_sperm["Treatment"] == "Pesticide", "Quality"].dropna().values,
    np.median,
    rg,
    size=10000,
)

# Compute and print confidence interval
conf_int_control = np.percentile(bs_reps_control, [2.5, 97.5])
conf_int_pest = np.percentile(bs_reps_pest, [2.5, 97.5])

print("Confidence interval for control:", conf_int_control)
print("Confidence interval for pesticide:", conf_int_pest)
Confidence interval for control: [89.70654852 94.2353021 ]
Confidence interval for pesticide: [80.0560062  85.84421182]

Again, we see that the confidence intervals do not overlap. The median is of course higher than the mean, since the low-quality outliers have little effect on the median.

Computing environment

[10]:
%load_ext watermark
%watermark -v -p numpy,pandas,bokeh,iqplot,jupyterlab
Python implementation: CPython
Python version       : 3.9.12
IPython version      : 8.3.0

numpy     : 1.21.5
pandas    : 1.4.2
bokeh     : 2.4.2
iqplot    : 0.2.4
jupyterlab: 3.3.2