Exercise 4.1: Long-term trends in hybridization of Darwin finches

Peter and Rosemary Grant have been working on the Galápagos island of Daphne Major for over forty years. During this time, they have collected lots and lots of data about physiological features of finches. In 2014, they published a book with a summary of some of their major results (Grant P. R., Grant B. R., 40 years of evolution. Darwin’s finches on Daphne Major Island, Princeton University Press, 2014). They made their data from the book publicly available via the Dryad Digital Repository.

We will investigate their measurements of beak depth (the distance, top to bottom, of a closed beak) and beak length (base to tip on the top) of Darwin’s finches. We will look at data from two species, Geospiza fortis and Geospiza scandens. The Grants provided data on the finches of Daphne for the years 1973, 1975, 1987, 1991, and 2012. I have included the data in the files grant_1973.csv, grant_1975.csv, grant_1987.csv, grant_1991.csv, and grant_2012.csv. They are in almost exactly the same format is in the Dryad repository; I have only deleted blank entries at the end of the files.

Note: If you want to skip the wrangling (which is very valuable experience), you can go directly to part (d). You can load in the data frame you generate in parts (a) through (c) from the file ~/git/bootcamp/data/grant_complete.csv.

a) Load each of the files into separate Pandas data frames. You might want to inspect the file first to make sure you know what character the comments start with and if there is a header row.

b) We would like to merge these all into one data frame. The problem is that they have different header names, and only the 1973 file has a year entry (called yearband). This is common with real data. It is often a bit messy and requires some wrangling.

First, change the name of the yearband column of the 1973 data to year. Also, make sure the year format is four digits, not two!
Next, add a year column to the other four data frames. You want tidy data, so each row in the data frame should have an entry for the year.
Change the column names so that all the data frames have the same column names. I would choose column names

['band', 'species', 'beak length (mm)', 'beak depth (mm)', 'year']
Concatenate the data frames into a single data frame. Be careful with indices! If you use pd.concat(), you will need to use the ignore_index=True kwarg. You might also need to use the axis kwarg.

c) The band field gives the number of the band on the bird’s leg that was used to tag it. Are some birds counted twice? Are they counted twice in the same year? Do you think you should drop duplicate birds from the same year? How about different years? My opinion is that you should drop duplicate birds from the same year and keep the others, but I would be open to discussion on that. To practice your Pandas skills, though, let’s delete only duplicate birds from the same year from the data frame. When you have made this data frame, save it as a CSV file.

Hint: The data frame methods duplicated() and drop_duplicates() will be useful.

After doing this work, it is worth saving your tidy data frame in a CSV document. To this using the to_csv() method of your data frame. Since the indices are uninformative, you should use the index=False kwarg. (I have already done this and saved it as ~/git/bootcamp/data/grant_complete.csv, which will help you do the rest of the exercise if you have problems with this part.)

d) Make a plots exploring how beak depth changes over time for each species. Think about what might be effective ways to display the data.

e) It is informative to plot the measurement of each bird’s beak as a point in the beak depth-beak length plane. For the 1987 data, plot beak depth vs. beak width for Geospiza fortis and for Geospiza scandens. The function you wrote in Exercise 3.5 will be useful to do this.

f) Do part (d) again for all years. Hint: To display all of the plots, check out the Bokeh documentation for layouts. In your plots, make sure all plots have the same range on the axes. If you want to set two plots, say p1 and p2 to have the same axis ranges, you can do the following.

p1.x_range = p2.x_range
p1.y_range = p2.y_range

Solution

[1]:

import glob

import pandas as pd

import iqplot
import colorcet

import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()

Loading BokehJS ...

Upon inspecting the files, we see that the comment character is, as usual, #. There is also a header row in each file, as the first row, so they files are pretty standard. It is important to note that not all of the column headings are the same, but the units of length in the measurements is millimeters. Let’s go ahead and load them in! We will load them into a list. I will use the glob module to load in all the csv files with the substring 'grant'.

[2]:

# Get list of CSV files
csv_list = glob.glob('data/grant*19*.csv') + glob.glob('data/grant*20*.csv')

# Sort the list so we keep the years in order
csv_list.sort()

# Initialize list of DataFrames
df_list = []

# Load in each sequentially.
for csv_file in csv_list:
    # Read in DataFrame
    df = pd.read_csv(csv_file, comment='#')

    # Place in list
    df_list.append(df)

Let’s take a quick look at the first entry in the list, just to make sure it loaded ok.

[3]:

df_list[0].head()

[3]:

	band	species	yearband	beak length	beak depth
0	20123	fortis	73	9.25	8.05
1	20126	fortis	73	11.35	10.45
2	20128	fortis	73	10.15	9.55
3	20129	fortis	73	9.95	8.75
4	20133	fortis	73	11.55	10.15

Looks good!

b) Before moving on, we need to know what year is associated with each data frame. Fortunately, we sorted the list of CSV files, we have the years in order. We can extract the year from the file names.

[4]:

# Initialize years
years = []
for csv_file in csv_list:
    years.append(int(csv_file[-8:-4]))

Let’s check to make sure we got them, and sort them so we know the order.

[5]:

years

[5]:

[1973, 1975, 1987, 1991, 2012]

Looks good. Now, we’ll proceed with the steps we need to take to clean things up. First, we’ll change the 'yearband' column in the data frame from 1973 to 'year', and change its year from 73 to 1973.

[6]:

df_list[0].columns

[6]:

Index(['band', 'species', 'yearband', 'beak length', 'beak depth'], dtype='object')

[7]:

# Rename to year
df_list[0] = df_list[0].rename(columns={'yearband': 'year'})

# No worries about Y2K
df_list[0]['year'] += 1900

# Check it out
df_list[0].head()

[7]:

	band	species	year	beak length	beak depth
0	20123	fortis	1973	9.25	8.05
1	20126	fortis	1973	11.35	10.45
2	20128	fortis	1973	10.15	9.55
3	20129	fortis	1973	9.95	8.75
4	20133	fortis	1973	11.55	10.15

Great! Let’s proceed to add a year column to all of the other data frames. As we do it, we’ll just reassign the 1973 year in that data frame, but that’s no big deal.

[8]:

for i, df in enumerate(df_list):
    df_list[i]['year'] = years[i]

Let’s check one to make sure it makes sense.

[9]:

df_list[3].head()

[9]:

	band	species	blength	bdepth	year
0	2639	fortis	10.30	8.95	1991
1	2666	fortis	12.81	9.30	1991
2	2753	fortis	10.89	10.35	1991
3	2776	fortis	11.30	10.00	1991
4	4229	fortis	10.05	8.62	1991

Looks good. Now, we need to change the column names so they are all the same for the respective data frames. We have few enough data frames that we could do that by hand, but it is more instructive (and re-usable) if we automate it. We will write a function to rename the columns. It first sniffs out which column should be 'band', which should be 'species', and so on. We can do this with Pandas’s convenient .str methods, which enable us to use string methods on many entries at once. This is perhaps best seen by example.

[10]:

# Choose a DataFrame to try it on.
df = df_list[3]

# Look at the columns
df.columns

[10]:

Index(['band', 'species', 'blength', 'bdepth', 'year'], dtype='object')

Now, if we are interested in the beak length column, we want to find a column heading that contains 'len', since pretty much anything that is about beak length would have the substring. We can use the convenient str.contains() method.

[11]:

# See which column had 'len' in it
df.columns.str.contains('len')

[11]:

array([False, False,  True, False, False])

Now, we can slice out the column heading that has 'len' in it.

[12]:

df.columns[df.columns.str.contains('len')]

[12]:

Index(['blength'], dtype='object')

Finally, we just want the string, so we do

[13]:

df.columns[df.columns.str.contains('len')][0]

[13]:

'blength'

We’ll use this to identify the current column headings and then change them to what we want.

[14]:

def rename_cols(df):
    """Rename columns so all DataFrames have same column headings."""
    # Sniff out the key names from names that are close
    band_key = df.columns[df.columns.str.contains("and")][0]
    species_key = df.columns[df.columns.str.contains("ecies")][0]
    length_key = df.columns[df.columns.str.contains("len")][0]
    depth_key = df.columns[df.columns.str.contains("dep")][0]
    year_key = df.columns[df.columns.str.contains("year")][0]

    # Rename the columns using renaming dictionary
    return df.rename(
        columns={
            band_key: "band",
            species_key: "species",
            depth_key: "beak depth (mm)",
            length_key: "beak length (mm)",
            year_key: "year",
        }
    )

Now, we can loop through the data frames and rename the columns.

[15]:

for i, df in enumerate(df_list):
    df_list[i] = rename_cols(df)

# Check the result
df_list[3].head()

[15]:

	band	species	beak length (mm)	beak depth (mm)	year
0	2639	fortis	10.30	8.95	1991
1	2666	fortis	12.81	9.30	1991
2	2753	fortis	10.89	10.35	1991
3	2776	fortis	11.30	10.00	1991
4	4229	fortis	10.05	8.62	1991

Finally, we do the concatenation using pd.concat(). We want to ignore the indices because they are not important identifiers.

[16]:

df = pd.concat(df_list, axis=0, ignore_index=True, sort=True)

# Take a look
df

[16]:

	band	beak depth (mm)	beak length (mm)	species	year
0	20123	8.05	9.25	fortis	1973
1	20126	10.45	11.35	fortis	1973
2	20128	9.55	10.15	fortis	1973
3	20129	8.75	9.95	fortis	1973
4	20133	10.15	11.55	fortis	1973
...	...	...	...	...	...
2299	21295	9.30	14.20	scandens	2012
2300	21297	9.80	13.00	scandens	2012
2301	21340	8.90	14.60	scandens	2012
2302	21342	9.80	13.10	scandens	2012
2303	21347	9.10	15.20	scandens	2012

2304 rows × 5 columns

Great! We now have one convenient tidy data frame to work with.

c) First, let’s look for duplicate band numbers. There are many, so we’ll just write out how many. The df.duplicated() method returns True for each row if it is a duplicate. We will get all duplicates in the 'band' column, and then get the unique values in the list of all duplicated. This will tell us how many birds were measured more than once.

[17]:

# Stats about how many birds were measured more than once
print(
    "There were",
    len(df["band"][df["band"].duplicated()].unique()),
    "birds that were measured more than once.",
)

print("There were", len(df["band"].unique()), "total birds measured.")

There were 350 birds that were measured more than once.
There were 1954 total birds measured.

So, most birds were only measured once. Nonetheless, let’s eliminate duplicates of birds that were measured twice. When we drop the duplicates, we will keep the first measurement.

[18]:

# Drop all rows with matching year and band (keep first)
df = df.drop_duplicates(subset=['year', 'band'])

Finally, we will save the data frame as a CSV file using the df.to_csv() method. We do not want to print the indices (they are meaningless).

[19]:

df.to_csv('data/grant_complete.csv', index=False)

d) As a first pass, I will use a strip plot showing both species by year.

[20]:

p = iqplot.strip(
    data=df,
    q='beak depth (mm)',
    cats=['year', 'species'],
    color_column='species',
    spread='jitter',
    marker_kwargs=dict(alpha=0.3),
    frame_height=550,
)

bokeh.io.show(p)

The plot immediately shows that there are more G. fortis measurements and they are more broadly distributed. There seems to be a big jump toward shallower beaks from 1991 to 2012 for G. fortis, but the G. scandens beak depth stays relatively constant.

To better see the distributions, we can plot ECDFs.

[21]:

palette_fortis = bokeh.palettes.Blues9
p_fortis = iqplot.ecdf(
    data=df.loc[df['species']=='fortis', :],
    q='beak depth (mm)',
    cats='year',
    palette=palette_fortis,
    frame_height=250,
    title='fortis',
)

palette_scandens = bokeh.palettes.Oranges9
p_scandens = iqplot.ecdf(
    data=df.loc[df['species']=='scandens', :],
    q='beak depth (mm)',
    cats='year',
    palette=palette_scandens,
    frame_height=250,
    title='scandens',
)

p_scandens.x_range = p_fortis.x_range

bokeh.io.show(bokeh.layouts.column([p_fortis, p_scandens]))

The big jump from 1991 to 2012 for G. fortis is really clear in the ECDF. In both cases, there appears to be a jump toward shallower beaks between 1973 and 1975, but there are not very many data points in 1973, so it is difficult to tell. (We will learn how to quantify this uncertainty in future lessons when we do bootstrapping.)

e) Now let’s make a plot of the 1987 data in the beak depth-beak length plane. We can use the function we wrote in Exercise 3.5.

[22]:

def scatter(
    data=None,
    x=None,
    y=None,
    cat=None,
    p=None,
    palette=None,
    show_legend=True,
    click_policy="hide",
    marker_kwargs={},
    **kwargs,
):
    """
    Parameters
    ----------
    df : Pandas DataFrame
        DataFrame containing tidy data for plotting.
    x : hashable
        Name of column to use as x-axis.
    y : hashable
        Name of column to use as y-axis.
    cat : hashable
        Name of column to use as categorical variable.
    p : bokeh.plotting.Figure instance, or None (default)
        If None, create a new figure. Otherwise, populate the existing
        figure `p`.
    palette : list of strings of hex colors, or single hex string
        If a list, color palette to use. If a single string representing
        a hex color, all glyphs are colored with that color. Default is
        the Glasbey Catagory 10.
    show_legend : bool, default False
        If True, show legend.
    tooltips : list of 2-tuples
        Specification for tooltips as per Bokeh specifications. For
        example, if we want `col1` and `col2` tooltips, we can use
        `tooltips=[('label 1': '@col1'), ('label 2': '@col2')]`. Ignored
        if `formal` is True.
    show_legend : bool, default False
        If True, show a legend.
    click_policy : str, default "hide"
        How to display points when their legend entry is clicked.
    marker_kwargs : dict
        kwargs to be passed to `p.scatter()` when making the scatter plot.
    kwargs
        Any kwargs to be passed to `bokeh.plotting.figure()` when making
        the plot.

    Returns
    -------
    output : bokeh.plotting.Figure instance
        Plot populated with jitter plot or box plot.
    """
    # Automatically name the axes
    if "x_axis_label" not in kwargs:
        kwargs["x_axis_label"] = x
    if "y_axis_label" not in kwargs:
        kwargs["y_axis_label"] = y

    # Default palette
    if palette is None:
        palette = colorcet.b_glasbey_category10
    elif type(palette) == str:
        palette = [palette]

    # Instantiate figure
    if 'toolbar_location' not in kwargs:
        kwargs['toolbar_location'] = 'above'
    if p is None:
        p = bokeh.plotting.figure(**kwargs)

    # Build plot (not using color factors) to enable click policies
    legend_items = []
    for i, (name, g) in enumerate(data.groupby(cat, sort=False)):
        marker_kwargs["color"] = palette[i % len(palette)]
        legend_items.append((str(name), [p.scatter(source=g, x=x, y=y, **marker_kwargs)]))

    if show_legend:
        # Add the legend
        legend = bokeh.models.Legend(items=legend_items, click_policy=click_policy)

        # Add the legend to the right of the plot
        p.add_layout(legend, 'right')

    return p

f) To generate all of the plots, we can loop through a GroupBy object.

[23]:

plots = []
for name, group in df.groupby('year'):
    p = scatter(
        data=group,
        x='beak length (mm)',
        y='beak depth (mm)',
        cat='species',
        show_legend=False,
        marker_kwargs=dict(alpha=0.3),
        title=str(name),
        frame_height=250,
        frame_width=250,
    )
    plots.append(p)

# Link the ranges
for i in range(1, len(plots)):
    plots[i].x_range = plots[0].x_range
    plots[i].y_range = plots[0].y_range

bokeh.io.show(bokeh.layouts.column(plots))

When we look at the data this way, we see the the species coming together.

Computing environment

[24]:

%load_ext watermark
%watermark -v -p pandas,bokeh,iqplot,jupyterlab

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.20.0

pandas    : 2.2.1
bokeh     : 3.4.1
iqplot    : 0.3.6
jupyterlab: 4.0.13