# Exercise 3

(c) 2019 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).

This document was prepared at [Caltech](http://www.caltech.edu) with financial support from the [Donna and Benjamin M. Rosen Bioengineering Center](http://rosen.caltech.edu).

<img src="caltech_rosen.png">

*This exercise was generated from a Jupyter notebook.  You can download the notebook [here](e03.ipynb).*

<hr />

In [1]:
import pandas as pd

<hr />

## Exercise 3.0: Complete practice exercises

Complete the practice exercises from [lesson 21](../lessons/l21_practice_with_pandas.html) and [lesson 24](../lessons/l24_practice_with_pandas_and_bokeh.html).

<br />

## Exercise 3.1: Automating scatter plots

Write a function that takes a tidy data frame as input, generates a scatter plot from two of its columns, and colors the glyphs according to a third column that contains a categorical variable. The minimal call signature (you can add other kwargs if you want) should be

```python
scatter(data, cat, x, y)
```

You will of course test out your function while writing it, and the next problems give you lots of opportunities to use it.

<br />

## Exercise 3.2: Adding data to a DataFrame

In [Lesson 23](../lessons/l23_high_level_plotting.html), we looked at a data set consisting of frog strikes. Recall that the header comments in the data file contained information about the frogs.

In [2]:
!head -20 data/frog_tongue_adhesion.csv

# These data are from the paper,
#   Kleinteich and Gorb, Sci. Rep., 4, 5225, 2014.
# It was featured in the New York Times.
#    http://www.nytimes.com/2014/08/25/science/a-frog-thats-a-living-breathing-pac-man.html
#
# The authors included the data in their supplemental information.
#
# Importantly, the ID refers to the identifites of the frogs they tested.
#   I:   adult, 63 mm snout-vent-length (SVL) and 63.1 g body weight,
#        Ceratophrys cranwelli crossed with Ceratophrys cornuta
#   II:  adult, 70 mm SVL and 72.7 g body weight,
#        Ceratophrys cranwelli crossed with Ceratophrys cornuta
#   III: juvenile, 28 mm SVL and 12.7 g body weight, Ceratophrys cranwelli
#   IV:  juvenile, 31 mm SVL and 12.7 g body weight, Ceratophrys cranwelli
date,ID,trial number,impact force (mN),impact time (ms),impact force / body weight,adhesive force (mN),time frog pulls on target (ms),adhesive force / body weight,adhesive impulse (N-s),total contact area (mm2),contact area without mucus (m

So, each frog has associated with it an age (adult or juvenile), snout-vent-length (SVL), body weight, and species (either cross or *cranwelli*). For a tidy `DataFrame`, we should have a column for each of these values. Your task is to load in the data, and then add these columns to the `DataFrame`. For convenience, here is a `DataFrame` with data about each frog.

In [3]:
df_frog = pd.DataFrame(data={'ID': ['I', 'II', 'III', 'IV'],
                             'age': ['adult', 'adult', 'juvenile', 'juvenile'],
                             'SVL (mm)': [63, 70, 28, 31],
                             'weight (g)': [63.1, 72.7, 12.7, 12.7],
                             'species': ['cross', 'cross', 'cranwelli', 'cranwelli']})

Note: There are lots of ways to solve this problem. This is a good exercise in searching through the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/) and other online resources, such as [Stack Overflow](https://stackoverflow.com/questions). Remember, much of your programming effort is spent searching through documentation and the internet.
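
As one illustration of the many possible approaches, here is a minimal sketch of a left merge on a shared `ID` column. The data frames `df_meas` and `df_info` and their contents are hypothetical toy data, not the frog data set, and merging is only one of several ways to attach per-subject information to each measurement row.

```python
import pandas as pd

# Hypothetical toy data, not the frog data set
df_meas = pd.DataFrame({'ID': ['A', 'A', 'B'],
                        'value': [1.0, 2.0, 3.0]})
df_info = pd.DataFrame({'ID': ['A', 'B'],
                        'group': ['control', 'treated']})

# Left merge: every measurement row picks up the metadata matching its ID
df_merged = df_meas.merge(df_info, on='ID', how='left')

print(df_merged)
```

The `how='left'` kwarg keeps every row of the measurement data frame, even if an ID had no matching metadata.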

After you have added this information to the data frame, make a scatter plot of adhesive force versus impact force and color the points by whether the frog is a juvenile or adult. The function you wrote in [Exercise 3.1](#Exercise-3.1%3A-Automating-scatter-plots) will be useful to do this.

<br/>

## Exercise 3.3: Long-term trends in hybridization of Darwin finches

[Peter and Rosemary Grant](https://en.wikipedia.org/wiki/Peter_and_Rosemary_Grant) have been working on the Galápagos island of Daphne Major for over forty years.  During this time, they have collected lots and lots of data about physiological features of finches.  In 2014, they published a book with a summary of some of their major results (Grant P. R., Grant B. R., *40 years of evolution. Darwin's finches on Daphne Major Island*, Princeton University Press, 2014). They made their data from the book publicly available via the [Dryad Digital Repository](http://dx.doi.org/10.5061/dryad.g6g3h).

We will investigate their measurements of beak depth (the distance, top to bottom, of a closed beak) and beak length (base to tip on the top) of Darwin's finches.  We will look at data from two species, *Geospiza fortis* and *Geospiza scandens*.  The Grants provided data on the finches of Daphne for the years 1973, 1975, 1987, 1991, and 2012.  I have included the data in the files `grant_1973.csv`, `grant_1975.csv`, `grant_1987.csv`, `grant_1991.csv`, and `grant_2012.csv`. They are in almost exactly the same format as in the Dryad repository; I have only deleted blank entries at the end of the files.

**Note**: If you want to skip the wrangling (which is very valuable experience), you can go directly to part (d). You can load in the `DataFrame` you generate in parts (a) through (c) from the file `~/git/bootcamp/data/grant_complete.csv`.

**a)** Load each of the files into separate Pandas `DataFrame`s.  You might want to inspect the file first to make sure you know what character the comments start with and if there is a header row.
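
As a reminder of the relevant `pd.read_csv()` kwargs, here is a minimal sketch using made-up file contents held in a string. The comment character, column names, and values below are assumptions for illustration only; inspect the actual files to see what they contain.

```python
import io
import pandas as pd

# Hypothetical file contents standing in for a CSV file on disk
csv_text = """# A comment line describing the data
# Another comment line
band,species,length
101,fortis,10.5
102,scandens,14.1
"""

# comment='#' skips anything following a '#' character;
# header=0 uses the first remaining line as the column names
df = pd.read_csv(io.StringIO(csv_text), comment='#', header=0)

print(df)
```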

**b)** We would like to merge these all into one `DataFrame`.  The problem is that they have different header names, and only the 1973 file has a year entry (called `yearband`).  This is common with real data.  It is often a bit messy and requires some wrangling.  

>1. First, change the name of the `yearband` column of the 1973 data to `year`.  Also, make sure the year format is four digits, not two!  
>2. Next, add a `year` column to the other four `DataFrame`s.  You want tidy data, so each row in the `DataFrame` should have an entry for the year.
>3. Change the column names so that all the `DataFrame`s have the same column names.  I would choose column names
>
>    `['band', 'species', 'beak length (mm)', 'beak depth (mm)', 'year']`
>
>4. Concatenate the `DataFrame`s into a single `DataFrame`. Be careful with indices! If you use `pd.concat()`, you will need to use the `ignore_index=True` kwarg. You might also need to use the `axis` kwarg.
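
The four steps above can be sketched on toy data as follows. The data frames `df_a` and `df_b`, their column names, and their values are hypothetical and do not match the Grant files; the sketch also assumes that every two-digit year belongs to the 1900s.

```python
import pandas as pd

# Hypothetical toy frames with mismatched column names
df_a = pd.DataFrame({'id': [1, 2], 'len': [10.0, 11.0], 'yr': [73, 73]})
df_b = pd.DataFrame({'ID': [3, 4], 'length (mm)': [12.0, 13.0]})

# Rename columns so both frames share the same names
df_a = df_a.rename(columns={'id': 'band', 'len': 'length (mm)', 'yr': 'year'})
df_b = df_b.rename(columns={'ID': 'band'})

# Convert two-digit years to four digits (assumes the 1900s)
df_a['year'] += 1900

# Add a year entry to every row of the frame that lacks one
df_b['year'] = 1975

# Stack the rows and build a fresh 0..n-1 index
df = pd.concat([df_a, df_b], axis=0, ignore_index=True)

print(df)
```

Without `ignore_index=True`, the concatenated frame would keep each input's original index, so index labels 0 and 1 would each appear twice.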

**c)** The `band` field gives the number of the band on the bird's leg that was used to tag it.  Are some birds counted twice?  Are they counted twice in the same year?  Do you think you should drop duplicate birds from the same year?  How about different years?  My opinion is that you should drop duplicate birds from the same year and keep the others, but I would be open to discussion on that.  To practice your Pandas skills, though, let's delete only duplicate birds from the same year from the `DataFrame`.  When you have made this `DataFrame`, save it as a CSV file.

*Hint*: The `DataFrame` methods `duplicated()` and `drop_duplicates()` will be useful.
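
Both methods accept a `subset` kwarg that restricts the comparison to chosen columns. Here is a minimal sketch on hypothetical toy data (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data: band 1 appears twice in 1973,
# band 2 appears once in 1973 and once in 1975
df = pd.DataFrame({'band': [1, 1, 2, 2],
                   'year': [1973, 1973, 1973, 1975],
                   'depth': [9.0, 9.1, 8.0, 8.2]})

# Boolean Series marking rows that repeat an earlier (band, year) pair
dups = df.duplicated(subset=['band', 'year'])

# Keep only the first occurrence of each (band, year) pair
df_unique = df.drop_duplicates(subset=['band', 'year'])

print(dups)
print(df_unique)
```

By default both methods keep the first occurrence; the `keep` kwarg changes that behavior.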

After doing this work, it is worth saving your tidy `DataFrame` in a CSV document. Do this using the `to_csv()` method of your `DataFrame`. Since the indices are uninformative, you should use the `index=False` kwarg. (I have already done this and saved it as `~/git/bootcamp/data/grant_complete.csv`, which will help you do the rest of the exercise if you have problems with this part.)
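
To see what `index=False` does, here is a small sketch on a hypothetical toy data frame. It writes the CSV text to a string rather than to a file (which is what `to_csv()` does when no path is given) so the result can be inspected directly.

```python
import io
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({'band': [1, 2], 'year': [1973, 1975]})

# With no path given, to_csv() returns the CSV text as a string;
# index=False leaves the row labels out of the output
csv_text = df.to_csv(index=False)

# Reading the text back recovers the same columns and values
df_reloaded = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
```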

**d)** Make plots exploring how beak depth changes over time for each species. Think about what might be effective ways to display the data.

**e)** It is informative to plot the measurement of each bird's beak as a point in the beak depth-beak length plane.  For the 1987 data, plot beak depth vs. beak length for *Geospiza fortis* and for *Geospiza scandens*. The function you wrote in [Exercise 3.1](#Exercise-3.1%3A-Automating-scatter-plots) will be useful to do this.

**f)** Do part (e) again for all years. _Hint_: To display all of the plots, check out the [Bokeh documentation for layouts](https://bokeh.pydata.org/en/latest/docs/user_guide/layout.html).  Make sure all of your plots have the same range on the axes. If you want to set two plots, say `p1` and `p2`, to have the same axis ranges, you can do the following.

```python
p1.x_range = p2.x_range
p1.y_range = p2.y_range
```

## Computing environment

In [29]:
%load_ext watermark
%watermark -v -p pandas,bokeh,bokeh_catplot,jupyterlab

CPython 3.7.3
IPython 7.1.1

pandas 0.24.2
bokeh 1.2.0
bokeh_catplot 0.1.0
jupyterlab 0.35.5
