(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This exercise was generated from a Jupyter notebook. You can download the notebook here.
import pandas as pd
!head -20 data/frog_tongue_adhesion.csv
So, each frog has associated with it an age (adult or juvenile), snout-vent-length (SVL), body weight, and species (either cross or cranwelli). For a tidy DataFrame, we should have a column for each of these values. Your task is to load in the data, and then add these columns to the DataFrame. For convenience, here is a DataFrame with data about each frog.
df_frog = pd.DataFrame(data={'ID': ['I', 'II', 'III', 'IV'],
                             'age': ['adult', 'adult', 'juvenile', 'juvenile'],
                             'SVL (mm)': [63, 70, 28, 31],
                             'weight (g)': [63.1, 72.7, 12.7, 12.7],
                             'species': ['cross', 'cross', 'cranwelli', 'cranwelli']})
There are lots of ways to solve this problem. This is a good exercise in searching through the Pandas documentation and other online resources, such as Stack Overflow. Remember, much of your programming effort is spent searching through documentation and the internet.
After you have added this information to the data frame, make a scatter plot of adhesive force versus impact force and color the points by whether the frog is a juvenile or adult.
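One possible way to attach the per-frog information is a merge on the frog ID. The sketch below is not the only approach. It uses a small made-up stand-in for the measurement data, so the column names `impact force (mN)` and `adhesive force (mN)` and all of their values are assumptions for illustration only; `df_frog` is the DataFrame given above.

```python
import pandas as pd

# Hypothetical stand-in for the loaded measurements. The real file has more
# rows and columns; these column names and numbers are invented.
df = pd.DataFrame({
    "ID": ["I", "I", "III", "IV"],
    "impact force (mN)": [1205, 2527, 614, 172],
    "adhesive force (mN)": [-785, -983, -94, -392],
})

# Per-frog metadata, as given in the exercise.
df_frog = pd.DataFrame(data={
    "ID": ["I", "II", "III", "IV"],
    "age": ["adult", "adult", "juvenile", "juvenile"],
    "SVL (mm)": [63, 70, 28, 31],
    "weight (g)": [63.1, 72.7, 12.7, 12.7],
    "species": ["cross", "cross", "cranwelli", "cranwelli"],
})

# A left merge keeps every measurement row and copies in the metadata
# of the frog whose ID matches.
df_tidy = df.merge(df_frog, on="ID", how="left")
```

After the merge, each measurement row carries its frog's age, SVL, weight, and species, so the age column is available for coloring the points in the scatter plot.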
Peter and Rosemary Grant have been working on the Galápagos island of Daphne Major for over forty years. During this time, they have collected lots and lots of data about physiological features of finches. In 2014, they published a book with a summary of some of their major results (Grant P. R., Grant B. R., 40 years of evolution. Darwin's finches on Daphne Major Island, Princeton University Press, 2014). They made their data from the book publicly available via the Dryad Digital Repository.
We will investigate their measurements of beak depth (the distance, top to bottom, of a closed beak) and beak length (base to tip on the top) of Darwin's finches. We will look at data from two species, Geospiza fortis and Geospiza scandens. The Grants provided data on the finches of Daphne for the years 1973, 1975, 1987, 1991, and 2012. I have included the data in the files grant_1973.csv, grant_1975.csv, grant_1987.csv, grant_1991.csv, and grant_2012.csv. They are in almost exactly the same format as in the Dryad repository; I have only deleted blank entries at the end of the files.
Note: If you want to skip the wrangling (which is very valuable experience), you can go directly to part (d). You can load in the DataFrame you generate in parts (a) through (c) from the file ~/git/bootcamp/data/grant_complete.csv.
a) Load each of the files into separate Pandas DataFrames. You might want to inspect the file first to make sure you know what character the comments start with and if there is a header row.
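As a minimal sketch of how comment lines and a header row can be handled when loading, the example below reads from an in-memory string rather than one of the actual files. The file contents, the `#` comment character, and the column names are all assumptions; inspect the real files to find out what they use.

```python
import io
import pandas as pd

# Made-up file contents for illustration only. The real files may use a
# different comment character and different column names.
raw = """# Hypothetical comment line
band,species,beak length,beak depth
123,fortis,10.5,9.1
456,scandens,14.2,8.9
"""

# comment="#" ignores lines that start with '#'. header=0 (the default)
# takes the first remaining line as the column names.
df_example = pd.read_csv(io.StringIO(raw), comment="#", header=0)
```

For a file on disk, the path to the file takes the place of the `io.StringIO` object.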
b) We would like to merge these all into one DataFrame. The problem is that they have different header names, and only the 1973 file has a year entry (called yearband). This is common with real data. It is often a bit messy and requires some wrangling.
- First, change the name of the yearband column of the 1973 data to year. Also, make sure the year format is four digits, not two!
- Next, add a year column to the other four DataFrames. You want tidy data, so each row in the DataFrame should have an entry for the year.
- Change the column names so that all the DataFrames have the same column names. I would choose column names ['band', 'species', 'beak length (mm)', 'beak depth (mm)', 'year'].
- Concatenate the DataFrames into a single DataFrame. Be careful with indices! If you use pd.concat(), you will need to use the ignore_index=True kwarg. You might also need to use the axis kwarg.
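The steps above can be sketched on two toy DataFrames. This is only one way to do it. The input column names (other than yearband) and all values are invented, and the real files will need their own renaming dictionaries.

```python
import pandas as pd

# Toy stand-ins for two of the loaded files. Values and the original
# column names (except 'yearband') are assumptions.
df_1973 = pd.DataFrame({
    "band": [1, 2],
    "species": ["fortis", "scandens"],
    "yearband": [73, 73],
    "beak length": [10.1, 14.0],
    "beak depth": [9.0, 8.8],
})
df_1975 = pd.DataFrame({
    "band": [3],
    "species": ["fortis"],
    "Beak length, mm": [10.4],
    "Beak depth, mm": [9.3],
})

# Rename columns, including yearband -> year, then make the year four digits.
df_1973 = df_1973.rename(columns={
    "yearband": "year",
    "beak length": "beak length (mm)",
    "beak depth": "beak depth (mm)",
})
df_1973["year"] = df_1973["year"] + 1900

# The 1975 data have no year column, so rename and then add one.
df_1975 = df_1975.rename(columns={
    "Beak length, mm": "beak length (mm)",
    "Beak depth, mm": "beak depth (mm)",
})
df_1975["year"] = 1975

# Put the columns in a common order and stack the rows.
# ignore_index=True gives the result a fresh 0..n-1 index.
columns = ["band", "species", "beak length (mm)", "beak depth (mm)", "year"]
df_all = pd.concat(
    [df_1973[columns], df_1975[columns]], axis=0, ignore_index=True
)
```

Without `ignore_index=True`, the concatenated DataFrame would keep each input's original index, so index labels would repeat.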
c) The band field gives the number of the band on the bird's leg that was used to tag it. Are some birds counted twice? Are they counted twice in the same year? Do you think you should drop duplicate birds from the same year? How about different years? My opinion is that you should drop duplicate birds from the same year and keep the others, but I would be open to discussion on that. To practice your Pandas skills, though, let's delete only duplicate birds from the same year from the DataFrame. When you have made this DataFrame, save it as a CSV file.
Hint: The DataFrame methods duplicated() and drop_duplicates() will be useful.
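To illustrate how these two methods behave, here is a minimal sketch on a toy DataFrame with invented values. It treats a bird as a duplicate when the same band appears more than once in the same year, which is one reading of the task, not necessarily the only one.

```python
import pandas as pd

# Invented data: band 1 appears twice in 1973 and once in 1975.
df_birds = pd.DataFrame({
    "band": [1, 1, 1, 2],
    "year": [1973, 1973, 1975, 1975],
    "beak depth (mm)": [9.0, 9.0, 9.2, 8.8],
})

# True for the second and later occurrences of a (band, year) pair.
is_repeat = df_birds.duplicated(subset=["band", "year"])

# Keep the first occurrence of each (band, year) pair and renumber the rows.
df_unique = df_birds.drop_duplicates(subset=["band", "year"]).reset_index(drop=True)
```

Band 1 survives in both 1973 and 1975 because the `subset` argument restricts the comparison to the band and year columns together.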
After doing this work, it is worth saving your tidy DataFrame in a CSV document. Do this using the to_csv() method of your DataFrame. Since the indices are uninformative, you should use the index=False kwarg. (I have already done this and saved it as ~/git/bootcamp/data/grant_complete.csv, which will help you do the rest of the exercise if you have problems with this part.)
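As a small illustration of what index=False does, the sketch below writes a toy DataFrame (invented values) to an in-memory buffer instead of a file, so the resulting text can be inspected directly.

```python
import io
import pandas as pd

# Toy DataFrame with invented values.
df_small = pd.DataFrame({"band": [1, 2], "year": [1973, 1975]})

# Write CSV text to a buffer. With index=False, the row index is left out,
# so the first column of the output is 'band' rather than 0, 1, ...
buffer = io.StringIO()
df_small.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

To write to disk, pass a file name to `to_csv()` in place of the buffer.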
d) It is informative to plot the measurement of each bird's beak as a point in the beak depth-beak length plane. For the 1987 data, plot beak depth vs. beak length for Geospiza fortis and for Geospiza scandens. Can you see the species demarcation?
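Before plotting, the 1987 rows have to be pulled out of the complete DataFrame. A minimal sketch of that selection with Boolean indexing follows. The toy values are invented, and the column names follow the ones suggested in part (b).

```python
import pandas as pd

# Toy stand-in for the complete data set; values are invented.
df_toy = pd.DataFrame({
    "species": ["fortis", "scandens", "fortis"],
    "beak length (mm)": [10.1, 14.0, 10.9],
    "beak depth (mm)": [9.0, 8.8, 9.4],
    "year": [1987, 1987, 2012],
})

# Keep only the rows measured in 1987.
df_year = df_toy.loc[df_toy["year"] == 1987, :]
```

The resulting DataFrame still has the species column, which can then be used to color the points.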
e) Do part (d) again for all years. (Hint: Check out the row encoding, and/or read about faceting in the Altair docs). Describe what you see. Do you see the changes in the differences between species (presumably as a result of introgressive hybridization)? In your plots, make sure all plots have the same range on the axes.
The Anderson-Fisher data set is a famous data set collected by Edgar Anderson and promoted by Ronald Fisher for use in his technique of linear discriminant analysis in taxonomic problems. The data set is now a classic that is widely used in data analysis. In this problem, you will explore this data set and ways of looking at it with Pandas/Altair. The data set is available in ~/data/fisher_iris.csv.
a) Generate a dash-dot plot of the petal width versus petal length. Why might this be a good way of visualizing this kind of data set?
b) Generate a matrix plot of this data set. What are the advantages of this kind of plot?
c) Explore for yourself! Come up with useful ways of plotting this multidimensional data set to help you explore it.