Exercise 1.1: Practice with Pandas and Palmer’s Penguins

The Palmer penguins data set is a nice data set with which to practice various data science skills. For this exercise, we will use a subset of it, which you can download here: https://s3.amazonaws.com/bebi103.caltech.edu/data/penguins_subset.csv. The data set consists of measurements of three different species of penguins acquired at the Palmer Station in Antarctica. The measurements were made between 2007 and 2009 by Kristen Gorman.

a) Load the data set into a Pandas DataFrame called df. You will need to use the header=[0,1] kwarg of pd.read_csv() to load the data set in properly.
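
As a hint for how header=[0,1] behaves, here is a minimal sketch on a made-up miniature CSV, not the real file. The layout (species in the first header row, measurement names in the second) and the names csv_text and toy are assumptions for illustration only; check the actual file to see how it is organized.

```python
import io

import pandas as pd

# Hypothetical miniature CSV with two header rows (assumed layout, not the real file):
# the first row names the species, the second names the measurement.
csv_text = """Gentoo,Gentoo,Adelie,Adelie
bill_depth_mm,bill_length_mm,bill_depth_mm,bill_length_mm
16.3,48.4,18.5,37.3
15.8,46.3,16.9,38.7
"""

# header=[0, 1] tells Pandas to build a two-level column index from the first two rows
toy = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Columns of a two-level index are addressed with tuples
print(toy[("Gentoo", "bill_length_mm")])
```

With the real data set, you would pass the path or URL of the CSV file in place of the io.StringIO object.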

b) Take a look at df. Is it tidy? Why or why not?

c) Perform the following operations on the DataFrame you loaded in part (a) to generate a new DataFrame, df_tidy. You do not need to worry about what these operations do (you can learn about tidying data frames here); just do them to answer this question: Is the resulting data frame df_tidy tidy? Why or why not?

df_tidy = df.stack(
    level=0
).sort_index(
    level=1
).reset_index(
    level=1
).rename(
    columns={"level_1": "species"}
)

d) Using df_tidy, slice out all of the bill lengths for Gentoo penguins as a NumPy array.
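
One way to approach this kind of slicing is Boolean indexing with .loc followed by .to_numpy(). The sketch below uses a made-up toy data frame; the column name bill_length_mm and the values are assumptions, so adapt them to what you actually see in df_tidy.

```python
import numpy as np
import pandas as pd

# Toy tidy data frame; column names and values are made up for illustration
toy_tidy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Gentoo", "Chinstrap"],
        "bill_length_mm": [37.3, 48.4, 46.3, 49.0],
    }
)

# Select rows with a Boolean mask and a single column by label,
# then convert the resulting Series to a NumPy array
gentoo_lengths = toy_tidy.loc[
    toy_tidy["species"] == "Gentoo", "bill_length_mm"
].to_numpy()

print(gentoo_lengths)
```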

e) Make a new data frame containing the mean measured bill depth, bill length, body mass in kg, and flipper length for each species. You can use millimeters for all length measurements.
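
A split-apply-combine operation like this is commonly done with groupby(). Here is a minimal sketch on toy data with only two measurements; the column names (body_mass_g in particular, implying the mass is stored in grams) and the values are assumptions, not taken from the real data set.

```python
import pandas as pd

# Toy tidy data frame; column names and values are made up for illustration
toy_tidy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
        "bill_depth_mm": [18.0, 19.0, 15.0, 16.0],
        "body_mass_g": [3500.0, 3700.0, 5000.0, 5200.0],
    }
)

# Convert grams to kilograms in a new column before aggregating
toy_tidy["body_mass_kg"] = toy_tidy["body_mass_g"] / 1000

# Group by species and take the mean of the selected columns
toy_means = toy_tidy.groupby("species")[["bill_depth_mm", "body_mass_kg"]].mean()

print(toy_means)
```

The result is indexed by species; calling reset_index() on it would turn the species back into an ordinary column if you prefer that form.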

f) Save df_tidy as a file named penguins_subset_tidy.csv.
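
Writing a data frame to disk can be sketched with to_csv(). The example below writes a toy data frame to a temporary directory and reads it back; the file name toy_tidy.csv and the data are made up for illustration. Passing index=False is a design choice that assumes the row index carries no information you want to keep.

```python
import os
import tempfile

import pandas as pd

# Toy tidy data frame; column names and values are made up for illustration
toy_tidy = pd.DataFrame(
    {"species": ["Adelie", "Gentoo"], "bill_length_mm": [37.3, 48.4]}
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy_tidy.csv")

    # index=False omits the row index from the output file
    toy_tidy.to_csv(path, index=False)

    # Read the file back in to check what was written
    reloaded = pd.read_csv(path)

print(reloaded)
```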