Exercise 4

This exercise was generated from a Jupyter notebook. You can download the notebook here.

Problem 4.1: Setting up a GitHub account

Having a local repository is useful, to be sure, but having a remote repository really brings you some power. Importantly, it allows very easy sharing and logging of work done by many people. Secondly, having a remote repository gets you extensive backups for free. With the exception of presentations for talks, I do all of my work within Git repositories hosted at GitHub, one of the most popular hosting services. BitBucket is another popular hosting service.

If your lab does not use remote version control repositories, start evangelizing, and get one!

In this problem, you will set up your own GitHub account and host the repo you set up for bootcamp there. For the rest of bootcamp, you should keep your work under version control.

Note: you usually do not keep outputs of code under version control unless they serve as a basis for comparison for unit tests. You also don't keep things you can't edit, like images, under version control. Otherwise, repositories tend to get very large.

a) Go to the GitHub website and set up an account. Make sure you get an academic account that lets you have free private repositories and the Student Developer Pack which gives you all sorts of goodies.

b) Follow the instructions here to set up your first repository (bootcamp). You will probably want this repository to be private.

c) Edit your README.md file with a description of what your bootcamp repository is all about. Then add and commit your changes.

d) Push your changes to the remote repository.

git push origin master

e) Now try to pull from the repository.

git pull

This pulls in all changes that have been made. Since your local versions now match what is in the remote repo, git will tell you that everything is up to date.

Peter and Rosemary Grant have been working on the Galápagos island of Daphne for over forty years. During this time, they have collected lots and lots of data about physiological features of finches. Last year, they published a book with a summary of some of their major results (Grant P. R., Grant B. R., Data from: 40 years of evolution. Darwin's finches on Daphne Major Island, Princeton University Press, 2014). They made their data from the book publicly available via the Dryad Digital Repository.

We will investigate their measurements of beak depth (the distance, top to bottom, of a closed beak) and beak length (base to tip on the top) of Darwin's finches. We will look at data from two species, Geospiza fortis and Geospiza scandens. The Grants provided data on the finches of Daphne for the years 1973, 1975, 1987, 1991, and 2012. You can download these files as they appear in the Dryad repo (though I changed file names and supplied a reference at the top of each file) here.

a) Load each of the files 1973.csv, 1975.csv, 1987.csv, 1991.csv, and 2012.csv into separate Pandas DataFrames. You might want to inspect the file first to make sure you know what character the comments start with and if there is a header row.
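
As a minimal sketch of the inspect-then-load step, the snippet below reads a small in-memory stand-in for one of the files. The sample text, the `#` comment character, and the column names are assumptions made for illustration; open the real files to see what they actually contain.

```python
import io

import pandas as pd

# Hypothetical stand-in for a data file: a commented reference line,
# a header row, then measurements. The real files may look different.
sample = io.StringIO(
    "# Reference: toy data, not the Grants' measurements\n"
    "band,species,blength,bdepth\n"
    "1,fortis,10.1,8.2\n"
    "2,scandens,14.0,9.1\n"
)

# comment="#" skips lines that start with "#"; header=0 takes the first
# remaining line as the column names.
df = pd.read_csv(sample, comment="#", header=0)

print(df.columns.tolist())
print(len(df))
```

For a file on disk, the same call would take the file name in place of the `io.StringIO` object.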

b) We would like to merge these all into one DataFrame. The problem is that they have different header names, and only the 1973 file has a year entry (called yearband). This is common with real data. It is often a bit messy and requires some munging.

  1. First, change the name of the yearband column of the 1973 data to year. Also, make sure the year format is four digits, not two!
  2. Next, add a year column to the other four DataFrames. You want tidy data, so each row in the DataFrame should have an entry for the year.
  3. Change the column names so that all the DataFrames have the same column names. I would choose column names

    ['band', 'species', 'beak length (mm)', 'beak depth (mm)', 'year']

  4. Concatenate the DataFrames into a single DataFrame.
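
The four steps above can be sketched on toy data as follows. The two small DataFrames and their original column names are hypothetical stand-ins, not the names in the real files; only the target column names come from the list above.

```python
import pandas as pd

# Toy frames with made-up column names (assumption: the real files differ).
df_1973 = pd.DataFrame({
    "band": [1, 2],
    "species": ["fortis", "scandens"],
    "beak length": [10.1, 14.0],
    "beak depth": [8.2, 9.1],
    "yearband": [73, 73],
})
df_1975 = pd.DataFrame({
    "band": [3],
    "species": ["fortis"],
    "Beak length, mm": [10.5],
    "Beak depth, mm": [8.4],
})

# Step 1: rename yearband to year and convert two-digit years to four digits.
df_1973 = df_1973.rename(columns={"yearband": "year"})
df_1973["year"] = df_1973["year"] + 1900

# Step 2: give every row of the other frame an explicit year.
df_1975["year"] = 1975

# Step 3: map each frame's own names onto the shared names.
df_1973 = df_1973.rename(columns={"beak length": "beak length (mm)",
                                  "beak depth": "beak depth (mm)"})
df_1975 = df_1975.rename(columns={"Beak length, mm": "beak length (mm)",
                                  "Beak depth, mm": "beak depth (mm)"})

# Step 4: stack the frames; ignore_index=True gives a fresh 0..n-1 index.
df = pd.concat([df_1973, df_1975], ignore_index=True)
print(df)
```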

c) The band field gives the number of the band on the bird's leg that was used to tag it. Are some birds counted twice? Are they counted twice in the same year? Do you think you should drop duplicate birds from the same year? How about different years? My opinion is that you should drop duplicate birds from the same year and keep the others, but I would be open to discussion on that. To practice your Pandas skills, though, let's delete only duplicate birds from the same year from the DataFrame. When you have made this DataFrame, save it as a CSV file.

Hint: The DataFrame methods duplicated() and drop_duplicates() will be useful.
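
To illustrate the hint, here is a small sketch of how the `subset` argument changes what counts as a duplicate. The toy DataFrame is made up; the column names follow the ones suggested in part (b).

```python
import pandas as pd

# Toy data: band 1 appears twice in 1973; band 2 appears in two different years.
df = pd.DataFrame({
    "band": [1, 1, 2, 2, 3],
    "year": [1973, 1973, 1973, 1975, 1975],
    "beak depth (mm)": [8.2, 8.3, 9.1, 9.0, 8.8],
})

# True for rows whose band was already seen in any year
# (the first occurrence is not flagged).
repeated_band = df.duplicated(subset="band")

# True only for rows whose (band, year) pair was already seen.
same_year_dup = df.duplicated(subset=["band", "year"])

# Drop same-year repeats, keeping the first record of each (band, year) pair.
deduped = df.drop_duplicates(subset=["band", "year"])

# With no path given, to_csv returns the CSV text; pass a file name to save it.
csv_text = deduped.to_csv(index=False)

print(repeated_band.sum(), same_year_dup.sum(), len(deduped))
```

Keeping the first occurrence is just the default (`keep="first"`); whether the first or last measurement of a bird is the better one to keep is a judgment call.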

d) Plot a histogram of beak depths of Geospiza fortis specimens measured in 1987. Plot a histogram of the beak depths of Geospiza scandens from the same year. These histograms should be on the same plot. On another plot, plot a histogram of beak lengths for the two species in 1987. Do you see a striking phenotypic difference?
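
One possible way to overlay two histograms on the same axes is sketched below, assuming Matplotlib for plotting. The data are toy values, and the species labels `fortis` and `scandens` are assumed spellings; the real files may encode species differently.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (made up); column names follow the ones suggested in part (b).
df = pd.DataFrame({
    "species": ["fortis"] * 4 + ["scandens"] * 4,
    "beak depth (mm)": [8.0, 8.5, 9.0, 9.5, 8.8, 9.0, 9.2, 9.4],
    "year": [1987] * 8,
})

df_1987 = df[df["year"] == 1987]

fig, ax = plt.subplots()
# One histogram per species on the same axes; alpha makes the overlap visible.
for species, group in df_1987.groupby("species"):
    ax.hist(group["beak depth (mm)"], bins=5, alpha=0.5, label=species)

ax.set_xlabel("beak depth (mm)")
ax.set_ylabel("count")
ax.legend()
```

Because each call to `ax.hist` chooses its own bin edges, passing an explicit array of edges to `bins` makes the two histograms directly comparable.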

e) Perhaps a more informative approach is to plot the measurement of each bird's beak as a point in the beak depth-beak length plane. For the 1987 data, plot beak depth vs. beak length for Geospiza fortis as blue dots, and for Geospiza scandens as red dots. Can you see the species demarcation?

f) Do part (e) again for all years. Describe what you see. Do you see the changes in the differences between species (presumably as a result of hybridization)? In your plots, make sure all plots have the same range on the axes.
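
A sketch of one way to keep the axis ranges identical across years, again assuming Matplotlib: create the panels with `sharex=True` and `sharey=True`. The data, the two years shown, and the species labels are toy assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (made up); column names follow the ones suggested in part (b).
df = pd.DataFrame({
    "species": ["fortis", "scandens", "fortis", "scandens"],
    "beak length (mm)": [10.0, 14.0, 11.0, 13.0],
    "beak depth (mm)": [8.0, 9.0, 9.5, 9.2],
    "year": [1987, 1987, 2012, 2012],
})

years = sorted(df["year"].unique())
colors = {"fortis": "blue", "scandens": "red"}

# sharex/sharey force every panel onto the same axis ranges.
fig, axes = plt.subplots(1, len(years), sharex=True, sharey=True)

for ax, year in zip(axes, years):
    in_year = df[df["year"] == year]
    for species, color in colors.items():
        pts = in_year[in_year["species"] == species]
        # "." draws unconnected dots; beak length on x, beak depth on y.
        ax.plot(pts["beak length (mm)"], pts["beak depth (mm)"], ".", color=color)
    ax.set_title(str(year))
    ax.set_xlabel("beak length (mm)")

axes[0].set_ylabel("beak depth (mm)")
```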

Problem 4.3: Mapping Protein to Nucleotide Sequences

The public sequence databases do an excellent job of connecting protein sequences with each other and other metadata, like sequence homology families, 3D structures, etc. However, the protein sequence databases have been structured such that the nucleotide coding sequence is divorced from the protein sequence in the databases. Neither the UniProt webpage nor the associated NCBI protein webpages for the same sequence provide a direct link to the coding sequence. UniProt acknowledges this issue but has no plans to change. (See here.)

Here, let's explore two methods to map a protein sequence to its nucleotide coding sequence.

a) Use a BLAST program to search for the E. coli MscK coding nucleotide sequence given by the identifier P77338. Provide justification for the program you use and how you identify the hit. Can you be sure that the hit corresponds to the coding sequence and not one that has silent mutations, i.e., same protein sequence but different mRNA sequence?

b) Use NCBI's Entrez service(s) to retrieve the MscK coding sequence without using BLAST!

  • Hint 1: What does the CDS feature in a GenBank file give you?

  • Hint 2: You will need to do more than one Entrez operation.

c) Compare and contrast the two methods. What are the advantages/disadvantages of each?