(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This lesson was generated from a Jupyter notebook. You can download the notebook here.
import numpy as np
# Pandas, conventionally imported as pd
import pandas as pd
# Plotting modules and settings.
import matplotlib.pyplot as plt
import seaborn as sns
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728',
          '#9467bd', '#8c564b', '#e377c2', '#7f7f7f',
          '#bcbd22', '#17becf']
sns.set(style='whitegrid', palette=colors, rc={'axes.labelsize': 16})
# The following is specific to Jupyter notebooks
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}
Throughout your research career, you will undoubtedly need to handle data, possibly lots of data. Data come in many formats, and you will spend much of your time wrangling them into a usable form. We already did a little bit of wrangling when we sorted the C. elegans egg cross-sectional areas to be able to compute cumulative histograms.
Pandas is the primary tool in the SciPy stack for handling data. Its primary object, the DataFrame, is extremely useful for wrangling data. We will explore some of that functionality here, and will put it to use in the next lesson.
It is often useful to use the same data set to learn new techniques, since you are already familiar with the data. We'll keep using the egg cross-sectional area data set. In this case, we will use Pandas to import the data. We use the very handy (and faaaaar more powerful than np.loadtxt()) function pd.read_csv().
# Read in data files with pandas
df_high = pd.read_csv('data/xa_high_food.csv', comment='#')
df_low = pd.read_csv('data/xa_low_food.csv', comment='#')
Almost the same syntax as np.loadtxt(), but notice that the kwarg is comment and not comments. Let's now look at what we have.
df_high
Whoa! IPython is displaying this in a different way! What kind of data type is this?
type(df_low)
Pandas has loaded the data in as a DataFrame. As I mentioned before, this is the central object for handling data.
We see above that we have a bold column heading (1683) and bold row indices. Pandas interpreted the first non-comment line as a label for a column. We need to tell it not to do that using the header kwarg. Let's re-load the data.
# Read in data files with pandas, this time with no header row
df_high = pd.read_csv('data/xa_high_food.csv', comment='#', header=None)
df_low = pd.read_csv('data/xa_low_food.csv', comment='#', header=None)
# Now look at it
df_high
We see that Pandas has assigned a column heading of 0 to the column of data. What happens if we index it?
df_high[0]
Whoa again! What is this?
type(df_high[0])
A Pandas Series is basically a one-dimensional DataFrame. When we index it, we get what we might expect.
df_high[0][0]
We can think of a Pandas Series as a generalized NumPy array. NumPy arrays are indexed with integers, but a Pandas Series may be indexed with anything. As an example, we'll create a Series from a dictionary.
# Dictionary of top men's World Cup scorers and how many goals
wc_dict = {'Klose': 16,
           'Ronaldo': 15,
           'Müller': 14,
           'Fontaine': 13,
           'Pelé': 12,
           'Kocsis': 11,
           'Klinsmann': 11}
# Create a Series from the dictionary
s_goals = pd.Series(wc_dict)
# Take a look
s_goals
Notice that indexing now works like it does for the dictionary from which the Series was created.
s_goals['Klose']
Now, what if we wanted to add more data to this, such as the country that each player represented? We can make another Series.
# Dictionary of nations
nation_dict = {'Klose': 'Germany',
               'Ronaldo': 'Brazil',
               'Müller': 'Germany',
               'Fontaine': 'France',
               'Pelé': 'Brazil',
               'Kocsis': 'Hungary',
               'Klinsmann': 'Germany'}
# Series with nations
s_nation = pd.Series(nation_dict)
# Look at it
s_nation
Now, we can combine these into a DataFrame. We use pd.DataFrame() to instantiate a DataFrame, passing in a dictionary whose keys are the column headers and whose values are the Series we're building into a DataFrame.
# Combine into a DataFrame
df_wc = pd.DataFrame({'nation': s_nation, 'goals': s_goals})
# Take a look
df_wc
Notice now that the DataFrame is indexed by player names. The column headings are goals and nation. When using bracket notation, we cannot directly use the row indices, only the column headings.
df_wc['Fontaine']
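That call raises a KeyError, because bracket indexing searches among column headings, not row indices. A minimal sketch confirming this, rebuilding a cut-down version of the footballers DataFrame so it stands alone:

```python
import pandas as pd

# Cut-down version of the footballers DataFrame from this lesson
df_wc = pd.DataFrame(
    {'nation': pd.Series({'Fontaine': 'France', 'Klose': 'Germany'}),
     'goals': pd.Series({'Fontaine': 13, 'Klose': 16})})

# Bracket notation looks for a column named 'Fontaine' and fails
try:
    df_wc['Fontaine']
except KeyError:
    print("KeyError: 'Fontaine' is a row index, not a column heading")
```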
But we can index by columns.
df_wc['goals']
If we just want the goals and nation of Fontaine, we would use the .loc attribute of a DataFrame, which allows slice-like indexing.
df_wc.loc['Fontaine', :]
When using .loc, the first entry is the index and the second is the column. So, we asked Pandas to give us the row indexed with 'Fontaine' and all columns (using the colon, as in NumPy arrays).
We can look at only the German players, for instance, using Boolean indexing similar to that of NumPy arrays.
df_wc.loc[df_wc['nation']=='Germany', :]
Now, back to our cross-sectional area data. We can combine these data into a DataFrame as well, even though they have differing numbers of data points. However, because df_low and df_high are not Series but DataFrames, we cannot just use the method we used for the footballers. Instead, we will use the pd.concat() function to concatenate DataFrames.
An added complication is that, as they are now, the two DataFrames have the same column heading of 0. We should change that for each before concatenating.
# Change column headings
df_low.columns = ['low']
df_high.columns = ['high']
# Take a look
df_high
Now, we can concatenate the two DataFrames into one. We just pass a tuple with the DataFrames we want to concatenate as an argument. We specify the kwarg axis=1 to indicate that we want to have two columns, instead of just appending the second DataFrame at the bottom of the first (as we would get with axis=0).
# Concatenate DataFrames
df = pd.concat((df_low, df_high), axis=1)
# See the result
df
Note that the shorter of the two columns was filled with NaN, which means "not a number."
Hadley Wickham wrote a great article in favor of "tidy data." Tidy DataFrames follow these rules:

1. Each variable is a column.
2. Each observation is a row.
3. Each type of observational unit is a separate DataFrame.

This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data that is convenient for visualization is different from that which is convenient for analysis. A tidy DataFrame is almost always much easier to work with than non-tidy formats.
A tidy DataFrame of the cross-sectional area data has two columns, "food density" and "cross-sectional area (sq micron)." Remember, each variable is a column. Each row, then, corresponds to an individual measurement. This results in a lot of repeated entries (we'll have 44 highs), but it is logically very clean and easy to work with. We will see later that we can use convenient and powerful plotting techniques if the data are tidy. Let's tidy our DataFrame. The pd.melt() function makes this easy.
df = pd.melt(df, var_name='food density',
             value_name='cross-sectional area (sq micron)').dropna()
df
Now, pulling out records we want is easy. Say we want all measurements from mothers with high food content where the eggs had a cross-sectional area of more than 2000 µm$^2$. We can use Boolean indexing with .loc.
# Specify indices we want (note parentheses holding each Boolean)
inds = (df['food density'] == 'high') & (df['cross-sectional area (sq micron)'] > 2000)
# Pull out areas
df.loc[inds, 'cross-sectional area (sq micron)']
We get a Series with the measurements we are after. Notice that the indices of the entries in the Series are preserved from the original DataFrame.
We can now write out a single CSV file with the DataFrame. We use the index kwarg to ask Pandas not to explicitly write the indices to the file.
# Write out DataFrame
df.to_csv('xa_combined.csv', index=False)
Let's take a look at what this file looks like.
!head xa_combined.csv
The headers are now included. Now when we load the data, we get a convenient, tidy DataFrame.
# Load DataFrame
df_reloaded = pd.read_csv('xa_combined.csv')
# Take a look
df_reloaded