This lesson was generated from a Jupyter notebook. You can download the notebook here.
# NumPy of course
import numpy as np
# Pandas, conventionally imported as pd
import pandas as pd
# This is how we import the module of Matplotlib we'll be using
import matplotlib.pyplot as plt
# Seaborn makes plots pretty!
import seaborn as sns
# PyBeeswarm for some better data display
import beeswarm
# Magic function to make matplotlib inline; other style specs must come AFTER
%matplotlib inline
# This enables SVG graphics inline (only use with static plots (non-Bokeh))
%config InlineBackend.figure_format = 'svg'
# Set JB's favorite Seaborn settings
rc = {'lines.linewidth': 2, 'axes.labelsize': 18, 'axes.titlesize': 18,
      'axes.facecolor': '#DFDFE5'}
sns.set_context('notebook', rc=rc)
Throughout your research career, you will undoubtedly need to handle data, possibly lots of data. The data come in lots of formats, and you will spend much of your time munging (or wrangling) the data to get it into a usable form. We already did a little bit of munging when we sorted the *C. elegans* egg cross-sectional areas to be able to compute cumulative histograms.
Pandas is the primary tool in the SciPy stack for handling data. Its primary object, the `DataFrame`, is extremely useful in munging data. We will explore some of that functionality here, and will put it to use in the next lesson.
It is often useful to use the same data set to learn new things, since you are already familiar with the data. We'll keep using the egg cross-sectional area data set. In this case, we will use Pandas to import the data. We use the very handy (and faaaaar more powerful than `np.loadtxt()`) function `pd.read_csv()`.
# Read in data files with pandas
df_high = pd.read_csv('../data/xa_high_food.csv', comment='#')
df_low = pd.read_csv('../data/xa_low_food.csv', comment='#')
Almost the same syntax as `np.loadtxt()`, but notice that the kwarg is `comment` and not `comments`. Let's now look at what we have.
df_high
Whoa! IPython is displaying this in a different way! What kind of data type is this?
type(df_low)
Pandas has loaded the data in as a `DataFrame`. As I mentioned before, this is the central object for handling data.
We see that we have a bold column heading (`1683`) and bold row indices. Pandas interpreted the first non-comment line as a label for a column. We need to tell it not to do that using the `header` kwarg. Let's re-load the data.
# Read in data files with pandas with no row headings.
df_high = pd.read_csv('../data/xa_high_food.csv', comment='#', header=None)
df_low = pd.read_csv('../data/xa_low_food.csv', comment='#', header=None)
# Now look at it
df_high
We see that Pandas has assigned a column heading of `0` to the column of data. What happens if we index it?
df_high[0]
Whoa again! What is this?
type(df_high[0])
A Pandas `Series` is basically a one-dimensional `DataFrame`. When we index it, we get what we might expect.
df_high[0][0]
We can think of a Pandas `Series` as a generalized NumPy array. NumPy arrays are indexed with integers, but a Pandas `Series` may be indexed with anything. As an example, we'll create a `Series` from a dictionary.
# Dictionary of top men's World Cup scorers and how many goals
wc_dict = {'Klose': 16,
           'Ronaldo': 15,
           'Müller': 14,
           'Fontaine': 13,
           'Pelé': 12,
           'Kocsis': 11,
           'Klinsmann': 11}
# Create a Series from the dictionary
s_goals = pd.Series(wc_dict)
# Take a look
s_goals
You'll notice that the indexing now works like the dictionary that created it.
s_goals['Klose']
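Because the index carries labels, a `Series` supports both label-based and positional access. Here is a minimal sketch with a toy `Series` (made-up values, not the World Cup data):

```python
import pandas as pd

# Toy Series with string labels (hypothetical data)
s = pd.Series({'a': 10, 'b': 20, 'c': 30})

# Label-based access, like a dictionary
print(s['b'])     # 20

# Positional access with .iloc, like a NumPy array
print(s.iloc[0])  # 10
```

So the same object answers both "what is the value for label `b`?" and "what is the first entry?".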
Now, what if we wanted to add more data to this, such as the country that each player represented? We can make another `Series`.
# Dictionary of nations
nation_dict = {'Klose': 'Germany',
               'Ronaldo': 'Brazil',
               'Müller': 'Germany',
               'Fontaine': 'France',
               'Pelé': 'Brazil',
               'Kocsis': 'Hungary',
               'Klinsmann': 'Germany'}
# Series with nations
s_nation = pd.Series(nation_dict)
# Look at it
s_nation
Now, we can combine these into a `DataFrame`. We use `pd.DataFrame()` to instantiate a `DataFrame`, passing in a dictionary whose keys are the column headers and whose values are the `Series` we're building into a `DataFrame`.
# Combine into a DataFrame
df_wc = pd.DataFrame({'nation': s_nation, 'goals': s_goals})
# Take a look
df_wc
Notice now that the `DataFrame` is indexed by player names. The column headings are `goals` and `nation`. When using bracket notation, we cannot directly use the row indices, only the column headings.
df_wc['Fontaine']
But we can index by columns.
df_wc['goals']
If we just want the goals and nation of Fontaine, we use the `.loc` indexer.
df_wc.loc['Fontaine']
We can look at only the German players, for instance, using Boolean indexing similar to that of NumPy arrays.
df_wc[df_wc['nation'] == 'Germany']
As a final demonstration of the supreme power of `DataFrame`s, they come with plotting methods! We can make a bar chart of the goal scorers, first sorting them in order of number of goals scored.
# Make a bar chart of goals
ax = df_wc.sort_values('goals').plot(y='goals', kind='bar', legend=False)
plt.ylabel('goals')
ax.xaxis.grid(False)
## Combining the data into a `DataFrame`

Now, back to our cross-sectional area data. We can combine these data into a `DataFrame` as well, even though they have differing numbers of data points. However, because `df_low` and `df_high` are not `Series`, we cannot just use the method we used for the footballers. Instead, we will use the `pd.concat()` function to concatenate `DataFrame`s.
An added complication is that, as they are now, the two `DataFrame`s have the same column heading of `0`. We should change that for each before concatenating.
# Change column headings
df_low.columns = ['low']
df_high.columns = ['high']
# Take a look
df_high
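An alternative to assigning to `.columns` directly is the `rename` method, which maps old column names to new ones and returns a new `DataFrame`. A small sketch on a toy single-column frame:

```python
import pandas as pd

# Toy DataFrame with the default integer column heading
df = pd.DataFrame({0: [1.0, 2.0, 3.0]})

# Map the old heading to a new one; rename returns a new DataFrame
df = df.rename(columns={0: 'low'})
print(df.columns.tolist())  # ['low']
```

This is handy when a `DataFrame` has many columns and you only want to change a few.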
Now, we can concatenate the two `DataFrame`s into one. We just pass a tuple with the `DataFrame`s we want to concatenate as an argument. We specify the kwarg `axis=1` to indicate that we want to have two columns, instead of just appending the second `DataFrame` at the bottom of the first (as we would get with `axis=0`).
# Concatenate DataFrames
df = pd.concat((df_low, df_high), axis=1)
# See the result
df
Note that the shorter of the two columns was filled with `NaN`, which means "not a number." This `DataFrame` structure makes it easier to make plots. For example, you may have wondered why we did not make box plots from the data before. The reason was that it was too syntactically painful without the elegance of `DataFrame`s. We can just directly use the built-in plotting capabilities.
ax = df.plot(kind='box')
plt.ylabel('X-sect. area (µm$^2$)')
ax.xaxis.grid(False)
Here, the red lines are the medians, the top and bottom of each box are the 75th and 25th percentiles, respectively, and the whiskers extend to the most extreme data points within 1.5 times the interquartile range of the box edges (any points beyond that would be plotted individually as outliers).
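The quartiles underlying the box can be checked directly with `np.percentile`. A quick sketch with made-up numbers:

```python
import numpy as np

# Hypothetical measurements (not the egg data)
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# 25th, 50th (median), and 75th percentiles: box bottom, red line, box top
quartiles = np.percentile(data, [25, 50, 75])
print(quartiles)  # [3. 5. 7.]
```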
We can now write out a single CSV file with the `DataFrame`. We use the `index` kwarg to ask Pandas not to explicitly write the indices to the file.
# Write out DataFrame
df.to_csv('xa_combined.csv', index=False)
Now when we load the data, we get a convenient `DataFrame`.
# Load DataFrame
df_reloaded = pd.read_csv('xa_combined.csv')
# Take a look
df_reloaded
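The round trip can be sanity-checked with the `equals` method. A sketch with a toy frame and a hypothetical temporary filename:

```python
import pandas as pd

# Toy DataFrame standing in for the combined data
df = pd.DataFrame({'low': [1.5, 2.5], 'high': [3.5, 4.5]})

# Write without the index, then read back (hypothetical filename)
df.to_csv('roundtrip_test.csv', index=False)
df_reloaded = pd.read_csv('roundtrip_test.csv')

# The reloaded frame matches the original exactly
print(df.equals(df_reloaded))  # True
```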
Besides the fact that I hate dashed lines when they are unnecessary, I also believe that you should plot all points when the number of points is below about 100. A good way to do this is a beeswarm plot. We'll use `pybeeswarm` to do this.
# pybeeswarm needs list of NumPy arrays for plotting
list_of_data = [df['low'].dropna(), df['high'].dropna()]
labels = ['low', 'high']
# Generate a beeswarm plot
_, ax = beeswarm.beeswarm(list_of_data, labels=labels)
plt.ylabel('X-sect. area (µm$^2$)')
ax.xaxis.grid(False)
Hadley Wickham wrote a great article in favor of "tidy data." Tidy `DataFrame`s follow the rules:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms its own `DataFrame`.

This is less pretty to visualize as a table, but we rarely look at data in tables. In my brief experience since taking Wickham's words to heart, I can say that keeping tidy `DataFrame`s makes life much easier!
A tidy `DataFrame` of the cross-sectional area data would have two columns, "food density" and "cross-sectional area (sq micron)." Remember, each variable is a column. Each row, then, corresponds to an individual measurement. This results in a lot of repeated entries (we'll have 44 `high`s), but it is very clean logically and easy to work with. Let's tidy our `DataFrame`. The `pd.melt()` function makes this easy.
pd.melt(df, var_name='food density',
        value_name='cross-sectional area (sq micron)').dropna()
Saving this `DataFrame` takes a bit more storage, but it is much clearer. For example, it does not rely on comments to give the units of the measurements.
Not to worry, the default plotting still works with this format.
# Melt into tidy format, then box plot grouped by food density
df_tidy = pd.melt(df, var_name='food density',
                  value_name='cross-sectional area (sq micron)').dropna()
ax = df_tidy.boxplot(column='cross-sectional area (sq micron)',
                     by='food density')
plt.ylabel('X-sect. area (µm$^2$)')
ax.xaxis.grid(False)