Lesson 30: Introduction to Pandas

(c) 2016 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This lesson was generated from a Jupyter notebook. You can download the notebook here.

In [26]:
import numpy as np

# Pandas, conventionally imported as pd
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
rc={'lines.linewidth': 2, 'axes.labelsize': 18, 'axes.titlesize': 18}
sns.set(rc=rc)

# The following is specific to Jupyter notebooks
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}

Throughout your research career, you will undoubtedly need to handle data, possibly lots of data. The data come in many formats, and you will spend much of your time munging (or wrangling) the data to get it into a usable form. We already did a little bit of munging when we sorted the C. elegans egg cross-sectional areas to compute cumulative histograms.

Pandas is the primary tool in the SciPy stack for handling data. Its primary object, the DataFrame, is extremely useful for munging data. We will explore some of that functionality here and put it to use in the next lesson.

Munging the egg cross-sectional area data

It is often useful to use the same data set to learn new things, since you are already familiar with the data. We'll keep using the egg cross-sectional area data set. In this case, we will use Pandas to import the data. We use the very handy (and faaaaar more powerful than np.loadtxt()) function pd.read_csv().

In [2]:
# Read in data files with pandas
df_high = pd.read_csv('data/xa_high_food.csv', comment='#')
df_low = pd.read_csv('data/xa_low_food.csv', comment='#')

Almost the same syntax as np.loadtxt(), but notice that the kwarg is comment and not comments. Let's now look at what we have.

In [3]:
df_high
Out[3]:
1683
0 2061
1 1792
2 1852
3 2091
4 1781
5 1912
6 1802
7 1751
8 1731
9 1892
10 1951
11 1809
12 1683
13 1787
14 1840
15 1821
16 1910
17 1930
18 1800
19 1833
20 1683
21 1671
22 1680
23 1692
24 1800
25 1821
26 1882
27 1642
28 1749
29 1712
30 1661
31 1701
32 2141
33 1863
34 1752
35 1740
36 1721
37 1660
38 1930
39 2030
40 1851
41 2131
42 1828

Whoa! IPython is displaying this in a different way! What kind of data type is this?

In [4]:
type(df_low)
Out[4]:
pandas.core.frame.DataFrame

Pandas has loaded the data in as a DataFrame. As I mentioned before, this is the central object for handling data.

We see that we have a bold column heading (1683) and bold row indices. Pandas interpreted the first non-comment line as a label for a column. We need to tell it not to do that using the header kwarg. Let's re-load the data.

In [5]:
# Read in data files with pandas with no row headings.
df_high = pd.read_csv('data/xa_high_food.csv', comment='#', header=None)
df_low = pd.read_csv('data/xa_low_food.csv', comment='#', header=None)

# Now look at it
df_high
Out[5]:
0
0 1683
1 2061
2 1792
3 1852
4 2091
5 1781
6 1912
7 1802
8 1751
9 1731
10 1892
11 1951
12 1809
13 1683
14 1787
15 1840
16 1821
17 1910
18 1930
19 1800
20 1833
21 1683
22 1671
23 1680
24 1692
25 1800
26 1821
27 1882
28 1642
29 1749
30 1712
31 1661
32 1701
33 2141
34 1863
35 1752
36 1740
37 1721
38 1660
39 1930
40 2030
41 1851
42 2131
43 1828

We see that Pandas has assigned a column heading of 0 to the column of data. What happens if we index it?

In [6]:
df_high[0]
Out[6]:
0     1683
1     2061
2     1792
3     1852
4     2091
5     1781
6     1912
7     1802
8     1751
9     1731
10    1892
11    1951
12    1809
13    1683
14    1787
15    1840
16    1821
17    1910
18    1930
19    1800
20    1833
21    1683
22    1671
23    1680
24    1692
25    1800
26    1821
27    1882
28    1642
29    1749
30    1712
31    1661
32    1701
33    2141
34    1863
35    1752
36    1740
37    1721
38    1660
39    1930
40    2030
41    1851
42    2131
43    1828
Name: 0, dtype: int64

Whoa again! What is this?

In [7]:
type(df_high[0])
Out[7]:
pandas.core.series.Series

A Pandas Series is basically a one-dimensional DataFrame. When we index it, we get what we might expect.

In [8]:
df_high[0][0]
Out[8]:
1683
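Under the hood, a Series stores its values in a NumPy array, which we can pull out with the .values attribute when we need a plain array. A minimal sketch with made-up values (not the actual data set):

```python
import pandas as pd

# Build a small Series from made-up cross-sectional areas
s = pd.Series([1683, 2061, 1792])

# .values gives the underlying NumPy array
arr = s.values

print(type(arr))   # <class 'numpy.ndarray'>
print(arr[0])      # 1683
```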

An aside: Football and learning what Pandas Series and DataFrames are

We can think of a Pandas Series as a generalized NumPy array. NumPy arrays are indexed with integers, but Pandas Series may be indexed with anything. As an example, we'll create a Series from a dictionary.

In [9]:
# Dictionary of top men's World Cup scorers and how many goals
wc_dict = {'Klose': 16,
           'Ronaldo': 15,
           'Müller': 14,
           'Fontaine': 13,
           'Pelé': 12,
           'Kocsis': 11,
           'Klinsmann': 11}

# Create a Series from the dictionary
s_goals = pd.Series(wc_dict)

# Take a look
s_goals
Out[9]:
Fontaine     13
Klinsmann    11
Klose        16
Kocsis       11
Müller       14
Pelé         12
Ronaldo      15
dtype: int64

Notice that the indexing now works like that of the dictionary that created it.

In [10]:
s_goals['Klose']
Out[10]:
16
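Because a Series behaves like a generalized NumPy array, NumPy-style operations, including aggregations and Boolean indexing, work on it directly while preserving the labels. A quick sketch, rebuilding the goals Series inline so the block is self-contained:

```python
import pandas as pd

# Same top-scorer dictionary as above
wc_dict = {'Klose': 16, 'Ronaldo': 15, 'Müller': 14, 'Fontaine': 13,
           'Pelé': 12, 'Kocsis': 11, 'Klinsmann': 11}
s_goals = pd.Series(wc_dict)

# Aggregations work as with NumPy arrays
top = s_goals.max()

# Boolean indexing keeps the (string) labels
prolific = s_goals[s_goals >= 14]
print(top, prolific.index.tolist())
```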

Now, what if we wanted to add more data to this, such as the country that each player represented? We can make another Series.

In [11]:
# Dictionary of nations
nation_dict = {'Klose': 'Germany',
               'Ronaldo': 'Brazil',
               'Müller': 'Germany',
               'Fontaine': 'France',
               'Pelé': 'Brazil',
               'Kocsis': 'Hungary',
               'Klinsmann': 'Germany'}

# Series with nations
s_nation = pd.Series(nation_dict)

# Look at it
s_nation
Out[11]:
Fontaine      France
Klinsmann    Germany
Klose        Germany
Kocsis       Hungary
Müller       Germany
Pelé          Brazil
Ronaldo       Brazil
dtype: object

Now, we can combine these into a DataFrame. We use pd.DataFrame() to instantiate a DataFrame, passing in a dictionary whose keys are the column headers and values are the series we're building into a DataFrame.

In [12]:
# Combine into a DataFrame
df_wc = pd.DataFrame({'nation': s_nation, 'goals': s_goals})

# Take a look
df_wc
Out[12]:
goals nation
Fontaine 13 France
Klinsmann 11 Germany
Klose 16 Germany
Kocsis 11 Hungary
Müller 14 Germany
Pelé 12 Brazil
Ronaldo 15 Brazil

Notice now that the DataFrame is indexed by player names. The column headings are goals and nation. When using the bracket notation, we cannot directly use the indices, only the column headings.

In [13]:
df_wc['Fontaine']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-13-ae1334ed016e> in <module>()
----> 1 df_wc['Fontaine']

...

KeyError: 'Fontaine'

But we can index by columns.

In [13]:
df_wc['goals']
Out[13]:
Fontaine     13
Klinsmann    11
Klose        16
Kocsis       11
Müller       14
Pelé         12
Ronaldo      15
Name: goals, dtype: int64

If we want just the goals and nation of Fontaine, we use the .loc attribute of a DataFrame, which allows slice-like indexing.

In [27]:
df_wc.loc['Fontaine', :]
Out[27]:
goals         13
nation    France
Name: Fontaine, dtype: object

When using .loc, the first entry is the index and the second is the column. So, we asked Pandas to give us the row indexed with 'Fontaine' and all columns (using the colon, as in NumPy arrays).

We can look at only the German players, for instance, using Boolean indexing similar to that of NumPy arrays.

In [29]:
df_wc.loc[df_wc['nation']=='Germany', :]
Out[29]:
goals nation
Klinsmann 11 Germany
Klose 16 Germany
Müller 14 Germany
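Boolean conditions can also be combined with & (and) and | (or); each condition needs its own parentheses. A self-contained sketch, rebuilding the footballer DataFrame inline:

```python
import pandas as pd

# Same footballer data as above
df_wc = pd.DataFrame({
    'goals': {'Klose': 16, 'Ronaldo': 15, 'Müller': 14, 'Fontaine': 13,
              'Pelé': 12, 'Kocsis': 11, 'Klinsmann': 11},
    'nation': {'Klose': 'Germany', 'Ronaldo': 'Brazil', 'Müller': 'Germany',
               'Fontaine': 'France', 'Pelé': 'Brazil', 'Kocsis': 'Hungary',
               'Klinsmann': 'Germany'}})

# German players with at least 14 goals; note the parentheses around
# each condition, which & requires
high_scoring_germans = df_wc.loc[(df_wc['nation'] == 'Germany')
                                 & (df_wc['goals'] >= 14), :]
print(high_scoring_germans)
```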

Combining the cross-sectional area data into a DataFrame

Now, back to our cross-sectional area data. We can combine these data into a DataFrame as well, even though they have differing numbers of data points. However, because df_low and df_high are not Series, we cannot just use the method we used for the footballers. Instead, we will use the pd.concat() function to concatenate DataFrames.

An added complication is that, as they are now, the two DataFrames have the same column heading of 0. We should change that for each before concatenating.

In [16]:
# Change column headings
df_low.columns = ['low']
df_high.columns = ['high']

# Take a look
df_high
Out[16]:
high
0 1683
1 2061
2 1792
3 1852
4 2091
5 1781
6 1912
7 1802
8 1751
9 1731
10 1892
11 1951
12 1809
13 1683
14 1787
15 1840
16 1821
17 1910
18 1930
19 1800
20 1833
21 1683
22 1671
23 1680
24 1692
25 1800
26 1821
27 1882
28 1642
29 1749
30 1712
31 1661
32 1701
33 2141
34 1863
35 1752
36 1740
37 1721
38 1660
39 1930
40 2030
41 1851
42 2131
43 1828

Now, we can concatenate the two DataFrames into one. We just pass a tuple of the DataFrames we want to concatenate as an argument. We specify the kwarg axis=1 to indicate that we want two columns, instead of appending the second DataFrame at the bottom of the first (as we would get with axis=0).

In [17]:
# Concatenate DataFrames
df = pd.concat((df_low, df_high), axis=1)

# See the result
df
Out[17]:
low high
0 1840 1683.0
1 2090 2061.0
2 2169 1792.0
3 1988 1852.0
4 2212 2091.0
5 2339 1781.0
6 1989 1912.0
7 2144 1802.0
8 2290 1751.0
9 1920 1731.0
10 2280 1892.0
11 1809 1951.0
12 2158 1809.0
13 1800 1683.0
14 2133 1787.0
15 2060 1840.0
16 2160 1821.0
17 2001 1910.0
18 2030 1930.0
19 2088 1800.0
20 1951 1833.0
21 2460 1683.0
22 2021 1671.0
23 2010 1680.0
24 2139 1692.0
25 2160 1800.0
26 2106 1821.0
27 2171 1882.0
28 2113 1642.0
29 2179 1749.0
30 1890 1712.0
31 2179 1661.0
32 2021 1701.0
33 1969 2141.0
34 2150 1863.0
35 1900 1752.0
36 2267 1740.0
37 1711 1721.0
38 1901 1660.0
39 2114 1930.0
40 2112 2030.0
41 2361 1851.0
42 2130 2131.0
43 2061 1828.0
44 2121 NaN
45 1832 NaN
46 2210 NaN
47 2130 NaN
48 2153 NaN
49 2009 NaN
50 2100 NaN
51 2252 NaN
52 2143 NaN
53 2252 NaN
54 2222 NaN
55 2121 NaN
56 2409 NaN

Note that the shorter of the two columns was filled with NaN, which means "not a number."
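The NaN entries can be counted with isnull() and discarded with dropna(). A small sketch with made-up values standing in for the tail of the combined DataFrame:

```python
import numpy as np
import pandas as pd

# Mimic the tail of the combined DataFrame, where 'high' has run out
df_small = pd.DataFrame({'low': [2121, 1832, 2210],
                         'high': [1828.0, np.nan, np.nan]})

# Count NaN entries in the shorter column
n_missing = df_small['high'].isnull().sum()

# dropna() returns the Series without the NaN entries
high_clean = df_small['high'].dropna()
print(n_missing, len(high_clean))   # 2 1
```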

Outputting a new CSV file

We can now write out a single CSV file with the DataFrame. We use the index=False kwarg to tell Pandas not to explicitly write the indices to the file.

In [20]:
# Write out DataFrame
df.to_csv('xa_combined.csv', index=False)

Let's take a look at what this file looks like.

In [21]:
!head xa_combined.csv
low,high
1840,1683.0
2090,2061.0
2169,1792.0
1988,1852.0
2212,2091.0
2339,1781.0
1989,1912.0
2144,1802.0
2290,1751.0

The headers are now included. Now when we load the data, we get a convenient DataFrame.

In [22]:
# Load DataFrame
df_reloaded = pd.read_csv('xa_combined.csv')

# Take a look
df_reloaded
Out[22]:
low high
0 1840 1683.0
1 2090 2061.0
2 2169 1792.0
3 1988 1852.0
4 2212 2091.0
5 2339 1781.0
6 1989 1912.0
7 2144 1802.0
8 2290 1751.0
9 1920 1731.0
10 2280 1892.0
11 1809 1951.0
12 2158 1809.0
13 1800 1683.0
14 2133 1787.0
15 2060 1840.0
16 2160 1821.0
17 2001 1910.0
18 2030 1930.0
19 2088 1800.0
20 1951 1833.0
21 2460 1683.0
22 2021 1671.0
23 2010 1680.0
24 2139 1692.0
25 2160 1800.0
26 2106 1821.0
27 2171 1882.0
28 2113 1642.0
29 2179 1749.0
30 1890 1712.0
31 2179 1661.0
32 2021 1701.0
33 1969 2141.0
34 2150 1863.0
35 1900 1752.0
36 2267 1740.0
37 1711 1721.0
38 1901 1660.0
39 2114 1930.0
40 2112 2030.0
41 2361 1851.0
42 2130 2131.0
43 2061 1828.0
44 2121 NaN
45 1832 NaN
46 2210 NaN
47 2130 NaN
48 2153 NaN
49 2009 NaN
50 2100 NaN
51 2252 NaN
52 2143 NaN
53 2252 NaN
54 2222 NaN
55 2121 NaN
56 2409 NaN

A note on tidy data

Hadley Wickham wrote a great article in favor of "tidy data." Tidy DataFrames follow the rules:

  1. Each variable is a column.
  2. Each observation is a row.
  3. Each type of observation has its own separate DataFrame.

This is less pretty to visualize as a table, but we rarely look at data in tables. In my experience since taking Wickham's words to heart, I can say that keeping tidy DataFrames makes life much easier!

A tidy DataFrame of the cross-sectional area data would have two columns, "food density" and "cross-sectional area (sq micron)." Remember, each variable is a column. Each row, then, corresponds to an individual measurement. This results in a lot of repeated entries (we'll have 44 highs), but it is logically clean and easy to work with. Let's tidy our DataFrame. The pd.melt() function makes this easy.

In [25]:
df = pd.melt(df, var_name='food density', 
             value_name='cross-sectional area (sq micron)').dropna()

Here, we used the dropna() method to drop any row that contained a NaN. Saving this DataFrame takes a bit more storage, but it is much clearer.
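One payoff of the tidy format: per-group summary statistics become one-liners with groupby(). A sketch using a few made-up rows in the same tidy shape as the melted DataFrame:

```python
import pandas as pd

# A tiny tidy DataFrame, same shape as the melted data
df_tidy = pd.DataFrame({
    'food density': ['high', 'high', 'low', 'low'],
    'cross-sectional area (sq micron)': [1683, 2061, 1840, 2090]})

# Mean cross-sectional area for each food density
means = df_tidy.groupby('food density')['cross-sectional area (sq micron)'].mean()
print(means)
```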