Lesson 29: Introduction to Pandas

(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This lesson was generated from a Jupyter notebook. You can download the notebook here.

In [48]:
import numpy as np

# Pandas, conventionally imported as pd
import pandas as pd

# Plotting modules and settings.
import matplotlib.pyplot as plt
import seaborn as sns
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728',
          '#9467bd', '#8c564b', '#e377c2', '#7f7f7f',
          '#bcbd22', '#17becf']
sns.set(style='whitegrid', palette=colors, rc={'axes.labelsize': 16})

# The following is specific to Jupyter notebooks
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}

Throughout your research career, you will undoubtedly need to handle data, possibly lots of data. The data comes in lots of formats, and you will spend much of your time wrangling the data to get it into a usable form. We already did a little bit of wrangling when we sorted the C. elegans egg cross-sectional areas to be able to compute cumulative histograms.

Pandas is the primary tool in the SciPy stack for handling data. Its primary object, the DataFrame, is extremely useful in wrangling data. We will explore some of that functionality here, and will put it to use in the next lesson.

Wrangling the egg cross-sectional area data

It is often useful to use the same data set to learn new techniques, since you are already familiar with the data. We'll keep using the egg cross-sectional area data set. In this case, we will use Pandas to import the data. We use the very handy (and faaaaar more powerful than np.loadtxt()) function pd.read_csv().

In [49]:
# Read in data files with pandas
df_high = pd.read_csv('data/xa_high_food.csv', comment='#')
df_low = pd.read_csv('data/xa_low_food.csv', comment='#')

Almost the same syntax as np.loadtxt(), but notice that the kwarg is comment and not comments. Let's now look at what we have.

In [50]:
df_high
Out[50]:
1683
0 2061
1 1792
2 1852
3 2091
4 1781
5 1912
6 1802
7 1751
8 1731
9 1892
10 1951
11 1809
12 1683
13 1787
14 1840
15 1821
16 1910
17 1930
18 1800
19 1833
20 1683
21 1671
22 1680
23 1692
24 1800
25 1821
26 1882
27 1642
28 1749
29 1712
30 1661
31 1701
32 2141
33 1863
34 1752
35 1740
36 1721
37 1660
38 1930
39 2030
40 1851
41 2131
42 1828

Whoa! IPython is displaying this in a different way! What kind of data type is this?

In [51]:
type(df_low)
Out[51]:
pandas.core.frame.DataFrame

Pandas has loaded the data in as a DataFrame. As I mentioned before, this is the central object for handling data.

We see above that we have a bold column heading (1683) and bold row indices. Pandas interpreted the first non-comment line as a label for a column. We need to tell it not to do that using the header kwarg. Let's re-load the data.

In [52]:
# Read in data files with pandas with no header row
df_high = pd.read_csv('data/xa_high_food.csv', comment='#', header=None)
df_low = pd.read_csv('data/xa_low_food.csv', comment='#', header=None)

# Now look at it
df_high
Out[52]:
0
0 1683
1 2061
2 1792
3 1852
4 2091
5 1781
6 1912
7 1802
8 1751
9 1731
10 1892
11 1951
12 1809
13 1683
14 1787
15 1840
16 1821
17 1910
18 1930
19 1800
20 1833
21 1683
22 1671
23 1680
24 1692
25 1800
26 1821
27 1882
28 1642
29 1749
30 1712
31 1661
32 1701
33 2141
34 1863
35 1752
36 1740
37 1721
38 1660
39 1930
40 2030
41 1851
42 2131
43 1828

We see that Pandas has assigned a column heading of 0 to the column of data. What happens if we index it?

In [53]:
df_high[0]
Out[53]:
0     1683
1     2061
2     1792
3     1852
4     2091
5     1781
6     1912
7     1802
8     1751
9     1731
10    1892
11    1951
12    1809
13    1683
14    1787
15    1840
16    1821
17    1910
18    1930
19    1800
20    1833
21    1683
22    1671
23    1680
24    1692
25    1800
26    1821
27    1882
28    1642
29    1749
30    1712
31    1661
32    1701
33    2141
34    1863
35    1752
36    1740
37    1721
38    1660
39    1930
40    2030
41    1851
42    2131
43    1828
Name: 0, dtype: int64

Whoa again! What is this?

In [54]:
type(df_high[0])
Out[54]:
pandas.core.series.Series

A Pandas Series is basically a one-dimensional DataFrame. When we index it, we get what we might expect.

In [55]:
df_high[0][0]
Out[55]:
1683
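Under the hood, a Series wraps a NumPy array together with an index, which you can inspect separately. A quick sketch with a toy Series (made-up values, not the egg data):

```python
import pandas as pd

# Build a small Series to inspect
s = pd.Series([1683, 2061, 1792])

# The underlying NumPy array and the index are separate attributes
print(s.values)   # the data as a NumPy array
print(s.index)    # the index, a RangeIndex by default
```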

An aside: Football and learning what Pandas Series and DataFrames are

We can think of a Pandas Series as a generalized NumPy array. NumPy arrays are indexed with integers, but Pandas Series may be indexed with anything. As an example, we'll create a Series from a dictionary.

In [56]:
# Dictionary of top men's World Cup scorers and how many goals
wc_dict = {'Klose': 16,
           'Ronaldo': 15,
           'Müller': 14,
           'Fontaine': 13,
           'Pelé': 12,
           'Kocsis': 11,
           'Klinsmann': 11}

# Create a Series from the dictionary
s_goals = pd.Series(wc_dict)

# Take a look
s_goals
Out[56]:
Fontaine     13
Klinsmann    11
Klose        16
Kocsis       11
Müller       14
Pelé         12
Ronaldo      15
dtype: int64

We'll notice that the indexing now works like the dictionary that created it.

In [57]:
s_goals['Klose']
Out[57]:
16

Now, what if we wanted to add more data to this, such as the country that each player represented? We can make another Series.

In [58]:
# Dictionary of nations
nation_dict = {'Klose': 'Germany',
               'Ronaldo': 'Brazil',
               'Müller': 'Germany',
               'Fontaine': 'France',
               'Pelé': 'Brazil',
               'Kocsis': 'Hungary',
               'Klinsmann': 'Germany'}

# Series with nations
s_nation = pd.Series(nation_dict)

# Look at it
s_nation
Out[58]:
Fontaine      France
Klinsmann    Germany
Klose        Germany
Kocsis       Hungary
Müller       Germany
Pelé          Brazil
Ronaldo       Brazil
dtype: object

Now, we can combine these into a DataFrame. We use pd.DataFrame() to instantiate a DataFrame, passing in a dictionary whose keys are the column headers and values are the series we're building into a DataFrame.

In [59]:
# Combine into a DataFrame
df_wc = pd.DataFrame({'nation': s_nation, 'goals': s_goals})

# Take a look
df_wc
Out[59]:
goals nation
Fontaine 13 France
Klinsmann 11 Germany
Klose 16 Germany
Kocsis 11 Hungary
Müller 14 Germany
Pelé 12 Brazil
Ronaldo 15 Brazil

Notice now that the DataFrame is indexed by player names. The column headings are goals and nation. When using the bracket notation, we cannot directly use the indices, only the column headings.

In [60]:
df_wc['Fontaine']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Users/Justin/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2441             try:
-> 2442                 return self._engine.get_loc(key)
   2443             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 'Fontaine'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-60-ae1334ed016e> in <module>()
----> 1 df_wc['Fontaine']

/Users/Justin/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1962             return self._getitem_multilevel(key)
   1963         else:
-> 1964             return self._getitem_column(key)
   1965 
   1966     def _getitem_column(self, key):

/Users/Justin/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   1969         # get column
   1970         if self.columns.is_unique:
-> 1971             return self._get_item_cache(key)
   1972 
   1973         # duplicate columns & possible reduce dimensionality

/Users/Justin/anaconda/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1643         res = cache.get(item)
   1644         if res is None:
-> 1645             values = self._data.get(item)
   1646             res = self._box_item_values(item, values)
   1647             cache[item] = res

/Users/Justin/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3588 
   3589             if not isnull(item):
-> 3590                 loc = self.items.get_loc(item)
   3591             else:
   3592                 indexer = np.arange(len(self.items))[isnull(self.items)]

/Users/Justin/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2442                 return self._engine.get_loc(key)
   2443             except KeyError:
-> 2444                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2445 
   2446         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 'Fontaine'

But we can index by columns.

In [61]:
df_wc['goals']
Out[61]:
Fontaine     13
Klinsmann    11
Klose        16
Kocsis       11
Müller       14
Pelé         12
Ronaldo      15
Name: goals, dtype: int64

If we just want the goals and nation of Fontaine, we would use the .loc attribute of a DataFrame, which allows slice-like indexing.

In [62]:
df_wc.loc['Fontaine', :]
Out[62]:
goals         13
nation    France
Name: Fontaine, dtype: object

When using .loc, the first entry is the index and the second is the column. So, we asked Pandas to give us the row indexed with 'Fontaine' and all columns (using the colon, as in NumPy arrays).
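Beyond a single label, .loc also accepts a list of row labels, which pulls out several rows at once. A sketch, rebuilding a miniature version of the df_wc frame from above:

```python
import pandas as pd

# Miniature version of the footballer DataFrame
df_wc = pd.DataFrame({'goals': pd.Series({'Klose': 16, 'Müller': 14, 'Pelé': 12}),
                      'nation': pd.Series({'Klose': 'Germany',
                                           'Müller': 'Germany',
                                           'Pelé': 'Brazil'})})

# A list of row labels selects multiple rows; a column label selects the column
print(df_wc.loc[['Klose', 'Müller'], 'goals'])
```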

We can look at only the German players, for instance, using Boolean indexing similar to that of NumPy arrays.

In [63]:
df_wc.loc[df_wc['nation']=='Germany', :]
Out[63]:
goals nation
Klinsmann 11 Germany
Klose 16 Germany
Müller 14 Germany

Combining the cross-sectional area data into a DataFrame

Now, back to our cross-sectional area data. We can combine these data into a DataFrame as well, even though they have differing numbers of data points. However, because df_low and df_high are DataFrames, not Series, we cannot just use the method we used for the footballers. Instead, we will use the pd.concat() function to concatenate DataFrames.

An added complication is that, as they are now, the two DataFrames have the same column heading of 0. We should change that for each before concatenating.

In [64]:
# Change column headings
df_low.columns = ['low']
df_high.columns = ['high']

# Take a look
df_high
Out[64]:
high
0 1683
1 2061
2 1792
3 1852
4 2091
5 1781
6 1912
7 1802
8 1751
9 1731
10 1892
11 1951
12 1809
13 1683
14 1787
15 1840
16 1821
17 1910
18 1930
19 1800
20 1833
21 1683
22 1671
23 1680
24 1692
25 1800
26 1821
27 1882
28 1642
29 1749
30 1712
31 1661
32 1701
33 2141
34 1863
35 1752
36 1740
37 1721
38 1660
39 1930
40 2030
41 1851
42 2131
43 1828

Now, we can concatenate the two DataFrames into one. We just pass a tuple with the DataFrames we want to concatenate as an argument. We specify the kwarg axis=1 to indicate that we want to have two columns, instead of just appending the second DataFrame at the bottom of the first (as we would get with axis=0).

In [65]:
# Concatenate DataFrames
df = pd.concat((df_low, df_high), axis=1)

# See the result
df
Out[65]:
low high
0 1840 1683.0
1 2090 2061.0
2 2169 1792.0
3 1988 1852.0
4 2212 2091.0
5 2339 1781.0
6 1989 1912.0
7 2144 1802.0
8 2290 1751.0
9 1920 1731.0
10 2280 1892.0
11 1809 1951.0
12 2158 1809.0
13 1800 1683.0
14 2133 1787.0
15 2060 1840.0
16 2160 1821.0
17 2001 1910.0
18 2030 1930.0
19 2088 1800.0
20 1951 1833.0
21 2460 1683.0
22 2021 1671.0
23 2010 1680.0
24 2139 1692.0
25 2160 1800.0
26 2106 1821.0
27 2171 1882.0
28 2113 1642.0
29 2179 1749.0
30 1890 1712.0
31 2179 1661.0
32 2021 1701.0
33 1969 2141.0
34 2150 1863.0
35 1900 1752.0
36 2267 1740.0
37 1711 1721.0
38 1901 1660.0
39 2114 1930.0
40 2112 2030.0
41 2361 1851.0
42 2130 2131.0
43 2061 1828.0
44 2121 NaN
45 1832 NaN
46 2210 NaN
47 2130 NaN
48 2153 NaN
49 2009 NaN
50 2100 NaN
51 2252 NaN
52 2143 NaN
53 2252 NaN
54 2222 NaN
55 2121 NaN
56 2409 NaN

Note that the shorter of the two columns was filled with NaN, which means "not a number."
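Pandas has methods for flagging and discarding these missing entries. A minimal sketch with a small padded frame (made-up numbers, not the real measurements):

```python
import numpy as np
import pandas as pd

# Two columns of unequal length, padded with NaN as in the concatenation above
df_pad = pd.DataFrame({'low': [1840.0, 2090.0, 2169.0],
                       'high': [1683.0, 2061.0, np.nan]})

# isnull() flags missing entries; dropna() discards them
print(df_pad.isnull().sum())    # count of NaNs in each column
print(df_pad['high'].dropna())  # the high column without the NaN
```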

Tidy data

Hadley Wickham wrote a great article in favor of "tidy data." Tidy DataFrames follow the rules:

  1. Each variable is a column.
  2. Each observation is a row.
  3. Each type of observation has its own separate DataFrame.

This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy DataFrame is almost always much easier to work with than non-tidy formats.

A tidy DataFrame of the cross-sectional area data has two columns, "food density" and "cross-sectional area (sq micron)." Remember, each variable is a column. Each row, then, corresponds to an individual measurement. This results in a lot of repeated entries (we'll have 44 highs), but it is very clean logically and easy to work with. We will see later that we can use convenient and powerful plotting techniques if the data are tidy. Let's tidy our DataFrame. The pd.melt() function makes this easy.

In [66]:
df = pd.melt(df, var_name='food density', 
             value_name='cross-sectional area (sq micron)').dropna()
df
Out[66]:
food density cross-sectional area (sq micron)
0 low 1840.0
1 low 2090.0
2 low 2169.0
3 low 1988.0
4 low 2212.0
5 low 2339.0
6 low 1989.0
7 low 2144.0
8 low 2290.0
9 low 1920.0
10 low 2280.0
11 low 1809.0
12 low 2158.0
13 low 1800.0
14 low 2133.0
15 low 2060.0
16 low 2160.0
17 low 2001.0
18 low 2030.0
19 low 2088.0
20 low 1951.0
21 low 2460.0
22 low 2021.0
23 low 2010.0
24 low 2139.0
25 low 2160.0
26 low 2106.0
27 low 2171.0
28 low 2113.0
29 low 2179.0
... ... ...
71 high 1787.0
72 high 1840.0
73 high 1821.0
74 high 1910.0
75 high 1930.0
76 high 1800.0
77 high 1833.0
78 high 1683.0
79 high 1671.0
80 high 1680.0
81 high 1692.0
82 high 1800.0
83 high 1821.0
84 high 1882.0
85 high 1642.0
86 high 1749.0
87 high 1712.0
88 high 1661.0
89 high 1701.0
90 high 2141.0
91 high 1863.0
92 high 1752.0
93 high 1740.0
94 high 1721.0
95 high 1660.0
96 high 1930.0
97 high 2030.0
98 high 1851.0
99 high 2131.0
100 high 1828.0

101 rows × 2 columns

Now, pulling out records we want is easy. Say we want all measurements from mothers with high food content where the eggs had a cross-sectional area of more than 2000 µm$^2$. We can use Boolean indexing with .loc.

In [70]:
# Specify indices we want (note parentheses holding each Boolean)
inds = (df['food density'] == 'high') & (df['cross-sectional area (sq micron)'] > 2000)

# Pull out areas
df.loc[inds, 'cross-sectional area (sq micron)']
Out[70]:
58    2061.0
61    2091.0
90    2141.0
97    2030.0
99    2131.0
Name: cross-sectional area (sq micron), dtype: float64

We get a Series with the measurements we are after. Notice that the indices of the entries in the Series are preserved from the original DataFrame.
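If the gappy indices become inconvenient, the .reset_index() method renumbers the entries from zero. A sketch with a toy Series (made-up values):

```python
import pandas as pd

# A Series with non-contiguous indices, as left behind by Boolean indexing
s = pd.Series([2061.0, 2091.0, 2141.0], index=[58, 61, 90])

# drop=True discards the old indices instead of storing them as a column
s_renumbered = s.reset_index(drop=True)
print(s_renumbered)
```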

Outputting a new CSV file

We can now write out a single CSV file with the DataFrame. We use the index kwarg to ask Pandas not to explicitly write the indices to the file.

In [71]:
# Write out DataFrame
df.to_csv('xa_combined.csv', index=False)

Let's take a look at what this file looks like.

In [72]:
!head xa_combined.csv
food density,cross-sectional area (sq micron)
low,1840.0
low,2090.0
low,2169.0
low,1988.0
low,2212.0
low,2339.0
low,1989.0
low,2144.0
low,2290.0

The headers are now included. Now when we load the data, we get a convenient, tidy DataFrame.

In [73]:
# Load DataFrame
df_reloaded = pd.read_csv('xa_combined.csv')

# Take a look
df_reloaded
Out[73]:
food density cross-sectional area (sq micron)
0 low 1840.0
1 low 2090.0
2 low 2169.0
3 low 1988.0
4 low 2212.0
5 low 2339.0
6 low 1989.0
7 low 2144.0
8 low 2290.0
9 low 1920.0
10 low 2280.0
11 low 1809.0
12 low 2158.0
13 low 1800.0
14 low 2133.0
15 low 2060.0
16 low 2160.0
17 low 2001.0
18 low 2030.0
19 low 2088.0
20 low 1951.0
21 low 2460.0
22 low 2021.0
23 low 2010.0
24 low 2139.0
25 low 2160.0
26 low 2106.0
27 low 2171.0
28 low 2113.0
29 low 2179.0
... ... ...
71 high 1787.0
72 high 1840.0
73 high 1821.0
74 high 1910.0
75 high 1930.0
76 high 1800.0
77 high 1833.0
78 high 1683.0
79 high 1671.0
80 high 1680.0
81 high 1692.0
82 high 1800.0
83 high 1821.0
84 high 1882.0
85 high 1642.0
86 high 1749.0
87 high 1712.0
88 high 1661.0
89 high 1701.0
90 high 2141.0
91 high 1863.0
92 high 1752.0
93 high 1740.0
94 high 1721.0
95 high 1660.0
96 high 1930.0
97 high 2030.0
98 high 1851.0
99 high 2131.0
100 high 1828.0

101 rows × 2 columns
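One payoff of keeping the data tidy: split-apply-combine operations become one-liners. A sketch using a miniature tidy frame with the same column names as above (made-up numbers, not the real measurements):

```python
import pandas as pd

# Miniature tidy DataFrame with the same structure as df_reloaded
df = pd.DataFrame({'food density': ['low', 'low', 'high', 'high'],
                   'cross-sectional area (sq micron)': [2000.0, 2100.0,
                                                        1700.0, 1800.0]})

# Mean cross-sectional area for each food density
means = df.groupby('food density')['cross-sectional area (sq micron)'].mean()
print(means)
```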