(c) 2017 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This lesson was generated from a Jupyter notebook. You can download the notebook here.
import numpy as np
# Pandas, conventionally imported as pd
import pandas as pd
# Plotting modules and settings.
import matplotlib.pyplot as plt
import seaborn as sns
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728',
          '#9467bd', '#8c564b', '#e377c2', '#7f7f7f',
          '#bcbd22', '#17becf']
sns.set(style='whitegrid', palette=colors, rc={'axes.labelsize': 16})
# The following is specific to Jupyter notebooks
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}
Throughout your research career, you will undoubtedly need to handle data, possibly lots of data. Data come in many formats, and you will spend much of your time wrangling them into a usable form. We already did a little bit of wrangling when we sorted the C. elegans egg cross-sectional areas to be able to compute cumulative histograms.
Pandas is the primary tool in the SciPy stack for handling data. Its primary object, the DataFrame, is extremely useful for wrangling data. We will explore some of that functionality here, and will put it to use in the next lesson.
It is often useful to use the same data set to learn new techniques, since you are already familiar with the data. We'll keep using the egg cross-sectional area data set. In this case, we will use Pandas to import the data. We use the very handy (and faaaaar more powerful than np.loadtxt()) function pd.read_csv().
# Read in data files with pandas
df_high = pd.read_csv('data/xa_high_food.csv', comment='#')
df_low = pd.read_csv('data/xa_low_food.csv', comment='#')
Almost the same syntax as np.loadtxt(), but notice that the kwarg is comment and not comments. Let's now look at what we have.
df_high
Whoa! IPython is displaying this in a different way! What kind of data type is this?
type(df_low)
Pandas has loaded the data in as a DataFrame. As I mentioned before, this is the central object for handling data.
We see above that we have a bold column heading (1683) and bold row indices. Pandas interpreted the first non-comment line as a label for a column. We need to tell it not to do that using the header kwarg. Let's re-load the data.
# Read in data files with pandas, this time with no header row
df_high = pd.read_csv('data/xa_high_food.csv', comment='#', header=None)
df_low = pd.read_csv('data/xa_low_food.csv', comment='#', header=None)
# Now look at it
df_high
We see that Pandas has assigned a column heading of 0 to the column of data. What happens if we index it?
df_high[0]
Whoa again! What is this?
type(df_high[0])
A Pandas Series is basically a one-dimensional DataFrame. When we index it, we get what we might expect.
df_high[0][0]
We can think of a Pandas Series as a generalized NumPy array. NumPy arrays are indexed with integers, but a Pandas Series may be indexed with anything. As an example, we'll create a Series from a dictionary.
# Dictionary of top men's World Cup scorers and how many goals
wc_dict = {'Klose': 16,
           'Ronaldo': 15,
           'Müller': 14,
           'Fontaine': 13,
           'Pelé': 12,
           'Kocsis': 11,
           'Klinsmann': 11}
# Create a Series from the dictionary
s_goals = pd.Series(wc_dict)
# Take a look
s_goals
Notice that indexing now works like it does for the dictionary from which the Series was created.
s_goals['Klose']
Now, what if we wanted to add more data to this, such as the country that each player represented? We can make another Series.
# Dictionary of nations
nation_dict = {'Klose': 'Germany',
               'Ronaldo': 'Brazil',
               'Müller': 'Germany',
               'Fontaine': 'France',
               'Pelé': 'Brazil',
               'Kocsis': 'Hungary',
               'Klinsmann': 'Germany'}
# Series with nations
s_nation = pd.Series(nation_dict)
# Look at it
s_nation
Now, we can combine these into a DataFrame. We use pd.DataFrame() to instantiate a DataFrame, passing in a dictionary whose keys are the column headers and whose values are the Series we're building into a DataFrame.
# Combine into a DataFrame
df_wc = pd.DataFrame({'nation': s_nation, 'goals': s_goals})
# Take a look
df_wc
Notice now that the DataFrame is indexed by player names. The column headings are goals and nation. When using bracket notation, we cannot directly use the row indices, only the column headings.
df_wc['Fontaine']
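That call raises a KeyError, because bracket indexing searches among column headings, not row indices. A minimal sketch confirming this, rebuilding a cut-down version of the footballers DataFrame so it stands alone:

```python
import pandas as pd

# Cut-down version of the footballers DataFrame from this lesson
df_wc = pd.DataFrame(
    {'nation': pd.Series({'Fontaine': 'France', 'Klose': 'Germany'}),
     'goals': pd.Series({'Fontaine': 13, 'Klose': 16})})

# Bracket notation looks for a column named 'Fontaine' and fails
try:
    df_wc['Fontaine']
except KeyError:
    print("KeyError: 'Fontaine' is a row index, not a column heading")
```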
But we can index by columns.
df_wc['goals']
If we just want the goals and nation of Fontaine, we would use the .loc attribute of a DataFrame, which allows slice-like indexing.
df_wc.loc['Fontaine', :]
When using .loc, the first entry is the index and the second is the column. So, we asked Pandas to give us the row indexed with 'Fontaine' and all columns (using the colon, as in NumPy arrays).
We can look at only the German players, for instance, using Boolean indexing similar to that of NumPy arrays.
df_wc.loc[df_wc['nation']=='Germany', :]
Now, back to our cross-sectional area data. We can combine these data into a DataFrame as well, even though they have differing numbers of data points. However, because df_low and df_high are not Series but DataFrames, we cannot just use the method we used for the footballers. Instead, we will use the pd.concat() function to concatenate DataFrames.
An added complication is that, as they are now, the two DataFrames have the same column heading of 0. We should change that for each before concatenating.
# Change column headings
df_low.columns = ['low']
df_high.columns = ['high']
# Take a look
df_high
Now, we can concatenate the two DataFrames into one. We just pass a tuple with the DataFrames we want to concatenate as an argument. We specify the kwarg axis=1 to indicate that we want to have two columns, instead of just appending the second DataFrame at the bottom of the first (as we would get with axis=0).
# Concatenate DataFrames
df = pd.concat((df_low, df_high), axis=1)
# See the result
df
Note that the shorter of the two columns was filled with NaN, which means "not a number."
Hadley Wickham wrote a great article in favor of "tidy data." Tidy DataFrames follow these rules:

1. Each variable is a column.
2. Each observation is a row.
3. Each type of observational unit is a separate DataFrame.

This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data that is convenient for visualization is different from that which is convenient for analysis. A tidy DataFrame is almost always much easier to work with than non-tidy formats.
A tidy DataFrame of the cross-sectional area data has two columns, "food density" and "cross-sectional area (sq micron)." Remember, each variable is a column. Each row, then, corresponds to an individual measurement. This results in a lot of repeated entries (we'll have 44 highs), but it is logically very clean and easy to work with. We will see later that we can use convenient and powerful plotting techniques if the data are tidy. Let's tidy our DataFrame. The pd.melt() function makes this easy.
df = pd.melt(df, var_name='food density',
             value_name='cross-sectional area (sq micron)').dropna()
df
Now, pulling out records we want is easy. Say we want all measurements from mothers with high food content where the eggs had a cross-sectional area of more than 2000 µm$^2$. We can use Boolean indexing with .loc.
# Specify indices we want (note parentheses holding each Boolean)
inds = (df['food density'] == 'high') & (df['cross-sectional area (sq micron)'] > 2000)
# Pull out areas
df.loc[inds, 'cross-sectional area (sq micron)']
We get a Series with the measurements we are after. Notice that the indices of the entries in the Series are preserved from the original DataFrame.
We can now write out a single CSV file with the DataFrame. We use the index kwarg to ask Pandas not to explicitly write the indices to the file.
# Write out DataFrame
df.to_csv('xa_combined.csv', index=False)
Let's take a look at what this file looks like.
!head xa_combined.csv
The headers are now included. Now when we load the data, we get a convenient, tidy DataFrame.
# Load DataFrame
df_reloaded = pd.read_csv('xa_combined.csv')
# Take a look
df_reloaded