Lesson 29: Introduction to Pandas

This lesson was generated from a Jupyter notebook. You can download the notebook here.

In [1]:
# NumPy of course
import numpy as np

# Pandas, conventionally imported as pd
import pandas as pd

# This is how we import the module of Matplotlib we'll be using
import matplotlib.pyplot as plt

# Seaborn makes plots pretty!
import seaborn as sns

# PyBeeswarm for some better data display
import beeswarm

# Magic function to make matplotlib inline; other style specs must come AFTER
%matplotlib inline

# This enables SVG graphics inline (only use with static plots (non-Bokeh))
%config InlineBackend.figure_format = 'svg'

# Set JB's favorite Seaborn settings
rc={'lines.linewidth': 2, 'axes.labelsize': 18, 'axes.titlesize': 18, 
    'axes.facecolor': 'DFDFE5'}
sns.set_context('notebook', rc=rc)

Throughout your research career, you will undoubtedly need to handle data, possibly lots of data. The data comes in lots of formats, and you will spend much of your time munging (or wrangling) the data to get it into a usable form. We already did a little bit of munging when we sorted the C. elegans egg cross sectional areas to be able to compute cumulative histograms.

Pandas is the primary tool in the SciPy stack for handling data. Its primary object, the DataFrame is extremely useful in munging data. We will explore some of that functionality here, and will put it to use in the next lesson.

Munging the egg cross-sectional area data

It is often useful to use the same data set to learn new things, since you are already familiar with the data. We'll keep using the egg cross-sectional area data set. In this case, we will use Pandas to import the data. We use the very handy (and faaaaar more powerful than np.loadtxt()) function pd.read_csv().

In [2]:
# Read in data files with pandas
df_high = pd.read_csv('../data/xa_high_food.csv', comment='#')
df_low = pd.read_csv('../data/xa_low_food.csv', comment='#')

Almost the same syntax as np.loadtxt(), but notice that the kwarg is comment and not comments.
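
pd.read_csv() has many more useful kwargs beyond comment. As a rough sketch (the file name below is hypothetical), a call might look like this:

# A few of the many kwargs of pd.read_csv() (hypothetical file name)
df_example = pd.read_csv('some_data.csv',
                         comment='#',        # ignore lines beginning with #
                         delimiter=',',      # column separator (comma is the default)
                         skiprows=2,         # skip the first two lines of the file
                         na_values=['NA'])   # strings to interpret as missing data

Let's now look at what we have.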

In [3]:
df_high
Out[3]:
1683
0 2061
1 1792
2 1852
3 2091
4 1781
5 1912
6 1802
7 1751
8 1731
9 1892
10 1951
11 1809
12 1683
13 1787
14 1840
15 1821
16 1910
17 1930
18 1800
19 1833
20 1683
21 1671
22 1680
23 1692
24 1800
25 1821
26 1882
27 1642
28 1749
29 1712
30 1661
31 1701
32 2141
33 1863
34 1752
35 1740
36 1721
37 1660
38 1930
39 2030
40 1851
41 2131
42 1828

Whoa! IPython is displaying this in a different way! What kind of data type is this?

In [4]:
type(df_low)
Out[4]:
pandas.core.frame.DataFrame

Pandas has loaded the data in as a DataFrame. As I mentioned before, this is the central object for handling data.

We see that we have a bold column heading (1683) and bold row indices. Pandas interpreted the first non-comment line as a label for a column. We need to tell it not to do that using the header kwarg. Let's re-load the data.

In [5]:
# Read in data files with Pandas with no header row.
df_high = pd.read_csv('../data/xa_high_food.csv', comment='#', header=None)
df_low = pd.read_csv('../data/xa_low_food.csv', comment='#', header=None)

# Now look at it
df_high
Out[5]:
0
0 1683
1 2061
2 1792
3 1852
4 2091
5 1781
6 1912
7 1802
8 1751
9 1731
10 1892
11 1951
12 1809
13 1683
14 1787
15 1840
16 1821
17 1910
18 1930
19 1800
20 1833
21 1683
22 1671
23 1680
24 1692
25 1800
26 1821
27 1882
28 1642
29 1749
30 1712
31 1661
32 1701
33 2141
34 1863
35 1752
36 1740
37 1721
38 1660
39 1930
40 2030
41 1851
42 2131
43 1828

We see that Pandas has assigned a column heading of 0 to the column of data.
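
As an aside, we could also have given the column a descriptive name at load time using the names kwarg of pd.read_csv(). A quick sketch (stored under a different name so it does not interfere with what follows):

# Giving the column a name at load time (for illustration only)
df_high_named = pd.read_csv('../data/xa_high_food.csv', comment='#',
                            header=None, names=['high'])

For now, we will stick with the default heading. What happens if we index it?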

In [6]:
df_high[0]
Out[6]:
0     1683
1     2061
2     1792
3     1852
4     2091
5     1781
6     1912
7     1802
8     1751
9     1731
10    1892
11    1951
12    1809
13    1683
14    1787
15    1840
16    1821
17    1910
18    1930
19    1800
20    1833
21    1683
22    1671
23    1680
24    1692
25    1800
26    1821
27    1882
28    1642
29    1749
30    1712
31    1661
32    1701
33    2141
34    1863
35    1752
36    1740
37    1721
38    1660
39    1930
40    2030
41    1851
42    2131
43    1828
Name: 0, dtype: int64

Whoa again! What is this?

In [7]:
type(df_high[0])
Out[7]:
pandas.core.series.Series

A Pandas Series is basically a one-dimensional DataFrame. When we index it, we get what we might expect.

In [8]:
df_high[0][0]
Out[8]:
1683
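
If we ever need the underlying NumPy array, say to pass to a function that expects one, the values attribute of a Series gives it to us:

# The underlying NumPy array of the Series
df_high[0].values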

An aside: Football and learning what Pandas Series and DataFrames are

We can think of a Pandas Series as a generalized NumPy array. NumPy arrays are indexed with integers, but Pandas Series may be indexed with anything. As an example, we'll create a Series from a dictionary.

In [9]:
# Dictionary of top men's World Cup scorers and how many goals
wc_dict = {'Klose': 16,
           'Ronaldo': 15,
           'Müller': 14,
           'Fontaine': 13,
           'Pelé': 12,
           'Kocsis': 11,
           'Klinsmann': 11}

# Create a Series from the dictionary
s_goals = pd.Series(wc_dict)

# Take a look
s_goals
Out[9]:
Fontaine     13
Klinsmann    11
Klose        16
Kocsis       11
Müller       14
Pelé         12
Ronaldo      15
dtype: int64

Notice that indexing now works with the keys of the dictionary we used to create the Series.

In [10]:
s_goals['Klose']
Out[10]:
16
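
We can also pull out several entries at once by indexing with a list of labels, much like NumPy fancy indexing:

# Index with a list of labels to get several entries at once
s_goals[['Klose', 'Pelé']]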

Now, what if we wanted to add more data to this, such as the country that each player represented? We can make another Series.

In [11]:
# Dictionary of nations
nation_dict = {'Klose': 'Germany',
               'Ronaldo': 'Brazil',
               'Müller': 'Germany',
               'Fontaine': 'France',
               'Pelé': 'Brazil',
               'Kocsis': 'Hungary',
               'Klinsmann': 'Germany'}

# Series with nations
s_nation = pd.Series(nation_dict)

# Look at it
s_nation
Out[11]:
Fontaine      France
Klinsmann    Germany
Klose        Germany
Kocsis       Hungary
Müller       Germany
Pelé          Brazil
Ronaldo       Brazil
dtype: object

Now, we can combine these into a DataFrame. We use pd.DataFrame() to instantiate a DataFrame, passing in a dictionary whose keys are the column headers and whose values are the Series we're building into a DataFrame.

In [12]:
# Combine into a DataFrame
df_wc = pd.DataFrame({'nation': s_nation, 'goals': s_goals})

# Take a look
df_wc
Out[12]:
goals nation
Fontaine 13 France
Klinsmann 11 Germany
Klose 16 Germany
Kocsis 11 Hungary
Müller 14 Germany
Pelé 12 Brazil
Ronaldo 15 Brazil

Notice now that the DataFrame is indexed by player names. The column headings are goals and nation. When using the bracket notation, we cannot directly use the indices, only the column headings.

In [13]:
df_wc['Fontaine']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-13-ae1334ed016e> in <module>()
----> 1 df_wc['Fontaine']

/Users/Justin/anaconda/lib/python3.4/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1795             return self._getitem_multilevel(key)
   1796         else:
-> 1797             return self._getitem_column(key)
   1798 
   1799     def _getitem_column(self, key):

/Users/Justin/anaconda/lib/python3.4/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   1802         # get column
   1803         if self.columns.is_unique:
-> 1804             return self._get_item_cache(key)
   1805 
   1806         # duplicate columns & possible reduce dimensionaility

/Users/Justin/anaconda/lib/python3.4/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1082         res = cache.get(item)
   1083         if res is None:
-> 1084             values = self._data.get(item)
   1085             res = self._box_item_values(item, values)
   1086             cache[item] = res

/Users/Justin/anaconda/lib/python3.4/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   2849 
   2850             if not isnull(item):
-> 2851                 loc = self.items.get_loc(item)
   2852             else:
   2853                 indexer = np.arange(len(self.items))[isnull(self.items)]

/Users/Justin/anaconda/lib/python3.4/site-packages/pandas/core/index.py in get_loc(self, key, method)
   1570         """
   1571         if method is None:
-> 1572             return self._engine.get_loc(_values_from_object(key))
   1573 
   1574         indexer = self.get_indexer([key], method=method)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3824)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12280)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12231)()

KeyError: 'Fontaine'

But we can index by columns.

In [14]:
df_wc['goals']
Out[14]:
Fontaine     13
Klinsmann    11
Klose        16
Kocsis       11
Müller       14
Pelé         12
Ronaldo      15
Name: goals, dtype: int64

If we just want the goals and nation of Fontaine, we would use the .ix indexer.

In [15]:
df_wc.ix['Fontaine']
Out[15]:
goals         13
nation    France
Name: Fontaine, dtype: object
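
Note that in newer versions of Pandas, the .ix indexer has been deprecated (and ultimately removed) in favor of .loc for label-based lookups; the equivalent call would be:

# Label-based row selection with .loc (preferred in newer Pandas)
df_wc.loc['Fontaine']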

We can look at only the German players, for instance, using Boolean indexing similar to what we use with NumPy arrays.

In [16]:
df_wc[df_wc['nation'] == 'Germany']
Out[16]:
goals nation
Klinsmann 11 Germany
Klose 16 Germany
Müller 14 Germany
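
As with NumPy arrays, we can combine Boolean criteria using the & and | operators, wrapping each comparison in parentheses. For example, to get German players with more than 12 goals:

# German players with more than 12 goals
df_wc[(df_wc['nation'] == 'Germany') & (df_wc['goals'] > 12)]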

As a final demonstration of the supreme power of DataFrames, they come with plotting methods! We can make a bar chart of the goal scorers, first sorting them in order of number of goals scored.

In [19]:
# Make a bar chart of goals
ax = df_wc.sort('goals').plot(y='goals', kind='bar', legend=False)
plt.ylabel('goals')
ax.xaxis.grid(False)
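
Note that in newer versions of Pandas, the DataFrame sort() method has been replaced by sort_values(); the equivalent call would be:

# Equivalent call in newer Pandas (sort() was replaced by sort_values())
ax = df_wc.sort_values('goals').plot(y='goals', kind='bar', legend=False)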

Combining the cross-sectional area data into a DataFrame

Now, back to our cross-sectional area data. We can combine these data into a DataFrame as well, even though they have differing numbers of data points. However, because df_low and df_high are not Series, we cannot just use the method we used for the footballers. Instead, we will use the pd.concat() function to concatenate DataFrames.

An added complication is that, as they are now, the two DataFrames have the same column heading of 0. We should change that for each before concatenating.

In [20]:
# Change column headings
df_low.columns = ['low']
df_high.columns = ['high']

# Take a look
df_high
Out[20]:
high
0 1683
1 2061
2 1792
3 1852
4 2091
5 1781
6 1912
7 1802
8 1751
9 1731
10 1892
11 1951
12 1809
13 1683
14 1787
15 1840
16 1821
17 1910
18 1930
19 1800
20 1833
21 1683
22 1671
23 1680
24 1692
25 1800
26 1821
27 1882
28 1642
29 1749
30 1712
31 1661
32 1701
33 2141
34 1863
35 1752
36 1740
37 1721
38 1660
39 1930
40 2030
41 1851
42 2131
43 1828
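
As an aside, we could also have done this renaming with the rename() method, passing a dictionary that maps old column names to new ones:

# Equivalent renaming with the rename() method (returns a new DataFrame)
df_low = df_low.rename(columns={0: 'low'})
df_high = df_high.rename(columns={0: 'high'})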

Now, we can concatenate the two DataFrames into one. We just pass a tuple of the DataFrames we want to concatenate as an argument. We specify the kwarg axis=1 to indicate that we want two columns, instead of just appending the second DataFrame at the bottom of the first (as we would get with axis=0).

In [21]:
# Concatenate DataFrames
df = pd.concat((df_low, df_high), axis=1)

# See the result
df
Out[21]:
low high
0 1840 1683
1 2090 2061
2 2169 1792
3 1988 1852
4 2212 2091
5 2339 1781
6 1989 1912
7 2144 1802
8 2290 1751
9 1920 1731
10 2280 1892
11 1809 1951
12 2158 1809
13 1800 1683
14 2133 1787
15 2060 1840
16 2160 1821
17 2001 1910
18 2030 1930
19 2088 1800
20 1951 1833
21 2460 1683
22 2021 1671
23 2010 1680
24 2139 1692
25 2160 1800
26 2106 1821
27 2171 1882
28 2113 1642
29 2179 1749
30 1890 1712
31 2179 1661
32 2021 1701
33 1969 2141
34 2150 1863
35 1900 1752
36 2267 1740
37 1711 1721
38 1901 1660
39 2114 1930
40 2112 2030
41 2361 1851
42 2130 2131
43 2061 1828
44 2121 NaN
45 1832 NaN
46 2210 NaN
47 2130 NaN
48 2153 NaN
49 2009 NaN
50 2100 NaN
51 2252 NaN
52 2143 NaN
53 2252 NaN
54 2222 NaN
55 2121 NaN
56 2409 NaN

Note that the shorter of the two columns was padded with NaN, which means "not a number." This DataFrame structure makes it easier to make plots. For example, you may have wondered why we did not make boxplots from the data before. The reason is that it was too syntactically painful without the elegance of DataFrames. Now we can just directly use the built-in plotting capabilities.

In [22]:
ax = df.plot(kind='box')
plt.ylabel('X-sect. area (µm$^2$)')
ax.xaxis.grid(False)

Here, the red lines are the medians, the tops and bottoms of the boxes are the 75th and 25th percentiles, respectively, and the whiskers extend to the most extreme measurements lying within 1.5 interquartile ranges of the box edges (any points beyond that are plotted individually as outliers).
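
If we want the numbers behind the plot, the describe() method computes summary statistics for each column, ignoring the NaN entries:

# Summary statistics for each column (NaN entries are ignored)
df.describe()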

Outputting a new CSV file

We can now write out a single CSV file with the DataFrame. We use the index kwarg to tell Pandas not to write the row indices to the file.

In [23]:
# Write out DataFrame
df.to_csv('xa_combined.csv', index=False)

Now when we load the data, we get a convenient DataFrame.

In [24]:
# Load DataFrame
df_reloaded = pd.read_csv('xa_combined.csv')

# Take a look
df_reloaded
Out[24]:
low high
0 1840 1683
1 2090 2061
2 2169 1792
3 1988 1852
4 2212 2091
5 2339 1781
6 1989 1912
7 2144 1802
8 2290 1751
9 1920 1731
10 2280 1892
11 1809 1951
12 2158 1809
13 1800 1683
14 2133 1787
15 2060 1840
16 2160 1821
17 2001 1910
18 2030 1930
19 2088 1800
20 1951 1833
21 2460 1683
22 2021 1671
23 2010 1680
24 2139 1692
25 2160 1800
26 2106 1821
27 2171 1882
28 2113 1642
29 2179 1749
30 1890 1712
31 2179 1661
32 2021 1701
33 1969 2141
34 2150 1863
35 1900 1752
36 2267 1740
37 1711 1721
38 1901 1660
39 2114 1930
40 2112 2030
41 2361 1851
42 2130 2131
43 2061 1828
44 2121 NaN
45 1832 NaN
46 2210 NaN
47 2130 NaN
48 2153 NaN
49 2009 NaN
50 2100 NaN
51 2252 NaN
52 2143 NaN
53 2252 NaN
54 2222 NaN
55 2121 NaN
56 2409 NaN

Beeswarm plots

Besides the fact that I hate dashed lines when they are unnecessary, I also believe that you should plot all points when the number of points is below about 100. A good way to do this is a beeswarm plot. We'll use pybeeswarm to do this.

In [35]:
# pybeeswarm takes a list of data sets (one per category) for plotting
list_of_data = [df['low'].dropna(), df['high'].dropna()]
labels = ['low', 'high']

# Generate a beeswarm plot
_, ax = beeswarm.beeswarm(list_of_data, labels=labels)
plt.ylabel('X-sect. area (µm$^2$)')
ax.xaxis.grid(False)

A note on tidy data

Hadley Wickham wrote a great article in favor of "tidy data." Tidy DataFrames follow the rules:

  1. Each variable is a column.
  2. Each observation is a row.
  3. Each type of observation has its own separate DataFrame.

This is less pretty to visualize as a table, but we rarely look at data in tables. In my brief experience since taking Wickham's words to heart, I can say that keeping tidy DataFrames makes life much easier!

A tidy DataFrame of the cross-sectional area data would have two columns, "food density" and "cross-sectional area (sq micron)." Remember, each variable is a column. Each row, then, corresponds to an individual measurement. This results in a lot of repeated entries (we'll have 44 highs), but it is very clean logically and easy to work with. Let's tidy our DataFrame. The pd.melt() function makes this easy.

In [36]:
# Melt the DataFrame into tidy format, dropping the NaN padding
df_tidy = pd.melt(df, var_name='food density', 
                  value_name='cross-sectional area (sq micron)').dropna()
df_tidy
Out[36]:
food density cross-sectional area (sq micron)
0 low 1840
1 low 2090
2 low 2169
3 low 1988
4 low 2212
5 low 2339
6 low 1989
7 low 2144
8 low 2290
9 low 1920
10 low 2280
11 low 1809
12 low 2158
13 low 1800
14 low 2133
15 low 2060
16 low 2160
17 low 2001
18 low 2030
19 low 2088
20 low 1951
21 low 2460
22 low 2021
23 low 2010
24 low 2139
25 low 2160
26 low 2106
27 low 2171
28 low 2113
29 low 2179
... ... ...
71 high 1787
72 high 1840
73 high 1821
74 high 1910
75 high 1930
76 high 1800
77 high 1833
78 high 1683
79 high 1671
80 high 1680
81 high 1692
82 high 1800
83 high 1821
84 high 1882
85 high 1642
86 high 1749
87 high 1712
88 high 1661
89 high 1701
90 high 2141
91 high 1863
92 high 1752
93 high 1740
94 high 1721
95 high 1660
96 high 1930
97 high 2030
98 high 1851
99 high 2131
100 high 1828

101 rows × 2 columns

Saving this DataFrame takes a bit more storage, but it is much clearer. For example, it does not rely on comments to give the units of the measurements.
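
If we want to save the tidy DataFrame (df_tidy above), we can write it out just as before; the output file name here is arbitrary.

# Write out the tidy DataFrame
df_tidy.to_csv('xa_combined_tidy.csv', index=False)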

Not to worry, plotting still works with the tidy format; we just tell the box plot method which column holds the values and which holds the grouping variable.

In [38]:
# Box plot from the tidy DataFrame, grouped by food density
df_tidy.boxplot(column='cross-sectional area (sq micron)', by='food density')
plt.ylabel('X-sect. area (µm$^2$)')
plt.gca().xaxis.grid(False)
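
As a closing note, newer versions of Seaborn (0.6 and later) include a swarmplot() function that makes a beeswarm plot directly from a tidy DataFrame, another nice payoff of tidying. Something like this should work:

# Beeswarm plot directly from the tidy DataFrame (requires Seaborn >= 0.6)
sns.swarmplot(x='food density', y='cross-sectional area (sq micron)', data=df_tidy)
plt.ylabel('X-sect. area (µm$^2$)')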