{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lesson 29: Introduction to Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*This lesson was generated from a Jupyter notebook. You can download the notebook [here](l29_intro_to_pandas.ipynb).*"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# NumPy of course\n",
"import numpy as np\n",
"\n",
"# Pandas, conventionally imported as pd\n",
"import pandas as pd\n",
"\n",
"# This is how we import the module of Matplotlib we'll be using\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Seaborn makes plots pretty!\n",
"import seaborn as sns\n",
"\n",
"# PyBeeswarm for some better data display\n",
"import beeswarm\n",
"\n",
"# Magic function to make matplotlib inline; other style specs must come AFTER\n",
"%matplotlib inline\n",
"\n",
"# This enables SVG graphics inline (only use with static plots (non-Bokeh))\n",
"%config InlineBackend.figure_format = 'svg'\n",
"\n",
"# Set JB's favorite Seaborn settings\n",
"rc={'lines.linewidth': 2, 'axes.labelsize': 18, 'axes.titlesize': 18, \n",
" 'axes.facecolor': 'DFDFE5'}\n",
"sns.set_context('notebook', rc=rc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Throughout your research career, you will undoubtedly need to handle data, possibly lots of data. The data comes in lots of formats, and you will spend much of your time **munging** (or **wrangling**) the data to get it into a usable form. We already did a little bit of munging when we sorted the *C. elegans* egg cross sectional areas to be able to compute cumulative histograms.\n",
"\n",
"Pandas is the primary tool in the SciPy stack for handling data. Its primary object, the `DataFrame` is extremely useful in munging data. We will explore some of that functionality here, and will put it to use in the next lesson."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Munging the egg cross-sectional area data\n",
"It os often useful to use the same data set to learn new things, since you are already familiar with the data. We'll keep using the egg cross-sectional area data set. In this case, we will sue `Pandas` to import the data. We use the very handy (and faaaaar more powerful than `np.loadtxt()`) function `pd.read_csv()`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Read in data files with pandas\n",
"df_high = pd.read_csv('../data/xa_high_food.csv', comment='#')\n",
"df_low = pd.read_csv('../data/xa_low_food.csv', comment='#')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Almost the same syntax as `np.loadtxt()`, but notice that the kwarg is `comment` and not `comments`. Let's now look at what we have."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
1683
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
2061
\n",
"
\n",
"
\n",
"
1
\n",
"
1792
\n",
"
\n",
"
\n",
"
2
\n",
"
1852
\n",
"
\n",
"
\n",
"
3
\n",
"
2091
\n",
"
\n",
"
\n",
"
4
\n",
"
1781
\n",
"
\n",
"
\n",
"
5
\n",
"
1912
\n",
"
\n",
"
\n",
"
6
\n",
"
1802
\n",
"
\n",
"
\n",
"
7
\n",
"
1751
\n",
"
\n",
"
\n",
"
8
\n",
"
1731
\n",
"
\n",
"
\n",
"
9
\n",
"
1892
\n",
"
\n",
"
\n",
"
10
\n",
"
1951
\n",
"
\n",
"
\n",
"
11
\n",
"
1809
\n",
"
\n",
"
\n",
"
12
\n",
"
1683
\n",
"
\n",
"
\n",
"
13
\n",
"
1787
\n",
"
\n",
"
\n",
"
14
\n",
"
1840
\n",
"
\n",
"
\n",
"
15
\n",
"
1821
\n",
"
\n",
"
\n",
"
16
\n",
"
1910
\n",
"
\n",
"
\n",
"
17
\n",
"
1930
\n",
"
\n",
"
\n",
"
18
\n",
"
1800
\n",
"
\n",
"
\n",
"
19
\n",
"
1833
\n",
"
\n",
"
\n",
"
20
\n",
"
1683
\n",
"
\n",
"
\n",
"
21
\n",
"
1671
\n",
"
\n",
"
\n",
"
22
\n",
"
1680
\n",
"
\n",
"
\n",
"
23
\n",
"
1692
\n",
"
\n",
"
\n",
"
24
\n",
"
1800
\n",
"
\n",
"
\n",
"
25
\n",
"
1821
\n",
"
\n",
"
\n",
"
26
\n",
"
1882
\n",
"
\n",
"
\n",
"
27
\n",
"
1642
\n",
"
\n",
"
\n",
"
28
\n",
"
1749
\n",
"
\n",
"
\n",
"
29
\n",
"
1712
\n",
"
\n",
"
\n",
"
30
\n",
"
1661
\n",
"
\n",
"
\n",
"
31
\n",
"
1701
\n",
"
\n",
"
\n",
"
32
\n",
"
2141
\n",
"
\n",
"
\n",
"
33
\n",
"
1863
\n",
"
\n",
"
\n",
"
34
\n",
"
1752
\n",
"
\n",
"
\n",
"
35
\n",
"
1740
\n",
"
\n",
"
\n",
"
36
\n",
"
1721
\n",
"
\n",
"
\n",
"
37
\n",
"
1660
\n",
"
\n",
"
\n",
"
38
\n",
"
1930
\n",
"
\n",
"
\n",
"
39
\n",
"
2030
\n",
"
\n",
"
\n",
"
40
\n",
"
1851
\n",
"
\n",
"
\n",
"
41
\n",
"
2131
\n",
"
\n",
"
\n",
"
42
\n",
"
1828
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 1683\n",
"0 2061\n",
"1 1792\n",
"2 1852\n",
"3 2091\n",
"4 1781\n",
"5 1912\n",
"6 1802\n",
"7 1751\n",
"8 1731\n",
"9 1892\n",
"10 1951\n",
"11 1809\n",
"12 1683\n",
"13 1787\n",
"14 1840\n",
"15 1821\n",
"16 1910\n",
"17 1930\n",
"18 1800\n",
"19 1833\n",
"20 1683\n",
"21 1671\n",
"22 1680\n",
"23 1692\n",
"24 1800\n",
"25 1821\n",
"26 1882\n",
"27 1642\n",
"28 1749\n",
"29 1712\n",
"30 1661\n",
"31 1701\n",
"32 2141\n",
"33 1863\n",
"34 1752\n",
"35 1740\n",
"36 1721\n",
"37 1660\n",
"38 1930\n",
"39 2030\n",
"40 1851\n",
"41 2131\n",
"42 1828"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_high"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whoa! IPython is displaying this is a different way! What kind of data type is this?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df_low)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas has loaded the data in as a `DataFrame`. As I mentioned before, this is the central object for handling data.\n",
"\n",
"We see that we have a bold column heading (**`1683`**) and bold row indices. Pandas interpreted the first non-comment line as a label for a column. We need to tell it not to do that using the `header` kwarg. Let's re-load the data."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
0
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
1683
\n",
"
\n",
"
\n",
"
1
\n",
"
2061
\n",
"
\n",
"
\n",
"
2
\n",
"
1792
\n",
"
\n",
"
\n",
"
3
\n",
"
1852
\n",
"
\n",
"
\n",
"
4
\n",
"
2091
\n",
"
\n",
"
\n",
"
5
\n",
"
1781
\n",
"
\n",
"
\n",
"
6
\n",
"
1912
\n",
"
\n",
"
\n",
"
7
\n",
"
1802
\n",
"
\n",
"
\n",
"
8
\n",
"
1751
\n",
"
\n",
"
\n",
"
9
\n",
"
1731
\n",
"
\n",
"
\n",
"
10
\n",
"
1892
\n",
"
\n",
"
\n",
"
11
\n",
"
1951
\n",
"
\n",
"
\n",
"
12
\n",
"
1809
\n",
"
\n",
"
\n",
"
13
\n",
"
1683
\n",
"
\n",
"
\n",
"
14
\n",
"
1787
\n",
"
\n",
"
\n",
"
15
\n",
"
1840
\n",
"
\n",
"
\n",
"
16
\n",
"
1821
\n",
"
\n",
"
\n",
"
17
\n",
"
1910
\n",
"
\n",
"
\n",
"
18
\n",
"
1930
\n",
"
\n",
"
\n",
"
19
\n",
"
1800
\n",
"
\n",
"
\n",
"
20
\n",
"
1833
\n",
"
\n",
"
\n",
"
21
\n",
"
1683
\n",
"
\n",
"
\n",
"
22
\n",
"
1671
\n",
"
\n",
"
\n",
"
23
\n",
"
1680
\n",
"
\n",
"
\n",
"
24
\n",
"
1692
\n",
"
\n",
"
\n",
"
25
\n",
"
1800
\n",
"
\n",
"
\n",
"
26
\n",
"
1821
\n",
"
\n",
"
\n",
"
27
\n",
"
1882
\n",
"
\n",
"
\n",
"
28
\n",
"
1642
\n",
"
\n",
"
\n",
"
29
\n",
"
1749
\n",
"
\n",
"
\n",
"
30
\n",
"
1712
\n",
"
\n",
"
\n",
"
31
\n",
"
1661
\n",
"
\n",
"
\n",
"
32
\n",
"
1701
\n",
"
\n",
"
\n",
"
33
\n",
"
2141
\n",
"
\n",
"
\n",
"
34
\n",
"
1863
\n",
"
\n",
"
\n",
"
35
\n",
"
1752
\n",
"
\n",
"
\n",
"
36
\n",
"
1740
\n",
"
\n",
"
\n",
"
37
\n",
"
1721
\n",
"
\n",
"
\n",
"
38
\n",
"
1660
\n",
"
\n",
"
\n",
"
39
\n",
"
1930
\n",
"
\n",
"
\n",
"
40
\n",
"
2030
\n",
"
\n",
"
\n",
"
41
\n",
"
1851
\n",
"
\n",
"
\n",
"
42
\n",
"
2131
\n",
"
\n",
"
\n",
"
43
\n",
"
1828
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0\n",
"0 1683\n",
"1 2061\n",
"2 1792\n",
"3 1852\n",
"4 2091\n",
"5 1781\n",
"6 1912\n",
"7 1802\n",
"8 1751\n",
"9 1731\n",
"10 1892\n",
"11 1951\n",
"12 1809\n",
"13 1683\n",
"14 1787\n",
"15 1840\n",
"16 1821\n",
"17 1910\n",
"18 1930\n",
"19 1800\n",
"20 1833\n",
"21 1683\n",
"22 1671\n",
"23 1680\n",
"24 1692\n",
"25 1800\n",
"26 1821\n",
"27 1882\n",
"28 1642\n",
"29 1749\n",
"30 1712\n",
"31 1661\n",
"32 1701\n",
"33 2141\n",
"34 1863\n",
"35 1752\n",
"36 1740\n",
"37 1721\n",
"38 1660\n",
"39 1930\n",
"40 2030\n",
"41 1851\n",
"42 2131\n",
"43 1828"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Read in data files with pandas with no row headings.\n",
"df_high = pd.read_csv('../data/xa_high_food.csv', comment='#', header=None)\n",
"df_low = pd.read_csv('../data/xa_low_food.csv', comment='#', header=None)\n",
"\n",
"# Now look at it\n",
"df_high"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that Pandas has assigned a column heading of `0` to the column of data. What happens if we index it?"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 1683\n",
"1 2061\n",
"2 1792\n",
"3 1852\n",
"4 2091\n",
"5 1781\n",
"6 1912\n",
"7 1802\n",
"8 1751\n",
"9 1731\n",
"10 1892\n",
"11 1951\n",
"12 1809\n",
"13 1683\n",
"14 1787\n",
"15 1840\n",
"16 1821\n",
"17 1910\n",
"18 1930\n",
"19 1800\n",
"20 1833\n",
"21 1683\n",
"22 1671\n",
"23 1680\n",
"24 1692\n",
"25 1800\n",
"26 1821\n",
"27 1882\n",
"28 1642\n",
"29 1749\n",
"30 1712\n",
"31 1661\n",
"32 1701\n",
"33 2141\n",
"34 1863\n",
"35 1752\n",
"36 1740\n",
"37 1721\n",
"38 1660\n",
"39 1930\n",
"40 2030\n",
"41 1851\n",
"42 2131\n",
"43 1828\n",
"Name: 0, dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_high[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whoa again! What is this?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df_high[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Pandas `Series` is basically a one-dimensional `DataFrame`. When we index it, we get what we might expect."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1683"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_high[0][0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### An aside: Football and learning what Pandas Series and DataFrames are"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can think of a Pandas `Series` as a generalized NumPy array. NumPy arrays are indexed with integers, but Pandas `Series` may be indexed with anything. As an example, we'll create a `Series` from a dictionary."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Fontaine 13\n",
"Klinsmann 11\n",
"Klose 16\n",
"Koscis 11\n",
"Müller 14\n",
"Pelé 12\n",
"Ronaldo 15\n",
"dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Dictionary of top men's World Cup scorers and how many goals\n",
"wc_dict = {'Klose': 16,\n",
" 'Ronaldo': 15,\n",
" 'Müller': 14,\n",
" 'Fontaine': 13,\n",
" 'Pelé': 12,\n",
" 'Koscis': 11,\n",
" 'Klinsmann': 11}\n",
"\n",
"# Create a Series from the dictionary\n",
"s_goals = pd.Series(wc_dict)\n",
"\n",
"# Take a look\n",
"s_goals"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll notice that the indexing now works like the dictionary that created it."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"16"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s_goals['Klose']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, what if we wanted to add more data to this, such as the country that each player represented? We can make another `Series`."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Fontaine France\n",
"Klinsmann Germany\n",
"Klose Germany\n",
"Koscis Hungary\n",
"Müller Germany\n",
"Pelé Brazil\n",
"Ronaldo Brazil\n",
"dtype: object"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Dictionary of nations\n",
"nation_dict = {'Klose': 'Germany',\n",
" 'Ronaldo': 'Brazil',\n",
" 'Müller': 'Germany',\n",
" 'Fontaine': 'France',\n",
" 'Pelé': 'Brazil',\n",
" 'Koscis': 'Hungary',\n",
" 'Klinsmann': 'Germany'}\n",
"\n",
"# Series with nations\n",
"s_nation = pd.Series(nation_dict)\n",
"\n",
"# Look at it\n",
"s_nation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can combine these into a `DataFrame`. We use `pd.DataFrame()` to instantiate a `DataFrame`, passing in a dictionary whose keys are the column headers and values are the series we're building into a `DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
goals
\n",
"
nation
\n",
"
\n",
" \n",
" \n",
"
\n",
"
Fontaine
\n",
"
13
\n",
"
France
\n",
"
\n",
"
\n",
"
Klinsmann
\n",
"
11
\n",
"
Germany
\n",
"
\n",
"
\n",
"
Klose
\n",
"
16
\n",
"
Germany
\n",
"
\n",
"
\n",
"
Koscis
\n",
"
11
\n",
"
Hungary
\n",
"
\n",
"
\n",
"
Müller
\n",
"
14
\n",
"
Germany
\n",
"
\n",
"
\n",
"
Pelé
\n",
"
12
\n",
"
Brazil
\n",
"
\n",
"
\n",
"
Ronaldo
\n",
"
15
\n",
"
Brazil
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" goals nation\n",
"Fontaine 13 France\n",
"Klinsmann 11 Germany\n",
"Klose 16 Germany\n",
"Koscis 11 Hungary\n",
"Müller 14 Germany\n",
"Pelé 12 Brazil\n",
"Ronaldo 15 Brazil"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Combine into a DataFrame\n",
"df_wc = pd.DataFrame({'nation': s_nation, 'goals': s_goals})\n",
"\n",
"# Take a look\n",
"df_wc"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice now that the `DataFrame` is indexed by player names. The column headings are goals and nation. When using the bracket notation, we cannot directly use the indices, only the column headings."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"ename": "KeyError",
"evalue": "'Fontaine'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf_wc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'Fontaine'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/Users/Justin/anaconda/lib/python3.4/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 1795\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1796\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1797\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_column\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1798\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1799\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_getitem_column\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/Justin/anaconda/lib/python3.4/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m_getitem_column\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 1802\u001b[0m \u001b[0;31m# get column\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1803\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_unique\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1804\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_get_item_cache\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1805\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1806\u001b[0m \u001b[0;31m# duplicate columns & possible reduce dimensionaility\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/Justin/anaconda/lib/python3.4/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36m_get_item_cache\u001b[0;34m(self, item)\u001b[0m\n\u001b[1;32m 1082\u001b[0m \u001b[0mres\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcache\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1083\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mres\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1084\u001b[0;31m \u001b[0mvalues\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_data\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1085\u001b[0m \u001b[0mres\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_box_item_values\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalues\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1086\u001b[0m \u001b[0mcache\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mres\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/Justin/anaconda/lib/python3.4/site-packages/pandas/core/internals.py\u001b[0m in \u001b[0;36mget\u001b[0;34m(self, item, fastpath)\u001b[0m\n\u001b[1;32m 2849\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2850\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0misnull\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2851\u001b[0;31m \u001b[0mloc\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2852\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2853\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0misnull\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/Justin/anaconda/lib/python3.4/site-packages/pandas/core/index.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method)\u001b[0m\n\u001b[1;32m 1570\u001b[0m \"\"\"\n\u001b[1;32m 1571\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mmethod\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1572\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_values_from_object\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1573\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1574\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_indexer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmethod\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mmethod\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32mpandas/index.pyx\u001b[0m in \u001b[0;36mpandas.index.IndexEngine.get_loc (pandas/index.c:3824)\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/index.pyx\u001b[0m in \u001b[0;36mpandas.index.IndexEngine.get_loc (pandas/index.c:3704)\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/hashtable.pyx\u001b[0m in \u001b[0;36mpandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12280)\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/hashtable.pyx\u001b[0m in \u001b[0;36mpandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12231)\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: 'Fontaine'"
]
}
],
"source": [
"df_wc['Fontaine']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But we can index by columns."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Fontaine 13\n",
"Klinsmann 11\n",
"Klose 16\n",
"Koscis 11\n",
"Müller 14\n",
"Pelé 12\n",
"Ronaldo 15\n",
"Name: goals, dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_wc['goals']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we just just want the goals and nation of `Fontaine`, we would use the `.ix` method."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"goals 13\n",
"nation France\n",
"Name: Fontaine, dtype: object"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_wc.ix['Fontaine']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can only look at German players, for instance using similar Boolean indexing as with NumPy arrays."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
goals
\n",
"
nation
\n",
"
\n",
" \n",
" \n",
"
\n",
"
Klinsmann
\n",
"
11
\n",
"
Germany
\n",
"
\n",
"
\n",
"
Klose
\n",
"
16
\n",
"
Germany
\n",
"
\n",
"
\n",
"
Müller
\n",
"
14
\n",
"
Germany
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" goals nation\n",
"Klinsmann 11 Germany\n",
"Klose 16 Germany\n",
"Müller 14 Germany"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_wc[df_wc['nation'] == 'Germany']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a final demonstration of the supreme power of `DataFrames`, they come with plotting methods! We can make a bar chart of the goal scorers, first sorting them in order of number of goals scored."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Make a bar chart of goals\n",
"ax = df_wc.sort('goals').plot(y='goals', kind='bar', legend=False)\n",
"plt.ylabel('goals')\n",
"ax.xaxis.grid(False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Combining the cross-sectional area data into a `DataFrame`\n",
"Now, back to our cross-sectional area data. We can combine these data into a `DataFrame` as well, *even though they have differing numbers of data points*. However, because that `df_low` and `df_high` are not `Series`, we cannot just use the method we used for the footballers. Instead, we will use the `pd.concat()` function to concatenate `DataFrame`s.\n",
"\n",
"An added complication is that, as they are now, the two `DataFrames` have the same column heading of `0`. We should change that for each before concatenating."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
high
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
1683
\n",
"
\n",
"
\n",
"
1
\n",
"
2061
\n",
"
\n",
"
\n",
"
2
\n",
"
1792
\n",
"
\n",
"
\n",
"
3
\n",
"
1852
\n",
"
\n",
"
\n",
"
4
\n",
"
2091
\n",
"
\n",
"
\n",
"
5
\n",
"
1781
\n",
"
\n",
"
\n",
"
6
\n",
"
1912
\n",
"
\n",
"
\n",
"
7
\n",
"
1802
\n",
"
\n",
"
\n",
"
8
\n",
"
1751
\n",
"
\n",
"
\n",
"
9
\n",
"
1731
\n",
"
\n",
"
\n",
"
10
\n",
"
1892
\n",
"
\n",
"
\n",
"
11
\n",
"
1951
\n",
"
\n",
"
\n",
"
12
\n",
"
1809
\n",
"
\n",
"
\n",
"
13
\n",
"
1683
\n",
"
\n",
"
\n",
"
14
\n",
"
1787
\n",
"
\n",
"
\n",
"
15
\n",
"
1840
\n",
"
\n",
"
\n",
"
16
\n",
"
1821
\n",
"
\n",
"
\n",
"
17
\n",
"
1910
\n",
"
\n",
"
\n",
"
18
\n",
"
1930
\n",
"
\n",
"
\n",
"
19
\n",
"
1800
\n",
"
\n",
"
\n",
"
20
\n",
"
1833
\n",
"
\n",
"
\n",
"
21
\n",
"
1683
\n",
"
\n",
"
\n",
"
22
\n",
"
1671
\n",
"
\n",
"
\n",
"
23
\n",
"
1680
\n",
"
\n",
"
\n",
"
24
\n",
"
1692
\n",
"
\n",
"
\n",
"
25
\n",
"
1800
\n",
"
\n",
"
\n",
"
26
\n",
"
1821
\n",
"
\n",
"
\n",
"
27
\n",
"
1882
\n",
"
\n",
"
\n",
"
28
\n",
"
1642
\n",
"
\n",
"
\n",
"
29
\n",
"
1749
\n",
"
\n",
"
\n",
"
30
\n",
"
1712
\n",
"
\n",
"
\n",
"
31
\n",
"
1661
\n",
"
\n",
"
\n",
"
32
\n",
"
1701
\n",
"
\n",
"
\n",
"
33
\n",
"
2141
\n",
"
\n",
"
\n",
"
34
\n",
"
1863
\n",
"
\n",
"
\n",
"
35
\n",
"
1752
\n",
"
\n",
"
\n",
"
36
\n",
"
1740
\n",
"
\n",
"
\n",
"
37
\n",
"
1721
\n",
"
\n",
"
\n",
"
38
\n",
"
1660
\n",
"
\n",
"
\n",
"
39
\n",
"
1930
\n",
"
\n",
"
\n",
"
40
\n",
"
2030
\n",
"
\n",
"
\n",
"
41
\n",
"
1851
\n",
"
\n",
"
\n",
"
42
\n",
"
2131
\n",
"
\n",
"
\n",
"
43
\n",
"
1828
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" high\n",
"0 1683\n",
"1 2061\n",
"2 1792\n",
"3 1852\n",
"4 2091\n",
"5 1781\n",
"6 1912\n",
"7 1802\n",
"8 1751\n",
"9 1731\n",
"10 1892\n",
"11 1951\n",
"12 1809\n",
"13 1683\n",
"14 1787\n",
"15 1840\n",
"16 1821\n",
"17 1910\n",
"18 1930\n",
"19 1800\n",
"20 1833\n",
"21 1683\n",
"22 1671\n",
"23 1680\n",
"24 1692\n",
"25 1800\n",
"26 1821\n",
"27 1882\n",
"28 1642\n",
"29 1749\n",
"30 1712\n",
"31 1661\n",
"32 1701\n",
"33 2141\n",
"34 1863\n",
"35 1752\n",
"36 1740\n",
"37 1721\n",
"38 1660\n",
"39 1930\n",
"40 2030\n",
"41 1851\n",
"42 2131\n",
"43 1828"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Change column headings\n",
"df_low.columns = ['low']\n",
"df_high.columns = ['high']\n",
"\n",
"# Take a look\n",
"df_high"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can concatenate the two `DataFrame`s into one. We just pass a tuple with the `DataFrame`s we want to concatenate as an argument. We specify the kwarg `axis = 1` to indincate that we want to have two columns, instead of just appending the second `DataFrame` at the bottom of the first (as we would get with `axis = 0`)."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
low
\n",
"
high
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
1840
\n",
"
1683
\n",
"
\n",
"
\n",
"
1
\n",
"
2090
\n",
"
2061
\n",
"
\n",
"
\n",
"
2
\n",
"
2169
\n",
"
1792
\n",
"
\n",
"
\n",
"
3
\n",
"
1988
\n",
"
1852
\n",
"
\n",
"
\n",
"
4
\n",
"
2212
\n",
"
2091
\n",
"
\n",
"
\n",
"
5
\n",
"
2339
\n",
"
1781
\n",
"
\n",
"
\n",
"
6
\n",
"
1989
\n",
"
1912
\n",
"
\n",
"
\n",
"
7
\n",
"
2144
\n",
"
1802
\n",
"
\n",
"
\n",
"
8
\n",
"
2290
\n",
"
1751
\n",
"
\n",
"
\n",
"
9
\n",
"
1920
\n",
"
1731
\n",
"
\n",
"
\n",
"
10
\n",
"
2280
\n",
"
1892
\n",
"
\n",
"
\n",
"
11
\n",
"
1809
\n",
"
1951
\n",
"
\n",
"
\n",
"
12
\n",
"
2158
\n",
"
1809
\n",
"
\n",
"
\n",
"
13
\n",
"
1800
\n",
"
1683
\n",
"
\n",
"
\n",
"
14
\n",
"
2133
\n",
"
1787
\n",
"
\n",
"
\n",
"
15
\n",
"
2060
\n",
"
1840
\n",
"
\n",
"
\n",
"
16
\n",
"
2160
\n",
"
1821
\n",
"
\n",
"
\n",
"
17
\n",
"
2001
\n",
"
1910
\n",
"
\n",
"
\n",
"
18
\n",
"
2030
\n",
"
1930
\n",
"
\n",
"
\n",
"
19
\n",
"
2088
\n",
"
1800
\n",
"
\n",
"
\n",
"
20
\n",
"
1951
\n",
"
1833
\n",
"
\n",
"
\n",
"
21
\n",
"
2460
\n",
"
1683
\n",
"
\n",
"
\n",
"
22
\n",
"
2021
\n",
"
1671
\n",
"
\n",
"
\n",
"
23
\n",
"
2010
\n",
"
1680
\n",
"
\n",
"
\n",
"
24
\n",
"
2139
\n",
"
1692
\n",
"
\n",
"
\n",
"
25
\n",
"
2160
\n",
"
1800
\n",
"
\n",
"
\n",
"
26
\n",
"
2106
\n",
"
1821
\n",
"
\n",
"
\n",
"
27
\n",
"
2171
\n",
"
1882
\n",
"
\n",
"
\n",
"
28
\n",
"
2113
\n",
"
1642
\n",
"
\n",
"
\n",
"
29
\n",
"
2179
\n",
"
1749
\n",
"
\n",
"
\n",
"
30
\n",
"
1890
\n",
"
1712
\n",
"
\n",
"
\n",
"
31
\n",
"
2179
\n",
"
1661
\n",
"
\n",
"
\n",
"
32
\n",
"
2021
\n",
"
1701
\n",
"
\n",
"
\n",
"
33
\n",
"
1969
\n",
"
2141
\n",
"
\n",
"
\n",
"
34
\n",
"
2150
\n",
"
1863
\n",
"
\n",
"
\n",
"
35
\n",
"
1900
\n",
"
1752
\n",
"
\n",
"
\n",
"
36
\n",
"
2267
\n",
"
1740
\n",
"
\n",
"
\n",
"
37
\n",
"
1711
\n",
"
1721
\n",
"
\n",
"
\n",
"
38
\n",
"
1901
\n",
"
1660
\n",
"
\n",
"
\n",
"
39
\n",
"
2114
\n",
"
1930
\n",
"
\n",
"
\n",
"
40
\n",
"
2112
\n",
"
2030
\n",
"
\n",
"
\n",
"
41
\n",
"
2361
\n",
"
1851
\n",
"
\n",
"
\n",
"
42
\n",
"
2130
\n",
"
2131
\n",
"
\n",
"
\n",
"
43
\n",
"
2061
\n",
"
1828
\n",
"
\n",
"
\n",
"
44
\n",
"
2121
\n",
"
NaN
\n",
"
\n",
"
\n",
"
45
\n",
"
1832
\n",
"
NaN
\n",
"
\n",
"
\n",
"
46
\n",
"
2210
\n",
"
NaN
\n",
"
\n",
"
\n",
"
47
\n",
"
2130
\n",
"
NaN
\n",
"
\n",
"
\n",
"
48
\n",
"
2153
\n",
"
NaN
\n",
"
\n",
"
\n",
"
49
\n",
"
2009
\n",
"
NaN
\n",
"
\n",
"
\n",
"
50
\n",
"
2100
\n",
"
NaN
\n",
"
\n",
"
\n",
"
51
\n",
"
2252
\n",
"
NaN
\n",
"
\n",
"
\n",
"
52
\n",
"
2143
\n",
"
NaN
\n",
"
\n",
"
\n",
"
53
\n",
"
2252
\n",
"
NaN
\n",
"
\n",
"
\n",
"
54
\n",
"
2222
\n",
"
NaN
\n",
"
\n",
"
\n",
"
55
\n",
"
2121
\n",
"
NaN
\n",
"
\n",
"
\n",
"
56
\n",
"
2409
\n",
"
NaN
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" low high\n",
"0 1840 1683\n",
"1 2090 2061\n",
"2 2169 1792\n",
"3 1988 1852\n",
"4 2212 2091\n",
"5 2339 1781\n",
"6 1989 1912\n",
"7 2144 1802\n",
"8 2290 1751\n",
"9 1920 1731\n",
"10 2280 1892\n",
"11 1809 1951\n",
"12 2158 1809\n",
"13 1800 1683\n",
"14 2133 1787\n",
"15 2060 1840\n",
"16 2160 1821\n",
"17 2001 1910\n",
"18 2030 1930\n",
"19 2088 1800\n",
"20 1951 1833\n",
"21 2460 1683\n",
"22 2021 1671\n",
"23 2010 1680\n",
"24 2139 1692\n",
"25 2160 1800\n",
"26 2106 1821\n",
"27 2171 1882\n",
"28 2113 1642\n",
"29 2179 1749\n",
"30 1890 1712\n",
"31 2179 1661\n",
"32 2021 1701\n",
"33 1969 2141\n",
"34 2150 1863\n",
"35 1900 1752\n",
"36 2267 1740\n",
"37 1711 1721\n",
"38 1901 1660\n",
"39 2114 1930\n",
"40 2112 2030\n",
"41 2361 1851\n",
"42 2130 2131\n",
"43 2061 1828\n",
"44 2121 NaN\n",
"45 1832 NaN\n",
"46 2210 NaN\n",
"47 2130 NaN\n",
"48 2153 NaN\n",
"49 2009 NaN\n",
"50 2100 NaN\n",
"51 2252 NaN\n",
"52 2143 NaN\n",
"53 2252 NaN\n",
"54 2222 NaN\n",
"55 2121 NaN\n",
"56 2409 NaN"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Concatenate DataFrames\n",
"df = pd.concat((df_low, df_high), axis=1)\n",
"\n",
"# See the result\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the shorter of the two columns was filled with `NaN`, which means \"not a number.\" This `DataFrame` structure makes it easier to make plots. For example, you may have wondered why we did not make boxplots from the data before. The reason was because it was too syntactically painful without the elegance of `DataFrame`s. We can just directly use the built-in plotting capabilities."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"ax = df.plot(kind='box')\n",
"plt.ylabel('X-sect. area (µm$^2$)')\n",
"ax.xaxis.grid(False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, the red lines are the median, the top and bottom of the box are the 75th and 25th percentile, respectively, and the whiskers cover the total extent of the measurements."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Outputting a new CSV file\n",
"We can now write out a single CSV file with the `DataFrame`. We use the `index` kwarg to ask Pandas now to explicity write the indices to the file."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Write out DataFrame\n",
"df.to_csv('xa_combined.csv', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now when we load the data, we get a convenient `DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
low
\n",
"
high
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
1840
\n",
"
1683
\n",
"
\n",
"
\n",
"
1
\n",
"
2090
\n",
"
2061
\n",
"
\n",
"
\n",
"
2
\n",
"
2169
\n",
"
1792
\n",
"
\n",
"
\n",
"
3
\n",
"
1988
\n",
"
1852
\n",
"
\n",
"
\n",
"
4
\n",
"
2212
\n",
"
2091
\n",
"
\n",
"
\n",
"
5
\n",
"
2339
\n",
"
1781
\n",
"
\n",
"
\n",
"
6
\n",
"
1989
\n",
"
1912
\n",
"
\n",
"
\n",
"
7
\n",
"
2144
\n",
"
1802
\n",
"
\n",
"
\n",
"
8
\n",
"
2290
\n",
"
1751
\n",
"
\n",
"
\n",
"
9
\n",
"
1920
\n",
"
1731
\n",
"
\n",
"
\n",
"
10
\n",
"
2280
\n",
"
1892
\n",
"
\n",
"
\n",
"
11
\n",
"
1809
\n",
"
1951
\n",
"
\n",
"
\n",
"
12
\n",
"
2158
\n",
"
1809
\n",
"
\n",
"
\n",
"
13
\n",
"
1800
\n",
"
1683
\n",
"
\n",
"
\n",
"
14
\n",
"
2133
\n",
"
1787
\n",
"
\n",
"
\n",
"
15
\n",
"
2060
\n",
"
1840
\n",
"
\n",
"
\n",
"
16
\n",
"
2160
\n",
"
1821
\n",
"
\n",
"
\n",
"
17
\n",
"
2001
\n",
"
1910
\n",
"
\n",
"
\n",
"
18
\n",
"
2030
\n",
"
1930
\n",
"
\n",
"
\n",
"
19
\n",
"
2088
\n",
"
1800
\n",
"
\n",
"
\n",
"
20
\n",
"
1951
\n",
"
1833
\n",
"
\n",
"
\n",
"
21
\n",
"
2460
\n",
"
1683
\n",
"
\n",
"
\n",
"
22
\n",
"
2021
\n",
"
1671
\n",
"
\n",
"
\n",
"
23
\n",
"
2010
\n",
"
1680
\n",
"
\n",
"
\n",
"
24
\n",
"
2139
\n",
"
1692
\n",
"
\n",
"
\n",
"
25
\n",
"
2160
\n",
"
1800
\n",
"
\n",
"
\n",
"
26
\n",
"
2106
\n",
"
1821
\n",
"
\n",
"
\n",
"
27
\n",
"
2171
\n",
"
1882
\n",
"
\n",
"
\n",
"
28
\n",
"
2113
\n",
"
1642
\n",
"
\n",
"
\n",
"
29
\n",
"
2179
\n",
"
1749
\n",
"
\n",
"
\n",
"
30
\n",
"
1890
\n",
"
1712
\n",
"
\n",
"
\n",
"
31
\n",
"
2179
\n",
"
1661
\n",
"
\n",
"
\n",
"
32
\n",
"
2021
\n",
"
1701
\n",
"
\n",
"
\n",
"
33
\n",
"
1969
\n",
"
2141
\n",
"
\n",
"
\n",
"
34
\n",
"
2150
\n",
"
1863
\n",
"
\n",
"
\n",
"
35
\n",
"
1900
\n",
"
1752
\n",
"
\n",
"
\n",
"
36
\n",
"
2267
\n",
"
1740
\n",
"
\n",
"
\n",
"
37
\n",
"
1711
\n",
"
1721
\n",
"
\n",
"
\n",
"
38
\n",
"
1901
\n",
"
1660
\n",
"
\n",
"
\n",
"
39
\n",
"
2114
\n",
"
1930
\n",
"
\n",
"
\n",
"
40
\n",
"
2112
\n",
"
2030
\n",
"
\n",
"
\n",
"
41
\n",
"
2361
\n",
"
1851
\n",
"
\n",
"
\n",
"
42
\n",
"
2130
\n",
"
2131
\n",
"
\n",
"
\n",
"
43
\n",
"
2061
\n",
"
1828
\n",
"
\n",
"
\n",
"
44
\n",
"
2121
\n",
"
NaN
\n",
"
\n",
"
\n",
"
45
\n",
"
1832
\n",
"
NaN
\n",
"
\n",
"
\n",
"
46
\n",
"
2210
\n",
"
NaN
\n",
"
\n",
"
\n",
"
47
\n",
"
2130
\n",
"
NaN
\n",
"
\n",
"
\n",
"
48
\n",
"
2153
\n",
"
NaN
\n",
"
\n",
"
\n",
"
49
\n",
"
2009
\n",
"
NaN
\n",
"
\n",
"
\n",
"
50
\n",
"
2100
\n",
"
NaN
\n",
"
\n",
"
\n",
"
51
\n",
"
2252
\n",
"
NaN
\n",
"
\n",
"
\n",
"
52
\n",
"
2143
\n",
"
NaN
\n",
"
\n",
"
\n",
"
53
\n",
"
2252
\n",
"
NaN
\n",
"
\n",
"
\n",
"
54
\n",
"
2222
\n",
"
NaN
\n",
"
\n",
"
\n",
"
55
\n",
"
2121
\n",
"
NaN
\n",
"
\n",
"
\n",
"
56
\n",
"
2409
\n",
"
NaN
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" low high\n",
"0 1840 1683\n",
"1 2090 2061\n",
"2 2169 1792\n",
"3 1988 1852\n",
"4 2212 2091\n",
"5 2339 1781\n",
"6 1989 1912\n",
"7 2144 1802\n",
"8 2290 1751\n",
"9 1920 1731\n",
"10 2280 1892\n",
"11 1809 1951\n",
"12 2158 1809\n",
"13 1800 1683\n",
"14 2133 1787\n",
"15 2060 1840\n",
"16 2160 1821\n",
"17 2001 1910\n",
"18 2030 1930\n",
"19 2088 1800\n",
"20 1951 1833\n",
"21 2460 1683\n",
"22 2021 1671\n",
"23 2010 1680\n",
"24 2139 1692\n",
"25 2160 1800\n",
"26 2106 1821\n",
"27 2171 1882\n",
"28 2113 1642\n",
"29 2179 1749\n",
"30 1890 1712\n",
"31 2179 1661\n",
"32 2021 1701\n",
"33 1969 2141\n",
"34 2150 1863\n",
"35 1900 1752\n",
"36 2267 1740\n",
"37 1711 1721\n",
"38 1901 1660\n",
"39 2114 1930\n",
"40 2112 2030\n",
"41 2361 1851\n",
"42 2130 2131\n",
"43 2061 1828\n",
"44 2121 NaN\n",
"45 1832 NaN\n",
"46 2210 NaN\n",
"47 2130 NaN\n",
"48 2153 NaN\n",
"49 2009 NaN\n",
"50 2100 NaN\n",
"51 2252 NaN\n",
"52 2143 NaN\n",
"53 2252 NaN\n",
"54 2222 NaN\n",
"55 2121 NaN\n",
"56 2409 NaN"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load DataFrame\n",
"df_reloaded = pd.read_csv('xa_combined.csv')\n",
"\n",
"# Take a look\n",
"df_reloaded"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Beeswarm plots\n",
"Besides the fact that I hate dashed lines when they are unnecessary, I also believe that you should plot all points when the number of points is below about 100. A good way to do this is a **beeswarm plot**. We'll use `pybeeswarm` to do this."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# pybeeswarm needs list of NumPy arrays for plotting\n",
"list_of_data = [df['low'].dropna(), df['high'].dropna()]\n",
"labels = ['low', 'high']\n",
"\n",
"# Generate a beeswarm plot\n",
"_, ax = beeswarm.beeswarm(list_of_data, labels=labels)\n",
"plt.ylabel('X-sect. area (µm$^2$)')\n",
"ax.xaxis.grid(False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A note on tidy data\n",
"Hadley Wickham wrote a [great article](http://www.jstatsoft.org/v59/i10/paper) in favor of \"tidy data.\" Tidy `DataFrame`s follow the rules:\n",
"1. Each variable is a column.\n",
"2. Each observation is a row.\n",
"3. Each type of observation has its own separate `DataFrame`.\n",
"\n",
"This is less pretty to visualize as a table, but we rarely look at data in tables. In my brief experience since taking Wickham's words to heart, I can say that keeping tidy `DataFrame`s makes life much easier!\n",
"\n",
"A tidy `DataFrame` of the cross-sectional area data would have two columns, \"food density\" and \"cross-sectional area (sq micron).\" Remember, each variable is a column. Each row, then corresponds to an individual measurement. This results in a lot of repeated entries (we'll have 44 `high`s), but it is very clean logically and easy to work with. Let's tidy our `DataFrame`. The `pd.melt()` function makes this easy."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
food density
\n",
"
cross-sectional area (sq micron)
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
low
\n",
"
1840
\n",
"
\n",
"
\n",
"
1
\n",
"
low
\n",
"
2090
\n",
"
\n",
"
\n",
"
2
\n",
"
low
\n",
"
2169
\n",
"
\n",
"
\n",
"
3
\n",
"
low
\n",
"
1988
\n",
"
\n",
"
\n",
"
4
\n",
"
low
\n",
"
2212
\n",
"
\n",
"
\n",
"
5
\n",
"
low
\n",
"
2339
\n",
"
\n",
"
\n",
"
6
\n",
"
low
\n",
"
1989
\n",
"
\n",
"
\n",
"
7
\n",
"
low
\n",
"
2144
\n",
"
\n",
"
\n",
"
8
\n",
"
low
\n",
"
2290
\n",
"
\n",
"
\n",
"
9
\n",
"
low
\n",
"
1920
\n",
"
\n",
"
\n",
"
10
\n",
"
low
\n",
"
2280
\n",
"
\n",
"
\n",
"
11
\n",
"
low
\n",
"
1809
\n",
"
\n",
"
\n",
"
12
\n",
"
low
\n",
"
2158
\n",
"
\n",
"
\n",
"
13
\n",
"
low
\n",
"
1800
\n",
"
\n",
"
\n",
"
14
\n",
"
low
\n",
"
2133
\n",
"
\n",
"
\n",
"
15
\n",
"
low
\n",
"
2060
\n",
"
\n",
"
\n",
"
16
\n",
"
low
\n",
"
2160
\n",
"
\n",
"
\n",
"
17
\n",
"
low
\n",
"
2001
\n",
"
\n",
"
\n",
"
18
\n",
"
low
\n",
"
2030
\n",
"
\n",
"
\n",
"
19
\n",
"
low
\n",
"
2088
\n",
"
\n",
"
\n",
"
20
\n",
"
low
\n",
"
1951
\n",
"
\n",
"
\n",
"
21
\n",
"
low
\n",
"
2460
\n",
"
\n",
"
\n",
"
22
\n",
"
low
\n",
"
2021
\n",
"
\n",
"
\n",
"
23
\n",
"
low
\n",
"
2010
\n",
"
\n",
"
\n",
"
24
\n",
"
low
\n",
"
2139
\n",
"
\n",
"
\n",
"
25
\n",
"
low
\n",
"
2160
\n",
"
\n",
"
\n",
"
26
\n",
"
low
\n",
"
2106
\n",
"
\n",
"
\n",
"
27
\n",
"
low
\n",
"
2171
\n",
"
\n",
"
\n",
"
28
\n",
"
low
\n",
"
2113
\n",
"
\n",
"
\n",
"
29
\n",
"
low
\n",
"
2179
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
71
\n",
"
high
\n",
"
1787
\n",
"
\n",
"
\n",
"
72
\n",
"
high
\n",
"
1840
\n",
"
\n",
"
\n",
"
73
\n",
"
high
\n",
"
1821
\n",
"
\n",
"
\n",
"
74
\n",
"
high
\n",
"
1910
\n",
"
\n",
"
\n",
"
75
\n",
"
high
\n",
"
1930
\n",
"
\n",
"
\n",
"
76
\n",
"
high
\n",
"
1800
\n",
"
\n",
"
\n",
"
77
\n",
"
high
\n",
"
1833
\n",
"
\n",
"
\n",
"
78
\n",
"
high
\n",
"
1683
\n",
"
\n",
"
\n",
"
79
\n",
"
high
\n",
"
1671
\n",
"
\n",
"
\n",
"
80
\n",
"
high
\n",
"
1680
\n",
"
\n",
"
\n",
"
81
\n",
"
high
\n",
"
1692
\n",
"
\n",
"
\n",
"
82
\n",
"
high
\n",
"
1800
\n",
"
\n",
"
\n",
"
83
\n",
"
high
\n",
"
1821
\n",
"
\n",
"
\n",
"
84
\n",
"
high
\n",
"
1882
\n",
"
\n",
"
\n",
"
85
\n",
"
high
\n",
"
1642
\n",
"
\n",
"
\n",
"
86
\n",
"
high
\n",
"
1749
\n",
"
\n",
"
\n",
"
87
\n",
"
high
\n",
"
1712
\n",
"
\n",
"
\n",
"
88
\n",
"
high
\n",
"
1661
\n",
"
\n",
"
\n",
"
89
\n",
"
high
\n",
"
1701
\n",
"
\n",
"
\n",
"
90
\n",
"
high
\n",
"
2141
\n",
"
\n",
"
\n",
"
91
\n",
"
high
\n",
"
1863
\n",
"
\n",
"
\n",
"
92
\n",
"
high
\n",
"
1752
\n",
"
\n",
"
\n",
"
93
\n",
"
high
\n",
"
1740
\n",
"
\n",
"
\n",
"
94
\n",
"
high
\n",
"
1721
\n",
"
\n",
"
\n",
"
95
\n",
"
high
\n",
"
1660
\n",
"
\n",
"
\n",
"
96
\n",
"
high
\n",
"
1930
\n",
"
\n",
"
\n",
"
97
\n",
"
high
\n",
"
2030
\n",
"
\n",
"
\n",
"
98
\n",
"
high
\n",
"
1851
\n",
"
\n",
"
\n",
"
99
\n",
"
high
\n",
"
2131
\n",
"
\n",
"
\n",
"
100
\n",
"
high
\n",
"
1828
\n",
"
\n",
" \n",
"
\n",
"
101 rows × 2 columns
\n",
"
"
],
"text/plain": [
" food density cross-sectional area (sq micron)\n",
"0 low 1840\n",
"1 low 2090\n",
"2 low 2169\n",
"3 low 1988\n",
"4 low 2212\n",
"5 low 2339\n",
"6 low 1989\n",
"7 low 2144\n",
"8 low 2290\n",
"9 low 1920\n",
"10 low 2280\n",
"11 low 1809\n",
"12 low 2158\n",
"13 low 1800\n",
"14 low 2133\n",
"15 low 2060\n",
"16 low 2160\n",
"17 low 2001\n",
"18 low 2030\n",
"19 low 2088\n",
"20 low 1951\n",
"21 low 2460\n",
"22 low 2021\n",
"23 low 2010\n",
"24 low 2139\n",
"25 low 2160\n",
"26 low 2106\n",
"27 low 2171\n",
"28 low 2113\n",
"29 low 2179\n",
".. ... ...\n",
"71 high 1787\n",
"72 high 1840\n",
"73 high 1821\n",
"74 high 1910\n",
"75 high 1930\n",
"76 high 1800\n",
"77 high 1833\n",
"78 high 1683\n",
"79 high 1671\n",
"80 high 1680\n",
"81 high 1692\n",
"82 high 1800\n",
"83 high 1821\n",
"84 high 1882\n",
"85 high 1642\n",
"86 high 1749\n",
"87 high 1712\n",
"88 high 1661\n",
"89 high 1701\n",
"90 high 2141\n",
"91 high 1863\n",
"92 high 1752\n",
"93 high 1740\n",
"94 high 1721\n",
"95 high 1660\n",
"96 high 1930\n",
"97 high 2030\n",
"98 high 1851\n",
"99 high 2131\n",
"100 high 1828\n",
"\n",
"[101 rows x 2 columns]"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.melt(df, var_name='food density', \n",
" value_name='cross-sectional area (sq micron)').dropna()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Saving this `DataFrame` takes a bit more storage, but it is much clearer. For example, it does not rely on comments to give the units of the measurements.\n",
"\n",
"Not to worry, the default plotting still works with this format."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"ax = df.plot(kind='box')\n",
"plt.ylabel('X-sect. area (µm$^2$)')\n",
"ax.xaxis.grid(False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}