{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lesson 30: Introduction to Pandas\n",
"\n",
"(c) 2016 Justin Bois. This work is licensed under a [Creative Commons Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/). All code contained herein is licensed under an [MIT license](https://opensource.org/licenses/MIT).\n",
"\n",
"*This lesson was generated from a Jupyter notebook. You can download the notebook [here](l30_intro_to_pandas.ipynb).*"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Pandas, conventionally imported as pd\n",
"import pandas as pd\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"rc={'lines.linewidth': 2, 'axes.labelsize': 18, 'axes.titlesize': 18}\n",
"sns.set(rc=rc)\n",
"\n",
"# The following is specific to Jupyter notebooks\n",
"%matplotlib inline\n",
"%config InlineBackend.figure_formats = {'png', 'retina'}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Throughout your research career, you will undoubtedly need to handle data, possibly lots of it. Data come in many formats, and you will spend much of your time **munging** (or **wrangling**) them into a usable form. We already did a little bit of munging when we sorted the *C. elegans* egg cross-sectional areas to compute cumulative histograms.\n",
"\n",
"Pandas is the primary tool in the SciPy stack for handling data. Its central object, the `DataFrame`, is extremely useful for munging data. We will explore some of that functionality here, and will put it to use in the next lesson."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Munging the egg cross-sectional area data\n",
"\n",
"It is often useful to use the same data set to learn new things, since you are already familiar with the data. We'll keep using the egg cross-sectional area data set. This time, we will use Pandas to import the data, via the very handy (and faaaaar more powerful than `np.loadtxt()`) function `pd.read_csv()`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Read in data files with pandas\n",
"df_high = pd.read_csv('data/xa_high_food.csv', comment='#')\n",
"df_low = pd.read_csv('data/xa_low_food.csv', comment='#')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Almost the same syntax as `np.loadtxt()`, but notice that the kwarg is `comment` and not `comments`. Let's now look at what we have."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" 1683\n",
"0 2061\n",
"1 1792\n",
"2 1852\n",
"3 2091\n",
"4 1781\n",
"5 1912\n",
"6 1802\n",
"7 1751\n",
"8 1731\n",
"9 1892\n",
"10 1951\n",
"11 1809\n",
"12 1683\n",
"13 1787\n",
"14 1840\n",
"15 1821\n",
"16 1910\n",
"17 1930\n",
"18 1800\n",
"19 1833\n",
"20 1683\n",
"21 1671\n",
"22 1680\n",
"23 1692\n",
"24 1800\n",
"25 1821\n",
"26 1882\n",
"27 1642\n",
"28 1749\n",
"29 1712\n",
"30 1661\n",
"31 1701\n",
"32 2141\n",
"33 1863\n",
"34 1752\n",
"35 1740\n",
"36 1721\n",
"37 1660\n",
"38 1930\n",
"39 2030\n",
"40 1851\n",
"41 2131\n",
"42 1828"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_high"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whoa! IPython is displaying this in a different way! What kind of data type is this?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df_low)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas has loaded the data in as a `DataFrame`. As I mentioned before, this is the central object for handling data.\n",
"\n",
"We see that we have a bold column heading (**`1683`**) and bold row indices. Pandas interpreted the first non-comment line as a label for a column. We need to tell it not to do that using the `header` kwarg. Let's re-load the data."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" 0\n",
"0 1683\n",
"1 2061\n",
"2 1792\n",
"3 1852\n",
"4 2091\n",
"5 1781\n",
"6 1912\n",
"7 1802\n",
"8 1751\n",
"9 1731\n",
"10 1892\n",
"11 1951\n",
"12 1809\n",
"13 1683\n",
"14 1787\n",
"15 1840\n",
"16 1821\n",
"17 1910\n",
"18 1930\n",
"19 1800\n",
"20 1833\n",
"21 1683\n",
"22 1671\n",
"23 1680\n",
"24 1692\n",
"25 1800\n",
"26 1821\n",
"27 1882\n",
"28 1642\n",
"29 1749\n",
"30 1712\n",
"31 1661\n",
"32 1701\n",
"33 2141\n",
"34 1863\n",
"35 1752\n",
"36 1740\n",
"37 1721\n",
"38 1660\n",
"39 1930\n",
"40 2030\n",
"41 1851\n",
"42 2131\n",
"43 1828"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Read in data files with pandas with no row headings.\n",
"df_high = pd.read_csv('data/xa_high_food.csv', comment='#', header=None)\n",
"df_low = pd.read_csv('data/xa_low_food.csv', comment='#', header=None)\n",
"\n",
"# Now look at it\n",
"df_high"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that Pandas has assigned a column heading of `0` to the column of data. What happens if we index it?"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 1683\n",
"1 2061\n",
"2 1792\n",
"3 1852\n",
"4 2091\n",
"5 1781\n",
"6 1912\n",
"7 1802\n",
"8 1751\n",
"9 1731\n",
"10 1892\n",
"11 1951\n",
"12 1809\n",
"13 1683\n",
"14 1787\n",
"15 1840\n",
"16 1821\n",
"17 1910\n",
"18 1930\n",
"19 1800\n",
"20 1833\n",
"21 1683\n",
"22 1671\n",
"23 1680\n",
"24 1692\n",
"25 1800\n",
"26 1821\n",
"27 1882\n",
"28 1642\n",
"29 1749\n",
"30 1712\n",
"31 1661\n",
"32 1701\n",
"33 2141\n",
"34 1863\n",
"35 1752\n",
"36 1740\n",
"37 1721\n",
"38 1660\n",
"39 1930\n",
"40 2030\n",
"41 1851\n",
"42 2131\n",
"43 1828\n",
"Name: 0, dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_high[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whoa again! What is this?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df_high[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Pandas `Series` is basically a one-dimensional `DataFrame`. When we index it, we get what we might expect."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1683"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_high[0][0]"
]
},
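{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, if we ever need a plain NumPy array (say, for a function that does not accept a `Series`), the `values` attribute of a `Series` gives the underlying array. A quick sketch, using a few of the high-food values:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# A small Series standing in for df_high[0]\n",
"s = pd.Series([1683, 2061, 1792])\n",
"\n",
"# .values exposes the underlying NumPy array\n",
"arr = s.values\n",
"```"
]
},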
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### An aside: Football and learning what Pandas Series and DataFrames are\n",
"\n",
"We can think of a Pandas `Series` as a generalized NumPy array. NumPy arrays are indexed with integers, but Pandas `Series` may be indexed with anything. As an example, we'll create a `Series` from a dictionary."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Fontaine 13\n",
"Klinsmann 11\n",
"Klose 16\n",
"Kocsis 11\n",
"Müller 14\n",
"Pelé 12\n",
"Ronaldo 15\n",
"dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Dictionary of top men's World Cup scorers and how many goals\n",
"wc_dict = {'Klose': 16,\n",
" 'Ronaldo': 15,\n",
" 'Müller': 14,\n",
" 'Fontaine': 13,\n",
" 'Pelé': 12,\n",
" 'Kocsis': 11,\n",
" 'Klinsmann': 11}\n",
"\n",
"# Create a Series from the dictionary\n",
"s_goals = pd.Series(wc_dict)\n",
"\n",
"# Take a look\n",
"s_goals"
]
},
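{
"cell_type": "markdown",
"metadata": {},
"source": [
"A dictionary is not the only way to build a `Series`; we can also pass the values and the index separately. A sketch with a couple of the same footballers:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Same idea as the dictionary construction, with an explicit index\n",
"s = pd.Series([16, 15], index=['Klose', 'Ronaldo'])\n",
"```"
]
},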
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that indexing now works like the dictionary we used to create it."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"16"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s_goals['Klose']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, what if we wanted to add more data to this, such as the country that each player represented? We can make another `Series`."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Fontaine France\n",
"Klinsmann Germany\n",
"Klose Germany\n",
"Kocsis Hungary\n",
"Müller Germany\n",
"Pelé Brazil\n",
"Ronaldo Brazil\n",
"dtype: object"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Dictionary of nations\n",
"nation_dict = {'Klose': 'Germany',\n",
" 'Ronaldo': 'Brazil',\n",
" 'Müller': 'Germany',\n",
" 'Fontaine': 'France',\n",
" 'Pelé': 'Brazil',\n",
" 'Kocsis': 'Hungary',\n",
" 'Klinsmann': 'Germany'}\n",
"\n",
"# Series with nations\n",
"s_nation = pd.Series(nation_dict)\n",
"\n",
"# Look at it\n",
"s_nation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can combine these into a `DataFrame`. We use `pd.DataFrame()` to instantiate a `DataFrame`, passing in a dictionary whose keys are the column headers and values are the series we're building into a `DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" goals nation\n",
"Fontaine 13 France\n",
"Klinsmann 11 Germany\n",
"Klose 16 Germany\n",
"Kocsis 11 Hungary\n",
"Müller 14 Germany\n",
"Pelé 12 Brazil\n",
"Ronaldo 15 Brazil"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Combine into a DataFrame\n",
"df_wc = pd.DataFrame({'nation': s_nation, 'goals': s_goals})\n",
"\n",
"# Take a look\n",
"df_wc"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice now that the `DataFrame` is indexed by player names, and the column headings are `goals` and `nation`. When using bracket notation, we cannot directly use the row indices, only the column headings."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"ename": "KeyError",
"evalue": "'Fontaine'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf_wc\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'Fontaine'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/Users/Justin/anaconda/lib/python3.4/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 1795\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1796\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1797\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_column\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1798\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1799\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_getitem_column\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/Justin/anaconda/lib/python3.4/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m_getitem_column\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 1802\u001b[0m \u001b[0;31m# get column\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1803\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_unique\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1804\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_get_item_cache\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1805\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1806\u001b[0m \u001b[0;31m# duplicate columns & possible reduce dimensionaility\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/Justin/anaconda/lib/python3.4/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36m_get_item_cache\u001b[0;34m(self, item)\u001b[0m\n\u001b[1;32m 1082\u001b[0m \u001b[0mres\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcache\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1083\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mres\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1084\u001b[0;31m \u001b[0mvalues\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_data\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1085\u001b[0m \u001b[0mres\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_box_item_values\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalues\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1086\u001b[0m \u001b[0mcache\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mres\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/Justin/anaconda/lib/python3.4/site-packages/pandas/core/internals.py\u001b[0m in \u001b[0;36mget\u001b[0;34m(self, item, fastpath)\u001b[0m\n\u001b[1;32m 2849\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2850\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0misnull\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2851\u001b[0;31m \u001b[0mloc\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mitem\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2852\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2853\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marange\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0misnull\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mitems\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/Justin/anaconda/lib/python3.4/site-packages/pandas/core/index.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method)\u001b[0m\n\u001b[1;32m 1570\u001b[0m \"\"\"\n\u001b[1;32m 1571\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mmethod\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1572\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_values_from_object\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1573\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1574\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_indexer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmethod\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mmethod\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32mpandas/index.pyx\u001b[0m in \u001b[0;36mpandas.index.IndexEngine.get_loc (pandas/index.c:3824)\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/index.pyx\u001b[0m in \u001b[0;36mpandas.index.IndexEngine.get_loc (pandas/index.c:3704)\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/hashtable.pyx\u001b[0m in \u001b[0;36mpandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12280)\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/hashtable.pyx\u001b[0m in \u001b[0;36mpandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12231)\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: 'Fontaine'"
]
}
],
"source": [
"df_wc['Fontaine']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But we can index by columns."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Fontaine 13\n",
"Klinsmann 11\n",
"Klose 16\n",
"Kocsis 11\n",
"Müller 14\n",
"Pelé 12\n",
"Ronaldo 15\n",
"Name: goals, dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_wc['goals']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we just want the goals and nation of Fontaine, we would use the `.loc` attribute of a `DataFrame`, which allows slice-like indexing."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"goals 13\n",
"nation France\n",
"Name: Fontaine, dtype: object"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_wc.loc['Fontaine', :]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When using `.loc`, the first entry is the index and the second is the column. So, we asked Pandas to give us the row indexed with `'Fontaine'` and all columns (using the colon, as in NumPy arrays).\n",
"\n",
"We can look at only the German players, for instance, using Boolean indexing similar to what we do with NumPy arrays."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" goals nation\n",
"Klinsmann 11 Germany\n",
"Klose 16 Germany\n",
"Müller 14 Germany"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_wc.loc[df_wc['nation']=='Germany', :]"
]
},
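{
"cell_type": "markdown",
"metadata": {},
"source": [
"The row and column specifications of `.loc` can be mixed; for example, a Boolean row mask together with a single column heading returns a `Series`. A sketch with the same footballer data:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({'goals': [16, 15, 14],\n",
"                   'nation': ['Germany', 'Brazil', 'Germany']},\n",
"                  index=['Klose', 'Ronaldo', 'Müller'])\n",
"\n",
"# Boolean rows plus a single column gives a Series of German players' goals\n",
"german_goals = df.loc[df['nation'] == 'Germany', 'goals']\n",
"```"
]
},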
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Combining the cross-sectional area data into a `DataFrame`\n",
"\n",
"Now, back to our cross-sectional area data. We can combine these data into a `DataFrame` as well, *even though they have differing numbers of data points*. However, because `df_low` and `df_high` are not `Series`, we cannot just use the method we used for the footballers. Instead, we will use the `pd.concat()` function to concatenate `DataFrame`s.\n",
"\n",
"An added complication is that, as they are now, the two `DataFrame`s have the same column heading of `0`. We should change that for each before concatenating."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" high\n",
"0 1683\n",
"1 2061\n",
"2 1792\n",
"3 1852\n",
"4 2091\n",
"5 1781\n",
"6 1912\n",
"7 1802\n",
"8 1751\n",
"9 1731\n",
"10 1892\n",
"11 1951\n",
"12 1809\n",
"13 1683\n",
"14 1787\n",
"15 1840\n",
"16 1821\n",
"17 1910\n",
"18 1930\n",
"19 1800\n",
"20 1833\n",
"21 1683\n",
"22 1671\n",
"23 1680\n",
"24 1692\n",
"25 1800\n",
"26 1821\n",
"27 1882\n",
"28 1642\n",
"29 1749\n",
"30 1712\n",
"31 1661\n",
"32 1701\n",
"33 2141\n",
"34 1863\n",
"35 1752\n",
"36 1740\n",
"37 1721\n",
"38 1660\n",
"39 1930\n",
"40 2030\n",
"41 1851\n",
"42 2131\n",
"43 1828"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Change column headings\n",
"df_low.columns = ['low']\n",
"df_high.columns = ['high']\n",
"\n",
"# Take a look\n",
"df_high"
]
},
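{
"cell_type": "markdown",
"metadata": {},
"source": [
"An alternative to overwriting `df.columns` wholesale is the `rename()` method, which maps only the columns you specify and returns a new `DataFrame`, leaving the original unchanged. A sketch with a couple of made-up values:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({0: [1683, 2061]})\n",
"\n",
"# rename() returns a new DataFrame; the original is unchanged\n",
"df_renamed = df.rename(columns={0: 'high'})\n",
"```"
]
},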
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can concatenate the two `DataFrame`s into one. We pass a tuple of the `DataFrame`s we want to concatenate as the argument. We specify the kwarg `axis=1` to indicate that we want two columns, instead of appending the second `DataFrame` at the bottom of the first (as we would get with `axis=0`)."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" low high\n",
"0 1840 1683.0\n",
"1 2090 2061.0\n",
"2 2169 1792.0\n",
"3 1988 1852.0\n",
"4 2212 2091.0\n",
"5 2339 1781.0\n",
"6 1989 1912.0\n",
"7 2144 1802.0\n",
"8 2290 1751.0\n",
"9 1920 1731.0\n",
"10 2280 1892.0\n",
"11 1809 1951.0\n",
"12 2158 1809.0\n",
"13 1800 1683.0\n",
"14 2133 1787.0\n",
"15 2060 1840.0\n",
"16 2160 1821.0\n",
"17 2001 1910.0\n",
"18 2030 1930.0\n",
"19 2088 1800.0\n",
"20 1951 1833.0\n",
"21 2460 1683.0\n",
"22 2021 1671.0\n",
"23 2010 1680.0\n",
"24 2139 1692.0\n",
"25 2160 1800.0\n",
"26 2106 1821.0\n",
"27 2171 1882.0\n",
"28 2113 1642.0\n",
"29 2179 1749.0\n",
"30 1890 1712.0\n",
"31 2179 1661.0\n",
"32 2021 1701.0\n",
"33 1969 2141.0\n",
"34 2150 1863.0\n",
"35 1900 1752.0\n",
"36 2267 1740.0\n",
"37 1711 1721.0\n",
"38 1901 1660.0\n",
"39 2114 1930.0\n",
"40 2112 2030.0\n",
"41 2361 1851.0\n",
"42 2130 2131.0\n",
"43 2061 1828.0\n",
"44 2121 NaN\n",
"45 1832 NaN\n",
"46 2210 NaN\n",
"47 2130 NaN\n",
"48 2153 NaN\n",
"49 2009 NaN\n",
"50 2100 NaN\n",
"51 2252 NaN\n",
"52 2143 NaN\n",
"53 2252 NaN\n",
"54 2222 NaN\n",
"55 2121 NaN\n",
"56 2409 NaN"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Concatenate DataFrames\n",
"df = pd.concat((df_low, df_high), axis=1)\n",
"\n",
"# See the result\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the shorter of the two columns was padded with `NaN`, which stands for \"not a number\" and is how Pandas represents missing data."
]
},
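{
"cell_type": "markdown",
"metadata": {},
"source": [
"The missing entries are easy to count. Here is a minimal sketch (the tiny `DataFrame` and its values are made up for illustration; only the `low`/`high` column names match our data):\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Tiny illustrative DataFrame with one missing value\n",
"df_ex = pd.DataFrame({'low': [1840, 2090, 2121],\n",
"                      'high': [1683.0, 2061.0, np.nan]})\n",
"\n",
"# isnull() flags NaN entries elementwise; sum() counts them per column\n",
"print(df_ex.isnull().sum())\n",
"```"
]
},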
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Outputting a new CSV file\n",
"\n",
"We can now write out a single CSV file from the `DataFrame` using its `to_csv()` method. We use the `index=False` kwarg to ask Pandas not to write the indices to the file."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Write out DataFrame\n",
"df.to_csv('xa_combined.csv', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at what this file looks like."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"low,high\r\n",
"1840,1683.0\r\n",
"2090,2061.0\r\n",
"2169,1792.0\r\n",
"1988,1852.0\r\n",
"2212,2091.0\r\n",
"2339,1781.0\r\n",
"1989,1912.0\r\n",
"2144,1802.0\r\n",
"2290,1751.0\r\n"
]
}
],
"source": [
"!head xa_combined.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The header row is included, so when we load the file with `pd.read_csv()`, we directly get a convenient `DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"(HTML table rendering was garbled in extraction; the same DataFrame appears in the text/plain output below.)"
],
"text/plain": [
" low high\n",
"0 1840 1683.0\n",
"1 2090 2061.0\n",
"2 2169 1792.0\n",
"3 1988 1852.0\n",
"4 2212 2091.0\n",
"5 2339 1781.0\n",
"6 1989 1912.0\n",
"7 2144 1802.0\n",
"8 2290 1751.0\n",
"9 1920 1731.0\n",
"10 2280 1892.0\n",
"11 1809 1951.0\n",
"12 2158 1809.0\n",
"13 1800 1683.0\n",
"14 2133 1787.0\n",
"15 2060 1840.0\n",
"16 2160 1821.0\n",
"17 2001 1910.0\n",
"18 2030 1930.0\n",
"19 2088 1800.0\n",
"20 1951 1833.0\n",
"21 2460 1683.0\n",
"22 2021 1671.0\n",
"23 2010 1680.0\n",
"24 2139 1692.0\n",
"25 2160 1800.0\n",
"26 2106 1821.0\n",
"27 2171 1882.0\n",
"28 2113 1642.0\n",
"29 2179 1749.0\n",
"30 1890 1712.0\n",
"31 2179 1661.0\n",
"32 2021 1701.0\n",
"33 1969 2141.0\n",
"34 2150 1863.0\n",
"35 1900 1752.0\n",
"36 2267 1740.0\n",
"37 1711 1721.0\n",
"38 1901 1660.0\n",
"39 2114 1930.0\n",
"40 2112 2030.0\n",
"41 2361 1851.0\n",
"42 2130 2131.0\n",
"43 2061 1828.0\n",
"44 2121 NaN\n",
"45 1832 NaN\n",
"46 2210 NaN\n",
"47 2130 NaN\n",
"48 2153 NaN\n",
"49 2009 NaN\n",
"50 2100 NaN\n",
"51 2252 NaN\n",
"52 2143 NaN\n",
"53 2252 NaN\n",
"54 2222 NaN\n",
"55 2121 NaN\n",
"56 2409 NaN"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load DataFrame\n",
"df_reloaded = pd.read_csv('xa_combined.csv')\n",
"\n",
"# Take a look\n",
"df_reloaded"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A note on tidy data\n",
"Hadley Wickham wrote a [great article](http://www.jstatsoft.org/v59/i10/paper) in favor of \"tidy data.\" Tidy `DataFrame`s follow the rules:\n",
"1. Each variable is a column.\n",
"2. Each observation is a row.\n",
"3. Each type of observation has its own separate `DataFrame`.\n",
"\n",
"This is less pretty to visualize as a table, but we rarely look at data in tables. In my experience since taking Wickham's words to heart, I can say that keeping tidy `DataFrame`s makes life much easier!\n",
"\n",
"A tidy `DataFrame` of the cross-sectional area data would have two columns, \"food density\" and \"cross-sectional area (sq micron).\" Remember, each variable is a column. Each row, then, corresponds to an individual measurement. This results in many repeated entries (we'll have 44 `high`s), but it is logically clean and easy to work with. Let's tidy our `DataFrame`. The `pd.melt()` function makes this easy."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"df = pd.melt(df, var_name='food density', \n",
" value_name='cross-sectional area (sq micron)').dropna()"
]
},
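{
"cell_type": "markdown",
"metadata": {},
"source": [
"One payoff of the tidy format: split-apply-combine operations become one-liners. A minimal sketch with made-up values, assuming the column names produced by the `pd.melt()` call above:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Tiny tidy DataFrame with the same columns as our melted data\n",
"df_tidy = pd.DataFrame(\n",
"    {'food density': ['low', 'low', 'high', 'high'],\n",
"     'cross-sectional area (sq micron)': [2000.0, 2100.0, 1700.0, 1800.0]})\n",
"\n",
"# Mean cross-sectional area for each food density\n",
"print(df_tidy.groupby('food density').mean())\n",
"```"
]
},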
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we used the `dropna()` method to drop any row that contained a `NaN`. Saving this `DataFrame` takes a bit more storage because of the repeated entries, but its structure is much clearer."
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}