{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 14: Tidy data and split-apply-combine\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "In the last lesson, we learned about Pandas and dipped our toe in to see its power. In this lesson, we will continue to harness the power of Pandas to pull out subsets of data we are interested in." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tidy data\n", "\n", "[Hadley Wickham](https://en.wikipedia.org/wiki/Hadley_Wickham) wrote a [great article](http://dx.doi.org/10.18637/jss.v059.i10) in favor of \"tidy data.\" Tidy data frames follow the rules:\n", "\n", "1. Each variable is a column.\n", "2. Each observation is a row.\n", "3. Each type of observation has its own separate data frame.\n", "\n", "This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always **much** easier to work with than non-tidy formats.\n", "\n", "You may raise some objections about tidy data. Here are a few, and my responses.\n", "\n", "**Objection:** Looking at a table of tidy data is ugly. It is not intuitively organized. I would almost never display a tidy data table in a publication.\n", "\n", "**Response:** Correct! Having tabular data in a format that is easy to read as a human studying a table is a very different thing than having it in a format that is easy to explore and work with using a computer. As [Daniel Chen](https://chendaniely.github.io) put it, \"There are data formats that are better for reporting and data formats that are better for analysis.\" We are using the tidy data frames for analysis, not reporting (though we will see in the coming lessons that having the data in a tidy format makes making plots much easier, and plots are a key medium for reporting.)\n", "\n", "
\n", "\n", "**Objection:** Isn't it better to sometimes have data arranged in other ways? Say in a matrix?\n", "\n", "**Response:** This is certainly true for things like images, or raster-style data in general. It makes more sense to organize an image in a 2D matrix than to have it organized as a data frame with three columns (row in image, column in image, intensity of pixel), where each row corresponds to a single pixel. For an image, indexing it by row and column is always unambiguous, `my_image[i, j]` means the pixel at row `i` and column `j`.\n", "\n", "For other data, though, the matrix layout suffers from the fact that there may be more than one way to construct a matrix. If you know a data frame is tidy, you already know its structure. You need only to ask what the columns are, and then you immediately know how to access data using Boolean indexing. In other formats, you might have to read and write extensive comments to understand the structure of the data. Of course, you *can* read and write comments, but it opens the door for the possibility of misinterpretation or mistakes.\n", "\n", "
\n", "\n", "**Objection:** But what about time series? Clearly, that can be in matrix format. One column is time, and then subsequent columns are observations made at that time.\n", "\n", "**Response:** Yes, that is true. But then the matrix-style described could be considered tidy, since each row is a single observation (time point) that has many facets.\n", "\n", "
\n", "\n", "**Objection:** Isn't this an inefficient use of memory? There tend to be lots of repeated entries in tidy data frames.\n", "\n", "**Response:** Yes, there are more efficient ways of storing and accessing data. But for data sets that are not \"big data,\" this is seldom a real issue. The extra expense in memory, as well as the extra expense in access, are small prices to pay for the simplicity and speed of the human user in accessing the data.\n", "\n", "
\n", "\n", "**Objection:** Once it's tidy, we pretty much have to use Boolean indexing to get what we want, and that can be slower than other methods of accessing data. What about performance?\n", "\n", "**Response:** See the previous response. Speed of access really only becomes a problem with big, high-throughput data sets. In those cases, there are often many things you need to be clever about beyond organization of your data.\n", "\n", "
\n", "\n", "**Conclusion:** I really think that tidying a data set allows for fluid exploration. We will focus on tidy data sets going forward. The techniques for bringing untidy data into tidy format use many of Pandas's functions, but are largely beyond the scope of this bootcamp. You will explore that a little bit in the exercises, but for most of the bootcamp, our data sets are already tidy." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The data set\n", "\n", "We will again use the data set from the [Beattie, et al. paper](https://doi.org/10.1098/rsos.160321) on facial matching under sleep deprivation. Let's load in the original data set and add the column on insomnia as we did in the [last lesson](l17_intro_to_pandas.ipynb)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
participant numbergenderagecorrect hit percentagecorrect reject percentagepercent correctconfidence when correct hitconfidence when incorrect hitconfidence when correct rejectconfidence when incorrect rejectconfidence when correctconfidence when incorrectscipsqiessinsomnia
08f39658072.591.090.093.083.593.090.09132True
116m42909090.075.555.570.550.075.050.04117True
218f31909592.589.590.086.081.089.088.01093True
322f351007587.589.5NaN71.080.088.080.013820True
427f74606562.568.549.061.049.065.049.013912True
\n", "
" ], "text/plain": [ " participant number gender age correct hit percentage \\\n", "0 8 f 39 65 \n", "1 16 m 42 90 \n", "2 18 f 31 90 \n", "3 22 f 35 100 \n", "4 27 f 74 60 \n", "\n", " correct reject percentage percent correct confidence when correct hit \\\n", "0 80 72.5 91.0 \n", "1 90 90.0 75.5 \n", "2 95 92.5 89.5 \n", "3 75 87.5 89.5 \n", "4 65 62.5 68.5 \n", "\n", " confidence when incorrect hit confidence when correct reject \\\n", "0 90.0 93.0 \n", "1 55.5 70.5 \n", "2 90.0 86.0 \n", "3 NaN 71.0 \n", "4 49.0 61.0 \n", "\n", " confidence when incorrect reject confidence when correct \\\n", "0 83.5 93.0 \n", "1 50.0 75.0 \n", "2 81.0 89.0 \n", "3 80.0 88.0 \n", "4 49.0 65.0 \n", "\n", " confidence when incorrect sci psqi ess insomnia \n", "0 90.0 9 13 2 True \n", "1 50.0 4 11 7 True \n", "2 88.0 10 9 3 True \n", "3 80.0 13 8 20 True \n", "4 49.0 13 9 12 True " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('data/gfmt_sleep.csv', na_values='*')\n", "df['insomnia'] = df['sci'] <= 16\n", "\n", "# Take a look\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This data set is in tidy format. Each row represents a single test on a single participant. The aspects of that person's test are given in each column. We already saw the power of having the data in this format when we did Boolean indexing in the last lesson. Now, we will see how this format allows use to easily do an operation we do again and again with data sets, **split-apply-combine**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Split-apply-combine\n", "\n", "Let's say we want to compute the median percent correct face matchings for subjects with insomnia and the median percent correct face matchings for normal sleepers. Ignoring for the second the mechanics of how we would do this with Python, let's think about it in English. What do we need to do?\n", "\n", "1. 
**Split** the data set up according to the `'insomnia'` field, i.e., split it up so we have a separate data set for the two classes of subjects, those with insomnia and those without.\n", "2. **Apply** a median function to the activity in these split data sets.\n", "3. **Combine** the results of these averages on the split data set into a new, summary data set that contains the two classes (insomniac and not) and medians for each.\n", "\n", "We see that the strategy we want is a **split-apply-combine** strategy. This idea was put forward by Hadley Wickham in [this paper](http://dx.doi.org/10.18637/jss.v040.i01). It turns out that this is a strategy we want to use *very* often. Split the data in terms of some criterion. Apply some function to the split-up data. Combine the results into a new data frame.\n", "\n", "Note that if the data are tidy, this procedure makes a lot of sense. Choose the column you want to use to split by. All rows with like entries in the splitting column are then grouped into a new data set. You can then apply any function you want into these new data sets. You can then combine the results into a new data frame.\n", "\n", "Pandas's split-apply-combine operations are achieved using the `groupby()` method. You can think of `groupby()` as the splitting part. You can then apply functions to the resulting `DataFrameGroupBy` object. The [Pandas documentation on split-apply-combine](http://pandas.pydata.org/pandas-docs/stable/groupby.html) is excellent and worth reading through. It is extensive though, so don't let yourself get intimidated by it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Aggregation: Median percent correct\n", "\n", "Let's go ahead and do our first split-apply-combine on this tidy data set. First, we will split the data set up by insomnia condition." 
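These three steps can be sketched on a tiny made-up data frame before we apply them to the sleep study data (the names `df_toy`, `'group'`, and `'value'` here are hypothetical):

```python
import pandas as pd

# Made-up toy data: two groups, one measured quantity
df_toy = pd.DataFrame({
    'group': ['x', 'x', 'y', 'y', 'y'],
    'value': [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Split by 'group', apply the median, combine into a new series
med = df_toy.groupby('group')['value'].median()

print(med['x'])  # median of [1.0, 3.0] -> 2.0
print(med['y'])  # median of [2.0, 4.0, 6.0] -> 4.0
```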
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped = df.groupby('insomnia')\n", "\n", "# Take a look\n", "grouped" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is not much to see in the `DataFrameGroupBy` object that resulted. But there is a lot we can do with this object. Typing `grouped.` and hitting tab will show you the many possibilities. For most of these possibilities, the apply and combine steps happen together and a new data frame is returned. The `grouped.median()` method is exactly what we want." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
participant numberagecorrect hit percentagecorrect reject percentagepercent correctconfidence when correct hitconfidence when incorrect hitconfidence when correct rejectconfidence when incorrect rejectconfidence when correctconfidence when incorrectscipsqiess
insomnia
False5436908085.074.555.571.559.075.059.252646
True4639907575.076.572.071.068.577.065.001497
\n", "
" ], "text/plain": [ " participant number age correct hit percentage \\\n", "insomnia \n", "False 54 36 90 \n", "True 46 39 90 \n", "\n", " correct reject percentage percent correct \\\n", "insomnia \n", "False 80 85.0 \n", "True 75 75.0 \n", "\n", " confidence when correct hit confidence when incorrect hit \\\n", "insomnia \n", "False 74.5 55.5 \n", "True 76.5 72.0 \n", "\n", " confidence when correct reject confidence when incorrect reject \\\n", "insomnia \n", "False 71.5 59.0 \n", "True 71.0 68.5 \n", "\n", " confidence when correct confidence when incorrect sci psqi ess \n", "insomnia \n", "False 75.0 59.25 26 4 6 \n", "True 77.0 65.00 14 9 7 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_median = grouped.median()\n", "\n", "# Take a look\n", "df_median" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The outputted data frame has the medians of all quantities, including the `percent correct` that we wanted. Note that this data frame has `insomnia` as the name of the row index. If we want to instead keep `insomnia` (which, remember, is what we used to split up the data set before we computed the summary statistics) as a column, we can use the `reset_index()` method." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
insomniaparticipant numberagecorrect hit percentagecorrect reject percentagepercent correctconfidence when correct hitconfidence when incorrect hitconfidence when correct rejectconfidence when incorrect rejectconfidence when correctconfidence when incorrectscipsqiess
0False5436908085.074.555.571.559.075.059.252646
1True4639907575.076.572.071.068.577.065.001497
\n", "
" ], "text/plain": [ " insomnia participant number age correct hit percentage \\\n", "0 False 54 36 90 \n", "1 True 46 39 90 \n", "\n", " correct reject percentage percent correct confidence when correct hit \\\n", "0 80 85.0 74.5 \n", "1 75 75.0 76.5 \n", "\n", " confidence when incorrect hit confidence when correct reject \\\n", "0 55.5 71.5 \n", "1 72.0 71.0 \n", "\n", " confidence when incorrect reject confidence when correct \\\n", "0 59.0 75.0 \n", "1 68.5 77.0 \n", "\n", " confidence when incorrect sci psqi ess \n", "0 59.25 26 4 6 \n", "1 65.00 14 9 7 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_median.reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note, though, that this was not done in-place. `df_median` still has an index labeled `'insomnia'`. If you want to update your data frame, you have to explicitly do so." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "df_median = df_median.reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use multiple columns in our `groupby()` operation. For example, we may wish to look at four groups, male insomniacs, female insomniacs, male non-insomniacs, and female non-insomniacs. To do this, we simply pass in a list of columns into `df.groupby()`. We will **chain the methods**, performing a groupby, applying a median, and then resetting the index of the result, all in one line." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
genderinsomniaparticipant numberagecorrect hit percentagecorrect reject percentagepercent correctconfidence when correct hitconfidence when incorrect hitconfidence when correct rejectconfidence when incorrect rejectconfidence when correctconfidence when incorrectscipsqiess
0fFalse58.036.085.080.085.0074.0055.0070.5060.0074.0058.7526.04.07.0
1fTrue46.039.080.075.072.5076.5073.7571.0068.5077.0070.5014.09.07.0
2mFalse41.038.590.080.082.5076.0057.7574.2554.7576.2559.2529.03.06.0
3mTrue55.537.095.082.583.7583.7555.5075.7573.2581.2562.5014.09.08.0
\n", "
" ], "text/plain": [ " gender insomnia participant number age correct hit percentage \\\n", "0 f False 58.0 36.0 85.0 \n", "1 f True 46.0 39.0 80.0 \n", "2 m False 41.0 38.5 90.0 \n", "3 m True 55.5 37.0 95.0 \n", "\n", " correct reject percentage percent correct confidence when correct hit \\\n", "0 80.0 85.00 74.00 \n", "1 75.0 72.50 76.50 \n", "2 80.0 82.50 76.00 \n", "3 82.5 83.75 83.75 \n", "\n", " confidence when incorrect hit confidence when correct reject \\\n", "0 55.00 70.50 \n", "1 73.75 71.00 \n", "2 57.75 74.25 \n", "3 55.50 75.75 \n", "\n", " confidence when incorrect reject confidence when correct \\\n", "0 60.00 74.00 \n", "1 68.50 77.00 \n", "2 54.75 76.25 \n", "3 73.25 81.25 \n", "\n", " confidence when incorrect sci psqi ess \n", "0 58.75 26.0 4.0 7.0 \n", "1 70.50 14.0 9.0 7.0 \n", "2 59.25 29.0 3.0 6.0 \n", "3 62.50 14.0 9.0 8.0 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['gender', 'insomnia']).median().reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This type of operation is called an **aggregation**. That is, we split the data set up into groups, and then computed a summary statistic for each group, in this case the median. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transformation\n", "\n", "Instead of summarizing data in a group with single summary statistics by aggregation, we can also do a **transformation** in which each row gets a new entry within a given group. As a simple example, we could generate a column that gives the rank of each participate in terms of percent correct for an insomniac group versus non-insomniac groups. That is, we perform a rank ordering with the insomniac group and within the non-insomniac group." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 11.0\n", "1 21.5\n", "2 23.0\n", "3 19.5\n", "4 3.5\n", "5 2.0\n", "6 17.0\n", "7 6.0\n", "8 21.5\n", "9 12.5\n", "10 3.5\n", "11 24.0\n", "12 15.0\n", "13 15.0\n", "14 5.0\n", "15 8.5\n", "16 8.5\n", "17 15.0\n", "18 18.0\n", "19 12.5\n", "20 8.5\n", "21 25.0\n", "22 19.5\n", "23 1.0\n", "24 8.5\n", "25 15.0\n", "26 41.0\n", "27 19.0\n", "28 9.5\n", "29 72.0\n", " ... \n", "72 52.5\n", "73 74.0\n", "74 67.5\n", "75 76.0\n", "76 4.5\n", "77 72.0\n", "78 25.0\n", "79 25.0\n", "80 34.0\n", "81 1.0\n", "82 52.5\n", "83 41.0\n", "84 52.5\n", "85 25.0\n", "86 6.5\n", "87 15.0\n", "88 41.0\n", "89 67.5\n", "90 41.0\n", "91 15.0\n", "92 61.0\n", "93 61.0\n", "94 2.0\n", "95 52.5\n", "96 4.5\n", "97 25.0\n", "98 52.5\n", "99 19.0\n", "100 11.5\n", "101 6.5\n", "Name: percent correct, Length: 102, dtype: float64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped['percent correct'].rank()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gave us a column of ranks with the indexing of the original data frame preserved. We can put this column into the data frame." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
participant numbergenderagecorrect hit percentagecorrect reject percentagepercent correctconfidence when correct hitconfidence when incorrect hitconfidence when correct rejectconfidence when incorrect rejectconfidence when correctconfidence when incorrectscipsqiessinsomniarank grouped by insomnia
08f39658072.591.090.093.083.593.090.09132True11.0
116m42909090.075.555.570.550.075.050.04117True21.0
218f31909592.589.590.086.081.089.088.01093True23.0
322f351007587.589.5NaN71.080.088.080.013820True19.0
427f74606562.568.549.061.049.065.049.013912True3.0
\n", "
" ], "text/plain": [ " participant number gender age correct hit percentage \\\n", "0 8 f 39 65 \n", "1 16 m 42 90 \n", "2 18 f 31 90 \n", "3 22 f 35 100 \n", "4 27 f 74 60 \n", "\n", " correct reject percentage percent correct confidence when correct hit \\\n", "0 80 72.5 91.0 \n", "1 90 90.0 75.5 \n", "2 95 92.5 89.5 \n", "3 75 87.5 89.5 \n", "4 65 62.5 68.5 \n", "\n", " confidence when incorrect hit confidence when correct reject \\\n", "0 90.0 93.0 \n", "1 55.5 70.5 \n", "2 90.0 86.0 \n", "3 NaN 71.0 \n", "4 49.0 61.0 \n", "\n", " confidence when incorrect reject confidence when correct \\\n", "0 83.5 93.0 \n", "1 50.0 75.0 \n", "2 81.0 89.0 \n", "3 80.0 88.0 \n", "4 49.0 65.0 \n", "\n", " confidence when incorrect sci psqi ess insomnia \\\n", "0 90.0 9 13 2 True \n", "1 50.0 4 11 7 True \n", "2 88.0 10 9 3 True \n", "3 80.0 13 8 20 True \n", "4 49.0 13 9 12 True \n", "\n", " rank grouped by insomnia \n", "0 11.0 \n", "1 21.0 \n", "2 23.0 \n", "3 19.0 \n", "4 3.0 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['rank grouped by insomnia'] = grouped['percent correct'].rank(method='first')\n", "\n", "# Take a look\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To verify that this worked correctly, and also to show some nice sorting properties of data frames, we will sort the data frame by `insomnia` and then by `percent correct` and make sure the ranks worked accordingly." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
insomniapercent correctrank grouped by insomnia
81False40.01.0
94False55.02.0
39False57.53.0
76False60.04.0
96False60.05.0
86False62.56.0
101False62.57.0
41False65.08.0
28False67.59.0
50False67.510.0
34False70.011.0
100False70.012.0
25False72.513.0
52False72.514.0
53False72.515.0
87False72.516.0
91False72.517.0
27False75.018.0
49False75.019.0
99False75.020.0
36False77.521.0
43False77.522.0
51False77.523.0
64False77.524.0
68False77.525.0
78False77.526.0
79False77.527.0
85False77.528.0
97False77.529.0
44False80.030.0
............
77False95.073.0
73False97.574.0
33False100.075.0
70False100.076.0
75False100.077.0
23True45.01.0
5True50.02.0
4True62.53.0
10True62.54.0
14True65.05.0
7True67.56.0
15True70.07.0
16True70.08.0
20True70.09.0
24True70.010.0
0True72.511.0
9True75.012.0
19True75.013.0
12True80.014.0
13True80.015.0
17True80.016.0
6True82.517.0
18True85.018.0
3True87.519.0
22True87.520.0
1True90.021.0
8True90.022.0
2True92.523.0
11True95.024.0
21True97.525.0
\n", "

102 rows × 3 columns

\n", "
" ], "text/plain": [ " insomnia percent correct rank grouped by insomnia\n", "81 False 40.0 1.0\n", "94 False 55.0 2.0\n", "39 False 57.5 3.0\n", "76 False 60.0 4.0\n", "96 False 60.0 5.0\n", "86 False 62.5 6.0\n", "101 False 62.5 7.0\n", "41 False 65.0 8.0\n", "28 False 67.5 9.0\n", "50 False 67.5 10.0\n", "34 False 70.0 11.0\n", "100 False 70.0 12.0\n", "25 False 72.5 13.0\n", "52 False 72.5 14.0\n", "53 False 72.5 15.0\n", "87 False 72.5 16.0\n", "91 False 72.5 17.0\n", "27 False 75.0 18.0\n", "49 False 75.0 19.0\n", "99 False 75.0 20.0\n", "36 False 77.5 21.0\n", "43 False 77.5 22.0\n", "51 False 77.5 23.0\n", "64 False 77.5 24.0\n", "68 False 77.5 25.0\n", "78 False 77.5 26.0\n", "79 False 77.5 27.0\n", "85 False 77.5 28.0\n", "97 False 77.5 29.0\n", "44 False 80.0 30.0\n", ".. ... ... ...\n", "77 False 95.0 73.0\n", "73 False 97.5 74.0\n", "33 False 100.0 75.0\n", "70 False 100.0 76.0\n", "75 False 100.0 77.0\n", "23 True 45.0 1.0\n", "5 True 50.0 2.0\n", "4 True 62.5 3.0\n", "10 True 62.5 4.0\n", "14 True 65.0 5.0\n", "7 True 67.5 6.0\n", "15 True 70.0 7.0\n", "16 True 70.0 8.0\n", "20 True 70.0 9.0\n", "24 True 70.0 10.0\n", "0 True 72.5 11.0\n", "9 True 75.0 12.0\n", "19 True 75.0 13.0\n", "12 True 80.0 14.0\n", "13 True 80.0 15.0\n", "17 True 80.0 16.0\n", "6 True 82.5 17.0\n", "18 True 85.0 18.0\n", "3 True 87.5 19.0\n", "22 True 87.5 20.0\n", "1 True 90.0 21.0\n", "8 True 90.0 22.0\n", "2 True 92.5 23.0\n", "11 True 95.0 24.0\n", "21 True 97.5 25.0\n", "\n", "[102 rows x 3 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_sorted = df.sort_values(by=['insomnia', 'percent correct'])\n", "\n", "# Look at relevant columns\n", "df_sorted[['insomnia', 'percent correct', 'rank grouped by insomnia']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indeed it worked!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregating and transforming with custom functions\n", "\n", "Let's say we want to compute the coefficient of variation (CoV, the standard deviation divided by the mean) of data in columns of groups in the data frame. There is no built-in function to do this. We have to write our own function to compute the CoV and then use it with the `agg()` method of a `DataFrameGroupBy` object. In the function below, the values of each column are denoted by `data`.\n", "\n", "To compute the coefficient of variation, we will use one more Numpy function beyond `np.mean()` that you have already seen, `np.std()`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def coeff_of_var(data):\n", " \"\"\"Compute coefficient of variation from an array of data.\"\"\"\n", " return np.std(data) / np.mean(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can apply it as an aggregating function." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
participant numberagecorrect hit percentagecorrect reject percentagepercent correctconfidence when correct hitconfidence when incorrect hitconfidence when correct rejectconfidence when incorrect rejectconfidence when correctconfidence when incorrectscipsqiess
insomnia
False0.5861910.3842620.1667840.1840610.1387850.1959780.3502860.2043120.2982160.1873040.2625090.1752450.5778690.571566
True0.5361170.3138530.2188340.3257600.1718560.1562190.2254400.2228270.2115120.1600610.1974840.3819070.2997410.681514
\n", "
" ], "text/plain": [ " participant number age correct hit percentage \\\n", "insomnia \n", "False 0.586191 0.384262 0.166784 \n", "True 0.536117 0.313853 0.218834 \n", "\n", " correct reject percentage percent correct \\\n", "insomnia \n", "False 0.184061 0.138785 \n", "True 0.325760 0.171856 \n", "\n", " confidence when correct hit confidence when incorrect hit \\\n", "insomnia \n", "False 0.195978 0.350286 \n", "True 0.156219 0.225440 \n", "\n", " confidence when correct reject confidence when incorrect reject \\\n", "insomnia \n", "False 0.204312 0.298216 \n", "True 0.222827 0.211512 \n", "\n", " confidence when correct confidence when incorrect sci \\\n", "insomnia \n", "False 0.187304 0.262509 0.175245 \n", "True 0.160061 0.197484 0.381907 \n", "\n", " psqi ess \n", "insomnia \n", "False 0.577869 0.571566 \n", "True 0.299741 0.681514 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped.agg(coeff_of_var)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can take a similar approach with a transformation. In this case, we will actually do something useful. We will compute the `y`-values we need for plotting an **empirical cumulative distribution function** (ECDF), which we will use in our upcoming lessons with plotting. Briefly, an ECDF for a data set is defined as follows. The ECDF evaluated at *x* for a set of measurements is defined as\n", "\n", "> ECDF(_x_) = fraction of measurements ≤ *x*.\n", "\n", "As you can convince yourself if you work it out, the `y`-values of an ECDF are the rankings of the entries, divided by the total number of entries. We will want to plot ECDFs for each group, so we apply the function for each group. First, we define the function." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def ecdf_y(data):\n", " \"\"\"Give y-values of an ECDF for an unsorted column in a data frame.\"\"\"\n", " return data.rank(method='first') / len(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can compute the ECDF on the data set grouped by insomnia state." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
participant numbergenderagecorrect hit percentagecorrect reject percentagepercent correctconfidence when correct hitconfidence when incorrect hitconfidence when correct rejectconfidence when incorrect rejectconfidence when correctconfidence when incorrectscipsqiessinsomniarank grouped by insomniaecdf_y grouped by insomnia
08f39658072.591.090.093.083.593.090.09132True11.00.44
116m42909090.075.555.570.550.075.050.04117True21.00.84
218f31909592.589.590.086.081.089.088.01093True23.00.92
322f351007587.589.5NaN71.080.088.080.013820True19.00.76
427f74606562.568.549.061.049.065.049.013912True3.00.12
\n", "
" ], "text/plain": [ " participant number gender age correct hit percentage \\\n", "0 8 f 39 65 \n", "1 16 m 42 90 \n", "2 18 f 31 90 \n", "3 22 f 35 100 \n", "4 27 f 74 60 \n", "\n", " correct reject percentage percent correct confidence when correct hit \\\n", "0 80 72.5 91.0 \n", "1 90 90.0 75.5 \n", "2 95 92.5 89.5 \n", "3 75 87.5 89.5 \n", "4 65 62.5 68.5 \n", "\n", " confidence when incorrect hit confidence when correct reject \\\n", "0 90.0 93.0 \n", "1 55.5 70.5 \n", "2 90.0 86.0 \n", "3 NaN 71.0 \n", "4 49.0 61.0 \n", "\n", " confidence when incorrect reject confidence when correct \\\n", "0 83.5 93.0 \n", "1 50.0 75.0 \n", "2 81.0 89.0 \n", "3 80.0 88.0 \n", "4 49.0 65.0 \n", "\n", " confidence when incorrect sci psqi ess insomnia \\\n", "0 90.0 9 13 2 True \n", "1 50.0 4 11 7 True \n", "2 88.0 10 9 3 True \n", "3 80.0 13 8 20 True \n", "4 49.0 13 9 12 True \n", "\n", " rank grouped by insomnia ecdf_y grouped by insomnia \n", "0 11.0 0.44 \n", "1 21.0 0.84 \n", "2 23.0 0.92 \n", "3 19.0 0.76 \n", "4 3.0 0.12 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['ecdf_y grouped by insomnia'] = grouped['percent correct'].transform(ecdf_y)\n", "\n", "# Take a look\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You now have tremendous power in your hands. When your data are tidy, you can rapidly accelerate the ubiquitous split-apply-combine methods. Importantly, you now have a logical framework to think about how you slice and dice your data. As a final, simple example, I will show you how to go start to finish after loading the data set into a data frame, splitting by insomnia and gender, and then getting the quartiles and extrema, in addition to the mean and standard deviation, in the percent of correct facial matchings." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countmeanstdmin25%50%75%max
genderinsomnia
fFalse55.082.04545510.84498255.075.00085.0088.750100.0
True19.073.94736813.62482945.067.50072.5083.75095.0
mFalse22.080.00000012.77087540.077.50082.5089.37595.0
True6.082.91666710.65559367.576.87583.7588.75097.5
\n", "
" ], "text/plain": [ " count mean std min 25% 50% 75% \\\n", "gender insomnia \n", "f False 55.0 82.045455 10.844982 55.0 75.000 85.00 88.750 \n", " True 19.0 73.947368 13.624829 45.0 67.500 72.50 83.750 \n", "m False 22.0 80.000000 12.770875 40.0 77.500 82.50 89.375 \n", " True 6.0 82.916667 10.655593 67.5 76.875 83.75 88.750 \n", "\n", " max \n", "gender insomnia \n", "f False 100.0 \n", " True 95.0 \n", "m False 95.0 \n", " True 97.5 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['gender', 'insomnia'])['percent correct'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yes, that's right. One single, clean, easy to read line of code. In coming tutorials, we will see how to use tidy data to quickly generate plots." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looping over a GroupBy object\n", "\n", "While the `GroupBy` methods we have learned so far (like `transform()` and `agg()`) are useful and lead to concise code, we sometimes want to loop over the groups of a `GroupBy` object. This often comes up in plotting applications, as we will see in future lessons. As an example, I will compute the median percent correct for female and males, insomniacs and not (which we already computed using `describe()`)." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('f', False) : 85.0\n", "('f', True) : 72.5\n", "('m', False) : 82.5\n", "('m', True) : 83.75\n" ] } ], "source": [ "for name, group in df.groupby(['gender', 'insomnia']):\n", " print(name, ': ', group['percent correct'].median())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By using the `GroupBy` object as an iterator, it yields the name of the group (which I assigned as `name`) and the corresponding sub-data frame (which I assigned `group`)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tidying a data set\n", "\n", "You should always organize your data sets in a tidy format. However, this is sometimes just not possible, since you data sets can come from instruments that do not output the data in tidy format (though most do, at least in my experience), and you often have collaborators that send data in untidy formats.\n", "\n", "The most useful function for tidying data is `pd.melt()`. To demonstrate this, I will load in a data set from [this paper by Reeves, et al.](https://doi.org/10.1016/j.devcel.2011.12.007) in which they measured the width of the gradient in the morphogen Dorsal in *Drosophila* embryos for various genotypes using different method. First, let's load in the data set and take a look." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
wtdl1/+; dl-venus/+dl1/+; dl-gfp/+
wholemountscross-sectionsanti-Dorsalanti-VenusVenus (live)anti-Dorsalanti-GFPGFP (live)
00.12880.13270.14820.16320.16660.22480.23890.2412
10.15540.14570.15030.16710.17530.18910.20350.1942
20.13060.14470.15770.17040.17050.17050.19430.2186
30.14130.12820.17110.1779NaN0.17350.20000.2104
40.15570.14870.13420.1483NaN0.21350.25600.2463
\n", "
" ], "text/plain": [ " wt dl1/+; dl-venus/+ \\\n", " wholemounts cross-sections anti-Dorsal anti-Venus Venus (live) \n", "0 0.1288 0.1327 0.1482 0.1632 0.1666 \n", "1 0.1554 0.1457 0.1503 0.1671 0.1753 \n", "2 0.1306 0.1447 0.1577 0.1704 0.1705 \n", "3 0.1413 0.1282 0.1711 0.1779 NaN \n", "4 0.1557 0.1487 0.1342 0.1483 NaN \n", "\n", " dl1/+; dl-gfp/+ \n", " anti-Dorsal anti-GFP GFP (live) \n", "0 0.2248 0.2389 0.2412 \n", "1 0.1891 0.2035 0.1942 \n", "2 0.1705 0.1943 0.2186 \n", "3 0.1735 0.2000 0.2104 \n", "4 0.2135 0.2560 0.2463 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('data/reeves_gradient_width_various_methods.csv', comment='#', header=[0,1])\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As is often the case with spreadsheets, we have a **multiindex**, where we have two methods for wild type, three for Venus, and three for GFP. The rows here have no meaning. Clearly these data are not tidy.\n", "\n", "When we melt the data frame, the data within it, called **values**, become a single column. The headers, called **variables**, also become new columns. So, to melt it, we need to specify what we want to call the values and what we want to call the variable. `pd.melt()` does the rest!" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
genotypemethodgradient width
0wtwholemounts0.1288
1wtwholemounts0.1554
2wtwholemounts0.1306
3wtwholemounts0.1413
4wtwholemounts0.1557
\n", "
" ], "text/plain": [ " genotype method gradient width\n", "0 wt wholemounts 0.1288\n", "1 wt wholemounts 0.1554\n", "2 wt wholemounts 0.1306\n", "3 wt wholemounts 0.1413\n", "4 wt wholemounts 0.1557" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.melt(df, value_name='gradient width', var_name=['genotype', 'method'])\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nice! We now have a tidy data frame. The only problem is that it has many NaNs because there were many of them in the data set. We can now safely delete any row that has a NaN." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "df = df.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "...and we are good to do with a tidy DataFrame! Let's take a look at the summary." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gradient width
countmeanstdmin25%50%75%max
genotypemethod
dl1/+; dl-gfp/+GFP (live)6.00.2187830.0210520.19420.2041000.214500.2355500.2463
anti-Dorsal20.00.1983000.0244680.16400.1776250.195400.2137750.2516
anti-GFP20.00.2261200.0246640.19340.2039500.226200.2415500.2682
dl1/+; dl-venus/+Venus (live)3.00.1708000.0043580.16660.1685500.170500.1729000.1753
anti-Dorsal29.00.1604900.0108050.13420.1546000.160100.1701000.1773
anti-Venus29.00.1730310.0087420.14830.1701000.173900.1780000.1902
wtcross-sections152.00.1484580.0161410.10140.1367250.146950.1599250.1805
wholemounts24.00.1452790.0190590.10960.1304500.144600.1582000.1779
\n", "
" ], "text/plain": [ " gradient width \\\n", " count mean std min \n", "genotype method \n", "dl1/+; dl-gfp/+ GFP (live) 6.0 0.218783 0.021052 0.1942 \n", " anti-Dorsal 20.0 0.198300 0.024468 0.1640 \n", " anti-GFP 20.0 0.226120 0.024664 0.1934 \n", "dl1/+; dl-venus/+ Venus (live) 3.0 0.170800 0.004358 0.1666 \n", " anti-Dorsal 29.0 0.160490 0.010805 0.1342 \n", " anti-Venus 29.0 0.173031 0.008742 0.1483 \n", "wt cross-sections 152.0 0.148458 0.016141 0.1014 \n", " wholemounts 24.0 0.145279 0.019059 0.1096 \n", "\n", " \n", " 25% 50% 75% max \n", "genotype method \n", "dl1/+; dl-gfp/+ GFP (live) 0.204100 0.21450 0.235550 0.2463 \n", " anti-Dorsal 0.177625 0.19540 0.213775 0.2516 \n", " anti-GFP 0.203950 0.22620 0.241550 0.2682 \n", "dl1/+; dl-venus/+ Venus (live) 0.168550 0.17050 0.172900 0.1753 \n", " anti-Dorsal 0.154600 0.16010 0.170100 0.1773 \n", " anti-Venus 0.170100 0.17390 0.178000 0.1902 \n", "wt cross-sections 0.136725 0.14695 0.159925 0.1805 \n", " wholemounts 0.130450 0.14460 0.158200 0.1779 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.groupby(['genotype', 'method']).describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can immediately see that the wild type has a narrower gradient than Venus, which is narrower still than GFP." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing environment" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPython 3.7.7\n", "IPython 7.15.0\n", "\n", "numpy 1.18.1\n", "pandas 0.24.2\n", "jupyterlab 2.1.4\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -v -p numpy,pandas,jupyterlab" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }