{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "9060da5d",
   "metadata": {},
   "source": [
    "Groups & Aggregates\n",
    "===================\n",
    "\n",
    "In this notebook we look at how to change the level that we use to analyze data by **grouping** multiple\n",
    "rows into a single row. This `groupby` function allows us to look at the school data by district,\n",
    "borough, or academic year. To group data, you specifiy the column name or column names for the group. All unique names or combinations will become a single row.\n",
    "\n",
    "The _other_ rows of data need to be **aggregated** in some way. Common aggregate functions include `sum` (e.g. find the total enrolled students in Brooklyn), `count` (find unique schools in Queens), and `mean` (find the average percent poverty of students in the Bronx)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "26694bf0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from nycschools import schools\n",
    "\n",
    "df = schools.load_school_demographics()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21e2d414",
   "metadata": {},
   "source": [
    "Total Enrollment by Boro\n",
    "------------------------------------\n",
    "For this example, we create a simplified data set that only has 2 columns:\n",
    "`boro` and `total_enrollment`. We also limit it to only the 2020 academic year.\n",
    "\n",
    "We call the `groupby()` function of `DataFrame`and pass in \"boro\" as the column name for the group.\n",
    "From there, we call `.sum()` on the resulting GroupBy object. This will sum up the enrollment in each school,\n",
    "in each borough.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b4706c22",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>total_enrollment</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>boro</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Bronx</th>\n",
       "      <td>238074</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Brooklyn</th>\n",
       "      <td>346802</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Manhattan</th>\n",
       "      <td>194144</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Queens</th>\n",
       "      <td>305660</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Staten Island</th>\n",
       "      <td>63824</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               total_enrollment\n",
       "boro                           \n",
       "Bronx                    238074\n",
       "Brooklyn                 346802\n",
       "Manhattan                194144\n",
       "Queens                   305660\n",
       "Staten Island             63824"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "boros = df[df.ay == 2020]\n",
    "boros = boros[[\"boro\", \"total_enrollment\"]]\n",
    "\n",
    "boros.groupby(\"boro\").sum()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb86c5f0",
   "metadata": {},
   "source": [
    "We see in th results that Brooklyn had the most students enrolled in 2020 (n=346802) and Staten Island\n",
    "had the fewest (n=63824).\n",
    "\n",
    "We also notice that `total_enrolmment` doesn't appear on the same level as `boro`. If we want to be able to treat `total_enrollment` as a \"normal\" column, we need to call `reset_index()` after our aggregate function.\n",
    "\n",
    "Like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c8c9747b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>boro</th>\n",
       "      <th>total_enrollment</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bronx</td>\n",
       "      <td>238074</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Brooklyn</td>\n",
       "      <td>346802</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Manhattan</td>\n",
       "      <td>194144</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Queens</td>\n",
       "      <td>305660</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Staten Island</td>\n",
       "      <td>63824</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            boro  total_enrollment\n",
       "0          Bronx            238074\n",
       "1       Brooklyn            346802\n",
       "2      Manhattan            194144\n",
       "3         Queens            305660\n",
       "4  Staten Island             63824"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "boros.groupby(\"boro\").sum().reset_index()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "195ac615",
   "metadata": {},
   "source": [
    "Using a `dict` of aggregate columns\n",
    "----------------------------------------------------\n",
    "Now, we might have a few more questions about this simple measure of school enrollment. For example, what was the average enrollment, and how many different schools are in each borough?\n",
    "\n",
    "We can answer these questions by running by adding more columns (namely `dbn`) to our data and using different aggregate functions. In this case we will use `sum`, `count`, `mean`, `min`, and `max`.\n",
    "\n",
    "To accomplish this, we call the `agg` function on the results of the groupby and pass in a dictionary\n",
    "where the column names are keys, and the aggregate function(s) are the values. If we want only a single agg \n",
    "function, we can use that as the value, but if we want multiple functions, we can set a list of strings\n",
    "as the value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "5bef5111",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>dbn</th>\n",
       "      <th colspan=\"4\" halign=\"left\">total_enrollment</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "      <th>sum</th>\n",
       "      <th>mean</th>\n",
       "      <th>min</th>\n",
       "      <th>max</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>boro</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Bronx</th>\n",
       "      <td>496</td>\n",
       "      <td>238074</td>\n",
       "      <td>479.987903</td>\n",
       "      <td>7</td>\n",
       "      <td>2936</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Brooklyn</th>\n",
       "      <td>640</td>\n",
       "      <td>346802</td>\n",
       "      <td>541.878125</td>\n",
       "      <td>46</td>\n",
       "      <td>5921</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Manhattan</th>\n",
       "      <td>408</td>\n",
       "      <td>194144</td>\n",
       "      <td>475.843137</td>\n",
       "      <td>14</td>\n",
       "      <td>3342</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Queens</th>\n",
       "      <td>421</td>\n",
       "      <td>305660</td>\n",
       "      <td>726.033254</td>\n",
       "      <td>42</td>\n",
       "      <td>4424</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Staten Island</th>\n",
       "      <td>83</td>\n",
       "      <td>63824</td>\n",
       "      <td>768.963855</td>\n",
       "      <td>144</td>\n",
       "      <td>3726</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                dbn total_enrollment                       \n",
       "              count              sum        mean  min   max\n",
       "boro                                                       \n",
       "Bronx           496           238074  479.987903    7  2936\n",
       "Brooklyn        640           346802  541.878125   46  5921\n",
       "Manhattan       408           194144  475.843137   14  3342\n",
       "Queens          421           305660  726.033254   42  4424\n",
       "Staten Island    83            63824  768.963855  144  3726"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "boros = df[df.ay == 2020]\n",
    "boros = boros[[\"boro\", \"dbn\", \"total_enrollment\"]]\n",
    "boros = boros.groupby(\"boro\").agg({\n",
    "    \"dbn\":\"count\", \n",
    "    \"total_enrollment\":[\"sum\", \"mean\", \"min\", \"max\"]})\n",
    "boros"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4e8caf5",
   "metadata": {},
   "source": [
    "Resetting the index\n",
    "----------------------------\n",
    "In the results above, we say that the aggregate columns do not appear in the same row as the grouped column, boro. Also, there are 2 levels to the aggregate index because we have 4 columns for total_enrollment. Sometimes we want to convert these columnbs back into a flat index. This makes it easier to filter, merge, and sort the data and to access specific Series in the DataFrame. Unfortunately, we can't simply call `reset_index()` like we did above.\n",
    "\n",
    "Here after we reset the index, we rename all of the columns with a new list of strings. The length of the new list musth match the number of columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "45127262",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Borough</th>\n",
       "      <th>Num. Schools</th>\n",
       "      <th>Total Students</th>\n",
       "      <th>Avg School Size</th>\n",
       "      <th>Smallest School</th>\n",
       "      <th>Largest School</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bronx</td>\n",
       "      <td>496</td>\n",
       "      <td>238074</td>\n",
       "      <td>479.987903</td>\n",
       "      <td>7</td>\n",
       "      <td>2936</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Brooklyn</td>\n",
       "      <td>640</td>\n",
       "      <td>346802</td>\n",
       "      <td>541.878125</td>\n",
       "      <td>46</td>\n",
       "      <td>5921</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Manhattan</td>\n",
       "      <td>408</td>\n",
       "      <td>194144</td>\n",
       "      <td>475.843137</td>\n",
       "      <td>14</td>\n",
       "      <td>3342</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Queens</td>\n",
       "      <td>421</td>\n",
       "      <td>305660</td>\n",
       "      <td>726.033254</td>\n",
       "      <td>42</td>\n",
       "      <td>4424</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Staten Island</td>\n",
       "      <td>83</td>\n",
       "      <td>63824</td>\n",
       "      <td>768.963855</td>\n",
       "      <td>144</td>\n",
       "      <td>3726</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Borough  Num. Schools  Total Students  Avg School Size  \\\n",
       "0          Bronx           496          238074       479.987903   \n",
       "1       Brooklyn           640          346802       541.878125   \n",
       "2      Manhattan           408          194144       475.843137   \n",
       "3         Queens           421          305660       726.033254   \n",
       "4  Staten Island            83           63824       768.963855   \n",
       "\n",
       "   Smallest School  Largest School  \n",
       "0                7            2936  \n",
       "1               46            5921  \n",
       "2               14            3342  \n",
       "3               42            4424  \n",
       "4              144            3726  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "boros = df[df.ay == 2020]\n",
    "boros = boros[[\"boro\", \"dbn\", \"total_enrollment\"]]\n",
    "table = boros.groupby(by=\"boro\").agg({\n",
    "    \"dbn\":\"count\", \n",
    "    \"total_enrollment\":[\"sum\", \"mean\", \"min\", \"max\"]}).reset_index()\n",
    "\n",
    "cols = [\"Borough\", \"Num. Schools\", \"Total Students\", \"Avg School Size\", \"Smallest School\", \"Largest School\"]\n",
    "table.columns = cols\n",
    "table"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8622f9b3",
   "metadata": {},
   "source": [
    "Grouping by Multiple Columns\n",
    "--------------------------------------------\n",
    "Sometimes we might want to pass in more than one column as the group by index. In this example we will group our data by borough and whether they are a charter school or not.\n",
    "\n",
    "Note a couple of things in this short example:\n",
    "\n",
    "1. We call `copy()` on `df` because we are adding a new column and don't want to mess with the original data frame\n",
    "2. We use a lambda function in apply, and we use a special type of if/else statement usually called a \"ternary\" operation\n",
    "3. Because we're doing several agg functions, we will create them as a dict _before_ we call `agg()`. This is just to make our code easier to read and maintain\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "8ae475e3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>boro</th>\n",
       "      <th>school type</th>\n",
       "      <th>dbn</th>\n",
       "      <th>asian_pct</th>\n",
       "      <th>black_pct</th>\n",
       "      <th>hispanic_pct</th>\n",
       "      <th>white_pct</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Bronx</td>\n",
       "      <td>charter</td>\n",
       "      <td>89</td>\n",
       "      <td>0.022963</td>\n",
       "      <td>0.377873</td>\n",
       "      <td>0.569022</td>\n",
       "      <td>0.011857</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Bronx</td>\n",
       "      <td>public</td>\n",
       "      <td>407</td>\n",
       "      <td>0.041388</td>\n",
       "      <td>0.257767</td>\n",
       "      <td>0.645111</td>\n",
       "      <td>0.039413</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Brooklyn</td>\n",
       "      <td>charter</td>\n",
       "      <td>102</td>\n",
       "      <td>0.029157</td>\n",
       "      <td>0.600608</td>\n",
       "      <td>0.268745</td>\n",
       "      <td>0.071765</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Brooklyn</td>\n",
       "      <td>public</td>\n",
       "      <td>538</td>\n",
       "      <td>0.121413</td>\n",
       "      <td>0.408924</td>\n",
       "      <td>0.306846</td>\n",
       "      <td>0.137351</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Manhattan</td>\n",
       "      <td>charter</td>\n",
       "      <td>57</td>\n",
       "      <td>0.026281</td>\n",
       "      <td>0.445404</td>\n",
       "      <td>0.463035</td>\n",
       "      <td>0.037228</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Manhattan</td>\n",
       "      <td>public</td>\n",
       "      <td>351</td>\n",
       "      <td>0.106721</td>\n",
       "      <td>0.208111</td>\n",
       "      <td>0.467208</td>\n",
       "      <td>0.176647</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Queens</td>\n",
       "      <td>charter</td>\n",
       "      <td>28</td>\n",
       "      <td>0.092321</td>\n",
       "      <td>0.402000</td>\n",
       "      <td>0.406929</td>\n",
       "      <td>0.066893</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Queens</td>\n",
       "      <td>public</td>\n",
       "      <td>393</td>\n",
       "      <td>0.262712</td>\n",
       "      <td>0.197873</td>\n",
       "      <td>0.380275</td>\n",
       "      <td>0.119517</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Staten Island</td>\n",
       "      <td>charter</td>\n",
       "      <td>6</td>\n",
       "      <td>0.021333</td>\n",
       "      <td>0.258500</td>\n",
       "      <td>0.437667</td>\n",
       "      <td>0.237333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Staten Island</td>\n",
       "      <td>public</td>\n",
       "      <td>77</td>\n",
       "      <td>0.116519</td>\n",
       "      <td>0.142909</td>\n",
       "      <td>0.319675</td>\n",
       "      <td>0.392117</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            boro school type  dbn  asian_pct  black_pct  hispanic_pct  \\\n",
       "0          Bronx     charter   89   0.022963   0.377873      0.569022   \n",
       "1          Bronx      public  407   0.041388   0.257767      0.645111   \n",
       "2       Brooklyn     charter  102   0.029157   0.600608      0.268745   \n",
       "3       Brooklyn      public  538   0.121413   0.408924      0.306846   \n",
       "4      Manhattan     charter   57   0.026281   0.445404      0.463035   \n",
       "5      Manhattan      public  351   0.106721   0.208111      0.467208   \n",
       "6         Queens     charter   28   0.092321   0.402000      0.406929   \n",
       "7         Queens      public  393   0.262712   0.197873      0.380275   \n",
       "8  Staten Island     charter    6   0.021333   0.258500      0.437667   \n",
       "9  Staten Island      public   77   0.116519   0.142909      0.319675   \n",
       "\n",
       "   white_pct  \n",
       "0   0.011857  \n",
       "1   0.039413  \n",
       "2   0.071765  \n",
       "3   0.137351  \n",
       "4   0.037228  \n",
       "5   0.176647  \n",
       "6   0.066893  \n",
       "7   0.119517  \n",
       "8   0.237333  \n",
       "9   0.392117  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# because we're adding data to our subset, we're going to call copy()\n",
    "data = df[df.ay == 2020].copy()\n",
    "data[\"school type\"] = data.district.apply(lambda x: \"charter\" if x == 84 else \"public\")\n",
    "data = data[[\"boro\",\"school type\",\"dbn\",\"asian_pct\",\"black_pct\",\"hispanic_pct\",\"white_pct\"]]\n",
    "aggs = {\n",
    "    \"dbn\": \"count\",\n",
    "    \"asian_pct\": \"mean\",\n",
    "    \"black_pct\": \"mean\",\n",
    "    \"hispanic_pct\": \"mean\",\n",
    "    \"white_pct\": \"mean\"\n",
    "}\n",
    "\n",
    "\n",
    "data.groupby([\"boro\", \"school type\"]).agg(aggs).reset_index()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.10.6 ('school-data')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  },
  "vscode": {
   "interpreter": {
    "hash": "c853444e20c489e5b96d8e1a4533affead1d94f1ba40ff9ef08cffb9c8ee794e"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}