{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "9060da5d", "metadata": {}, "source": [ "Groups & Aggregates\n", "===================\n", "\n", "In this notebook we look at how to change the level that we use to analyze data by **grouping** multiple\n", "rows into a single row. This `groupby` function allows us to look at the school data by district,\n", "borough, or academic year. To group data, you specifiy the column name or column names for the group. All unique names or combinations will become a single row.\n", "\n", "The _other_ rows of data need to be **aggregated** in some way. Common aggregate functions include `sum` (e.g. find the total enrolled students in Brooklyn), `count` (find unique schools in Queens), and `mean` (find the average percent poverty of students in the Bronx)." ] }, { "cell_type": "code", "execution_count": 1, "id": "26694bf0", "metadata": {}, "outputs": [], "source": [ "from nycschools import schools\n", "\n", "df = schools.load_school_demographics()" ] }, { "cell_type": "markdown", "id": "21e2d414", "metadata": {}, "source": [ "Total Enrollment by Boro\n", "------------------------------------\n", "For this example, we create a simplified data set that only has 2 columns:\n", "`boro` and `total_enrollment`. We also limit it to only the 2020 academic year.\n", "\n", "We call the `groupby()` function of `DataFrame`and pass in \"boro\" as the column name for the group.\n", "From there, we call `.sum()` on the resulting GroupBy object. This will sum up the enrollment in each school,\n", "in each borough.\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "b4706c22", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_enrollment
boro
Bronx238074
Brooklyn346802
Manhattan194144
Queens305660
Staten Island63824
\n", "
" ], "text/plain": [ " total_enrollment\n", "boro \n", "Bronx 238074\n", "Brooklyn 346802\n", "Manhattan 194144\n", "Queens 305660\n", "Staten Island 63824" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "boros = df[df.ay == 2020]\n", "boros = boros[[\"boro\", \"total_enrollment\"]]\n", "\n", "boros.groupby(\"boro\").sum()\n" ] }, { "cell_type": "markdown", "id": "bb86c5f0", "metadata": {}, "source": [ "We see in th results that Brooklyn had the most students enrolled in 2020 (n=346802) and Staten Island\n", "had the fewest (n=63824).\n", "\n", "We also notice that `total_enrolmment` doesn't appear on the same level as `boro`. If we want to be able to treat `total_enrollment` as a \"normal\" column, we need to call `reset_index()` after our aggregate function.\n", "\n", "Like this:" ] }, { "cell_type": "code", "execution_count": 3, "id": "c8c9747b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
borototal_enrollment
0Bronx238074
1Brooklyn346802
2Manhattan194144
3Queens305660
4Staten Island63824
\n", "
" ], "text/plain": [ " boro total_enrollment\n", "0 Bronx 238074\n", "1 Brooklyn 346802\n", "2 Manhattan 194144\n", "3 Queens 305660\n", "4 Staten Island 63824" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "boros.groupby(\"boro\").sum().reset_index()" ] }, { "cell_type": "markdown", "id": "195ac615", "metadata": {}, "source": [ "Using a `dict` of aggregate columns\n", "----------------------------------------------------\n", "Now, we might have a few more questions about this simple measure of school enrollment. For example, what was the average enrollment, and how many different schools are in each borough?\n", "\n", "We can answer these questions by running by adding more columns (namely `dbn`) to our data and using different aggregate functions. In this case we will use `sum`, `count`, `mean`, `min`, and `max`.\n", "\n", "To accomplish this, we call the `agg` function on the results of the groupby and pass in a dictionary\n", "where the column names are keys, and the aggregate function(s) are the values. If we want only a single agg \n", "function, we can use that as the value, but if we want multiple functions, we can set a list of strings\n", "as the value." ] }, { "cell_type": "code", "execution_count": 4, "id": "5bef5111", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dbntotal_enrollment
countsummeanminmax
boro
Bronx496238074479.98790372936
Brooklyn640346802541.878125465921
Manhattan408194144475.843137143342
Queens421305660726.033254424424
Staten Island8363824768.9638551443726
\n", "
" ], "text/plain": [ " dbn total_enrollment \n", " count sum mean min max\n", "boro \n", "Bronx 496 238074 479.987903 7 2936\n", "Brooklyn 640 346802 541.878125 46 5921\n", "Manhattan 408 194144 475.843137 14 3342\n", "Queens 421 305660 726.033254 42 4424\n", "Staten Island 83 63824 768.963855 144 3726" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "boros = df[df.ay == 2020]\n", "boros = boros[[\"boro\", \"dbn\", \"total_enrollment\"]]\n", "boros = boros.groupby(\"boro\").agg({\n", " \"dbn\":\"count\", \n", " \"total_enrollment\":[\"sum\", \"mean\", \"min\", \"max\"]})\n", "boros" ] }, { "cell_type": "markdown", "id": "d4e8caf5", "metadata": {}, "source": [ "Resetting the index\n", "----------------------------\n", "In the results above, we say that the aggregate columns do not appear in the same row as the grouped column, boro. Also, there are 2 levels to the aggregate index because we have 4 columns for total_enrollment. Sometimes we want to convert these columnbs back into a flat index. This makes it easier to filter, merge, and sort the data and to access specific Series in the DataFrame. Unfortunately, we can't simply call `reset_index()` like we did above.\n", "\n", "Here after we reset the index, we rename all of the columns with a new list of strings. The length of the new list musth match the number of columns." ] }, { "cell_type": "code", "execution_count": 5, "id": "45127262", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BoroughNum. SchoolsTotal StudentsAvg School SizeSmallest SchoolLargest School
0Bronx496238074479.98790372936
1Brooklyn640346802541.878125465921
2Manhattan408194144475.843137143342
3Queens421305660726.033254424424
4Staten Island8363824768.9638551443726
\n", "
" ], "text/plain": [ " Borough Num. Schools Total Students Avg School Size \\\n", "0 Bronx 496 238074 479.987903 \n", "1 Brooklyn 640 346802 541.878125 \n", "2 Manhattan 408 194144 475.843137 \n", "3 Queens 421 305660 726.033254 \n", "4 Staten Island 83 63824 768.963855 \n", "\n", " Smallest School Largest School \n", "0 7 2936 \n", "1 46 5921 \n", "2 14 3342 \n", "3 42 4424 \n", "4 144 3726 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "boros = df[df.ay == 2020]\n", "boros = boros[[\"boro\", \"dbn\", \"total_enrollment\"]]\n", "table = boros.groupby(by=\"boro\").agg({\n", " \"dbn\":\"count\", \n", " \"total_enrollment\":[\"sum\", \"mean\", \"min\", \"max\"]}).reset_index()\n", "\n", "cols = [\"Borough\", \"Num. Schools\", \"Total Students\", \"Avg School Size\", \"Smallest School\", \"Largest School\"]\n", "table.columns = cols\n", "table" ] }, { "cell_type": "markdown", "id": "8622f9b3", "metadata": {}, "source": [ "Grouping by Multiple Columns\n", "--------------------------------------------\n", "Sometimes we might want to pass in more than one column as the group by index. In this example we will group our data by borough and whether they are a charter school or not.\n", "\n", "Note a couple of things in this short example:\n", "\n", "1. We call `copy()` on `df` because we are adding a new column and don't want to mess with the original data frame\n", "2. We use a lambda function in apply, and we use a special type of if/else statement usually called a \"ternary\" operation\n", "3. Because we're doing several agg functions, we will create them as a dict _before_ we call `agg()`. This is just to make our code easier to read and maintain\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "8ae475e3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
boroschool typedbnasian_pctblack_pcthispanic_pctwhite_pct
0Bronxcharter890.0229630.3778730.5690220.011857
1Bronxpublic4070.0413880.2577670.6451110.039413
2Brooklyncharter1020.0291570.6006080.2687450.071765
3Brooklynpublic5380.1214130.4089240.3068460.137351
4Manhattancharter570.0262810.4454040.4630350.037228
5Manhattanpublic3510.1067210.2081110.4672080.176647
6Queenscharter280.0923210.4020000.4069290.066893
7Queenspublic3930.2627120.1978730.3802750.119517
8Staten Islandcharter60.0213330.2585000.4376670.237333
9Staten Islandpublic770.1165190.1429090.3196750.392117
\n", "
" ], "text/plain": [ " boro school type dbn asian_pct black_pct hispanic_pct \\\n", "0 Bronx charter 89 0.022963 0.377873 0.569022 \n", "1 Bronx public 407 0.041388 0.257767 0.645111 \n", "2 Brooklyn charter 102 0.029157 0.600608 0.268745 \n", "3 Brooklyn public 538 0.121413 0.408924 0.306846 \n", "4 Manhattan charter 57 0.026281 0.445404 0.463035 \n", "5 Manhattan public 351 0.106721 0.208111 0.467208 \n", "6 Queens charter 28 0.092321 0.402000 0.406929 \n", "7 Queens public 393 0.262712 0.197873 0.380275 \n", "8 Staten Island charter 6 0.021333 0.258500 0.437667 \n", "9 Staten Island public 77 0.116519 0.142909 0.319675 \n", "\n", " white_pct \n", "0 0.011857 \n", "1 0.039413 \n", "2 0.071765 \n", "3 0.137351 \n", "4 0.037228 \n", "5 0.176647 \n", "6 0.066893 \n", "7 0.119517 \n", "8 0.237333 \n", "9 0.392117 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# because we're adding data to our subset, we're going to call copy()\n", "data = df[df.ay == 2020].copy()\n", "data[\"school type\"] = data.district.apply(lambda x: \"charter\" if x == 84 else \"public\")\n", "data = data[[\"boro\",\"school type\",\"dbn\",\"asian_pct\",\"black_pct\",\"hispanic_pct\",\"white_pct\"]]\n", "aggs = {\n", " \"dbn\": \"count\",\n", " \"asian_pct\": \"mean\",\n", " \"black_pct\": \"mean\",\n", " \"hispanic_pct\": \"mean\",\n", " \"white_pct\": \"mean\"\n", "}\n", "\n", "\n", "data.groupby([\"boro\", \"school type\"]).agg(aggs).reset_index()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.10.6 ('school-data')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "vscode": { "interpreter": { "hash": "c853444e20c489e5b96d8e1a4533affead1d94f1ba40ff9ef08cffb9c8ee794e" } } }, "nbformat": 4, "nbformat_minor": 5 }