Groups & Aggregates
Contents
Groups & Aggregates#
In this notebook we look at how to change the level that we use to analyze data by grouping multiple
rows into a single row. This groupby
function allows us to look at the school data by district,
borough, or academic year. To group data, you specifiy the column name or column names for the group. All unique names or combinations will become a single row.
The other rows of data need to be aggregated in some way. Common aggregate functions include sum
(e.g. find the total enrolled students in Brooklyn), count
(find unique schools in Queens), and mean
(find the average percent poverty of students in the Bronx).
from nycschools import schools
df = schools.load_school_demographics()
Total Enrollment by Boro#
For this example, we create a simplified data set that only has 2 columns:
boro
and total_enrollment
. We also limit it to only the 2020 academic year.
We call the groupby()
function of DataFrame
and pass in “boro” as the column name for the group.
From there, we call .sum()
on the resulting GroupBy object. This will sum up the enrollment in each school,
in each borough.
boros = df[df.ay == 2020]
boros = boros[["boro", "total_enrollment"]]
boros.groupby("boro").sum()
total_enrollment | |
---|---|
boro | |
Bronx | 238074 |
Brooklyn | 346802 |
Manhattan | 194144 |
Queens | 305660 |
Staten Island | 63824 |
We see in th results that Brooklyn had the most students enrolled in 2020 (n=346802) and Staten Island had the fewest (n=63824).
We also notice that total_enrolmment
doesn’t appear on the same level as boro
. If we want to be able to treat total_enrollment
as a “normal” column, we need to call reset_index()
after our aggregate function.
Like this:
boros.groupby("boro").sum().reset_index()
boro | total_enrollment | |
---|---|---|
0 | Bronx | 238074 |
1 | Brooklyn | 346802 |
2 | Manhattan | 194144 |
3 | Queens | 305660 |
4 | Staten Island | 63824 |
Using a dict
of aggregate columns#
Now, we might have a few more questions about this simple measure of school enrollment. For example, what was the average enrollment, and how many different schools are in each borough?
We can answer these questions by running by adding more columns (namely dbn
) to our data and using different aggregate functions. In this case we will use sum
, count
, mean
, min
, and max
.
To accomplish this, we call the agg
function on the results of the groupby and pass in a dictionary
where the column names are keys, and the aggregate function(s) are the values. If we want only a single agg
function, we can use that as the value, but if we want multiple functions, we can set a list of strings
as the value.
boros = df[df.ay == 2020]
boros = boros[["boro", "dbn", "total_enrollment"]]
boros = boros.groupby("boro").agg({
"dbn":"count",
"total_enrollment":["sum", "mean", "min", "max"]})
boros
dbn | total_enrollment | ||||
---|---|---|---|---|---|
count | sum | mean | min | max | |
boro | |||||
Bronx | 496 | 238074 | 479.987903 | 7 | 2936 |
Brooklyn | 640 | 346802 | 541.878125 | 46 | 5921 |
Manhattan | 408 | 194144 | 475.843137 | 14 | 3342 |
Queens | 421 | 305660 | 726.033254 | 42 | 4424 |
Staten Island | 83 | 63824 | 768.963855 | 144 | 3726 |
Resetting the index#
In the results above, we say that the aggregate columns do not appear in the same row as the grouped column, boro. Also, there are 2 levels to the aggregate index because we have 4 columns for total_enrollment. Sometimes we want to convert these columnbs back into a flat index. This makes it easier to filter, merge, and sort the data and to access specific Series in the DataFrame. Unfortunately, we can’t simply call reset_index()
like we did above.
Here after we reset the index, we rename all of the columns with a new list of strings. The length of the new list musth match the number of columns.
boros = df[df.ay == 2020]
boros = boros[["boro", "dbn", "total_enrollment"]]
table = boros.groupby(by="boro").agg({
"dbn":"count",
"total_enrollment":["sum", "mean", "min", "max"]}).reset_index()
cols = ["Borough", "Num. Schools", "Total Students", "Avg School Size", "Smallest School", "Largest School"]
table.columns = cols
table
Borough | Num. Schools | Total Students | Avg School Size | Smallest School | Largest School | |
---|---|---|---|---|---|---|
0 | Bronx | 496 | 238074 | 479.987903 | 7 | 2936 |
1 | Brooklyn | 640 | 346802 | 541.878125 | 46 | 5921 |
2 | Manhattan | 408 | 194144 | 475.843137 | 14 | 3342 |
3 | Queens | 421 | 305660 | 726.033254 | 42 | 4424 |
4 | Staten Island | 83 | 63824 | 768.963855 | 144 | 3726 |
Grouping by Multiple Columns#
Sometimes we might want to pass in more than one column as the group by index. In this example we will group our data by borough and whether they are a charter school or not.
Note a couple of things in this short example:
We call
copy()
ondf
because we are adding a new column and don’t want to mess with the original data frameWe use a lambda function in apply, and we use a special type of if/else statement usually called a “ternary” operation
Because we’re doing several agg functions, we will create them as a dict before we call
agg()
. This is just to make our code easier to read and maintain
# because we're adding data to our subset, we're going to call copy()
data = df[df.ay == 2020].copy()
data["school type"] = data.district.apply(lambda x: "charter" if x == 84 else "public")
data = data[["boro","school type","dbn","asian_pct","black_pct","hispanic_pct","white_pct"]]
aggs = {
"dbn": "count",
"asian_pct": "mean",
"black_pct": "mean",
"hispanic_pct": "mean",
"white_pct": "mean"
}
data.groupby(["boro", "school type"]).agg(aggs).reset_index()
boro | school type | dbn | asian_pct | black_pct | hispanic_pct | white_pct | |
---|---|---|---|---|---|---|---|
0 | Bronx | charter | 89 | 0.022963 | 0.377873 | 0.569022 | 0.011857 |
1 | Bronx | public | 407 | 0.041388 | 0.257767 | 0.645111 | 0.039413 |
2 | Brooklyn | charter | 102 | 0.029157 | 0.600608 | 0.268745 | 0.071765 |
3 | Brooklyn | public | 538 | 0.121413 | 0.408924 | 0.306846 | 0.137351 |
4 | Manhattan | charter | 57 | 0.026281 | 0.445404 | 0.463035 | 0.037228 |
5 | Manhattan | public | 351 | 0.106721 | 0.208111 | 0.467208 | 0.176647 |
6 | Queens | charter | 28 | 0.092321 | 0.402000 | 0.406929 | 0.066893 |
7 | Queens | public | 393 | 0.262712 | 0.197873 | 0.380275 | 0.119517 |
8 | Staten Island | charter | 6 | 0.021333 | 0.258500 | 0.437667 | 0.237333 |
9 | Staten Island | public | 77 | 0.116519 | 0.142909 | 0.319675 | 0.392117 |