Groups & Aggregates#

In this notebook we look at how to change the level that we use to analyze data by grouping multiple rows into a single row. This groupby function allows us to look at the school data by district, borough, or academic year. To group data, you specifiy the column name or column names for the group. All unique names or combinations will become a single row.

The other rows of data need to be aggregated in some way. Common aggregate functions include sum (e.g. find the total enrolled students in Brooklyn), count (find unique schools in Queens), and mean (find the average percent poverty of students in the Bronx).

from nycschools import schools

df = schools.load_school_demographics()

Total Enrollment by Boro#

For this example, we create a simplified data set that only has 2 columns: boro and total_enrollment. We also limit it to only the 2020 academic year.

We call the groupby() function of DataFrameand pass in “boro” as the column name for the group. From there, we call .sum() on the resulting GroupBy object. This will sum up the enrollment in each school, in each borough.

boros = df[df.ay == 2020]
boros = boros[["boro", "total_enrollment"]]

boros.groupby("boro").sum()
total_enrollment
boro
Bronx 238074
Brooklyn 346802
Manhattan 194144
Queens 305660
Staten Island 63824

We see in th results that Brooklyn had the most students enrolled in 2020 (n=346802) and Staten Island had the fewest (n=63824).

We also notice that total_enrolmment doesn’t appear on the same level as boro. If we want to be able to treat total_enrollment as a “normal” column, we need to call reset_index() after our aggregate function.

Like this:

boros.groupby("boro").sum().reset_index()
boro total_enrollment
0 Bronx 238074
1 Brooklyn 346802
2 Manhattan 194144
3 Queens 305660
4 Staten Island 63824

Using a dict of aggregate columns#

Now, we might have a few more questions about this simple measure of school enrollment. For example, what was the average enrollment, and how many different schools are in each borough?

We can answer these questions by running by adding more columns (namely dbn) to our data and using different aggregate functions. In this case we will use sum, count, mean, min, and max.

To accomplish this, we call the agg function on the results of the groupby and pass in a dictionary where the column names are keys, and the aggregate function(s) are the values. If we want only a single agg function, we can use that as the value, but if we want multiple functions, we can set a list of strings as the value.

boros = df[df.ay == 2020]
boros = boros[["boro", "dbn", "total_enrollment"]]
boros = boros.groupby("boro").agg({
    "dbn":"count", 
    "total_enrollment":["sum", "mean", "min", "max"]})
boros
dbn total_enrollment
count sum mean min max
boro
Bronx 496 238074 479.987903 7 2936
Brooklyn 640 346802 541.878125 46 5921
Manhattan 408 194144 475.843137 14 3342
Queens 421 305660 726.033254 42 4424
Staten Island 83 63824 768.963855 144 3726

Resetting the index#

In the results above, we say that the aggregate columns do not appear in the same row as the grouped column, boro. Also, there are 2 levels to the aggregate index because we have 4 columns for total_enrollment. Sometimes we want to convert these columnbs back into a flat index. This makes it easier to filter, merge, and sort the data and to access specific Series in the DataFrame. Unfortunately, we can’t simply call reset_index() like we did above.

Here after we reset the index, we rename all of the columns with a new list of strings. The length of the new list musth match the number of columns.

boros = df[df.ay == 2020]
boros = boros[["boro", "dbn", "total_enrollment"]]
table = boros.groupby(by="boro").agg({
    "dbn":"count", 
    "total_enrollment":["sum", "mean", "min", "max"]}).reset_index()

cols = ["Borough", "Num. Schools", "Total Students", "Avg School Size", "Smallest School", "Largest School"]
table.columns = cols
table
Borough Num. Schools Total Students Avg School Size Smallest School Largest School
0 Bronx 496 238074 479.987903 7 2936
1 Brooklyn 640 346802 541.878125 46 5921
2 Manhattan 408 194144 475.843137 14 3342
3 Queens 421 305660 726.033254 42 4424
4 Staten Island 83 63824 768.963855 144 3726

Grouping by Multiple Columns#

Sometimes we might want to pass in more than one column as the group by index. In this example we will group our data by borough and whether they are a charter school or not.

Note a couple of things in this short example:

  1. We call copy() on df because we are adding a new column and don’t want to mess with the original data frame

  2. We use a lambda function in apply, and we use a special type of if/else statement usually called a “ternary” operation

  3. Because we’re doing several agg functions, we will create them as a dict before we call agg(). This is just to make our code easier to read and maintain

# because we're adding data to our subset, we're going to call copy()
data = df[df.ay == 2020].copy()
data["school type"] = data.district.apply(lambda x: "charter" if x == 84 else "public")
data = data[["boro","school type","dbn","asian_pct","black_pct","hispanic_pct","white_pct"]]
aggs = {
    "dbn": "count",
    "asian_pct": "mean",
    "black_pct": "mean",
    "hispanic_pct": "mean",
    "white_pct": "mean"
}


data.groupby(["boro", "school type"]).agg(aggs).reset_index()
boro school type dbn asian_pct black_pct hispanic_pct white_pct
0 Bronx charter 89 0.022963 0.377873 0.569022 0.011857
1 Bronx public 407 0.041388 0.257767 0.645111 0.039413
2 Brooklyn charter 102 0.029157 0.600608 0.268745 0.071765
3 Brooklyn public 538 0.121413 0.408924 0.306846 0.137351
4 Manhattan charter 57 0.026281 0.445404 0.463035 0.037228
5 Manhattan public 351 0.106721 0.208111 0.467208 0.176647
6 Queens charter 28 0.092321 0.402000 0.406929 0.066893
7 Queens public 393 0.262712 0.197873 0.380275 0.119517
8 Staten Island charter 6 0.021333 0.258500 0.437667 0.237333
9 Staten Island public 77 0.116519 0.142909 0.319675 0.392117