Merges & Joins#

Here, finally, we combine data sets together. I say finally because this is one of the first things you will do when working with real world data. nycschools has already combined multiple data sets into its core DataFrames, but you will want to combine this data in new ways, make new DataFrames from your results, and pull in new data from the outside world.

import pandas as pd
from nycschools import schools, exams

# school demographics in one frame
demo = schools.load_school_demographics()

# exam results in another frame
ela = exams.load_ela()
math = exams.load_math()

Merging Data#

When two data sets have a shared key then combining them into a single dataframe is straightforward. Here we use the merge() function in DataFrame to combine the school demographcs and ela test results into a single data frame. We merge them “on” the dbn and ay columns because these cols represent a unique identifier for each row in both data sets. Note that the result has only the intersecton of both datasets. If our demographic data has a DBN that’s not in the test data set, that school’s data will be dropped from the results.

demo.merge(ela, on=["dbn", "ay"]).head()
dbn beds district geo_district boro school_name_x short_name ay year total_enrollment ... level_2_pct level_3_n level_3_pct level_4_n level_4_pct level_3_4_n level_3_4_pct test_year charter school_name_y
0 01M015 310100010015 1 1 Manhattan P.S. 015 Roberto Clemente PS 15 2016 2016-17 178 ... 0.296296 7.0 0.259259 2.0 0.074074 9.0 0.333333 2017 0 NaN
1 01M015 310100010015 1 1 Manhattan P.S. 015 Roberto Clemente PS 15 2016 2016-17 178 ... 0.521739 9.0 0.391304 1.0 0.043478 10.0 0.434783 2017 0 NaN
2 01M015 310100010015 1 1 Manhattan P.S. 015 Roberto Clemente PS 15 2016 2016-17 178 ... 0.647059 5.0 0.294118 0.0 0.000000 5.0 0.294118 2017 0 NaN
3 01M015 310100010015 1 1 Manhattan P.S. 015 Roberto Clemente PS 15 2016 2016-17 178 ... 0.462687 21.0 0.313433 3.0 0.044776 24.0 0.358209 2017 0 NaN
4 01M015 310100010015 1 1 Manhattan P.S. 015 Roberto Clemente PS 15 2016 2016-17 178 ... 0.333333 7.0 0.333333 2.0 0.095238 9.0 0.428571 2017 0 NaN

5 rows × 68 columns

Astute readers will see a new column called school_name_x in our results. You can’t see it in the truncated data above, but there is also a school_name_y. Both data sets had a school_name column, so pandas suffixed the one from demo with x and the one from ela with y.

Because both columns should contain the same data (because they have the save DBN for the same school year), in this case we can just drop one of the columns before the merge. We’ll drop school_name from ela because the demographics data set should have our canoninical school names.

ela2 = ela.drop(columns="school_name")
demo.merge(ela2, on=["dbn", "ay"]).head()
dbn beds district geo_district boro school_name short_name ay year total_enrollment ... level_2_n level_2_pct level_3_n level_3_pct level_4_n level_4_pct level_3_4_n level_3_4_pct test_year charter
0 01M015 310100010015 1 1 Manhattan P.S. 015 Roberto Clemente PS 15 2016 2016-17 178 ... 8.0 0.296296 7.0 0.259259 2.0 0.074074 9.0 0.333333 2017 0
1 01M015 310100010015 1 1 Manhattan P.S. 015 Roberto Clemente PS 15 2016 2016-17 178 ... 12.0 0.521739 9.0 0.391304 1.0 0.043478 10.0 0.434783 2017 0
2 01M015 310100010015 1 1 Manhattan P.S. 015 Roberto Clemente PS 15 2016 2016-17 178 ... 11.0 0.647059 5.0 0.294118 0.0 0.000000 5.0 0.294118 2017 0
3 01M015 310100010015 1 1 Manhattan P.S. 015 Roberto Clemente PS 15 2016 2016-17 178 ... 31.0 0.462687 21.0 0.313433 3.0 0.044776 24.0 0.358209 2017 0
4 01M015 310100010015 1 1 Manhattan P.S. 015 Roberto Clemente PS 15 2016 2016-17 178 ... 7.0 0.333333 7.0 0.333333 2.0 0.095238 9.0 0.428571 2017 0

5 rows × 67 columns

Merge with suffixes: wide data#

When we have a data set where we use columns to represent different categories of data, we can call this “wide data.” In this example, we’re going to make a test results data frame that has math and ela test scores in the same row.

At the start of this notebook we loaded both the math and ela exam data. Note that they have exactly the same columns. When we merge them, we’re going to suffix the ela data with _ela and the math data with _math. We will drop school_name from one of the data sets and use ["dbn", "ay", "test_year", "grade", "category"] to join the two.

ela2 = ela.drop(columns="school_name")
wide = ela2.merge(math, on=["dbn", "ay", "test_year", "grade", "category"], suffixes=["_ela", "_math"])
wide.head()
dbn grade category number_tested_ela mean_scale_score_ela level_1_n_ela level_1_pct_ela level_2_n_ela level_2_pct_ela level_3_n_ela ... level_2_n_math level_2_pct_math level_3_n_math level_3_pct_math level_4_n_math level_4_pct_math level_3_4_n_math level_3_4_pct_math charter_math school_name
0 01M015 3 All Students 27 289.296295 14.0 0.518519 11.0 0.407407 2.0 ... 11.0 0.407407 0.0 0.000000 0.0 0.000000 0.0 0.000000 0 NaN
1 01M015 4 All Students 20 277.649994 8.0 0.400000 11.0 0.550000 1.0 ... 6.0 0.300000 1.0 0.050000 1.0 0.050000 2.0 0.100000 0 NaN
2 01M015 5 All Students 24 283.958344 12.0 0.500000 11.0 0.458333 1.0 ... 6.0 0.250000 1.0 0.041667 0.0 0.000000 1.0 0.041667 0 NaN
3 01M015 All Grades All Students 71 284.211273 34.0 0.478873 33.0 0.464789 4.0 ... 23.0 0.323944 2.0 0.028169 1.0 0.014085 3.0 0.042254 0 NaN
4 01M015 3 Not SWD 19 287.157898 11.0 0.578947 8.0 0.421053 0.0 ... 8.0 0.421053 0.0 0.000000 0.0 0.000000 0.0 0.000000 0 NaN

5 rows × 32 columns

Now that we have the wide data, we can easily compare math and ela results for the same set school/year/grade/category. Below we will calculate a new column called test_delta that shows the difference between math and ela results for the same cohort of students in the same school. This will let us see, for example, which if any schools are “unbalanced” in math and ELA. Later, we might correlate this to other demographic data such as high percentages of ELL or SWD students.

wide["test_delta"] = wide.mean_scale_score_math - wide.mean_scale_score_ela
wide["test_delta_abs"] = abs(wide.mean_scale_score_math - wide.mean_scale_score_ela)

wide = wide.sort_values(by="test_delta_abs", ascending=False)

wide[["dbn","grade","ay","category","mean_scale_score_math", "mean_scale_score_ela", "test_delta"]].head()
dbn grade ay category mean_scale_score_math mean_scale_score_ela test_delta
28254 03M054 8 2014 Never ELL 243.764710 331.356811 -87.592102
82041 07X500 8 2014 Female 210.500000 296.057129 -85.557129
28230 03M054 8 2014 Female 244.666672 329.838715 -85.172043
82020 07X500 8 2014 Not SWD 222.750000 301.250000 -78.500000
240736 20K105 5 2015 Current ELL 330.679260 252.888885 77.790375

Concatenate DataFrames: long data#

If we have two DataFrames with the same columns (like our math and ela results) we can concatenate the data from one df onto the other to create a longer dataframe. Generally, this is a more flexible format for analysis than the wide data format. In our math/ela example, it allows us to, for example, easily average the math and ela test scores together (filtering for group, year, etc).

In the example here we will add a new column, test, which will have the value of “ela” or “math”. This column will let us know whether that row reports a math test or ela test result.

math["test"] = "math"
ela["test"] = "ela"
long = pd.concat([math, ela])
long
dbn grade category number_tested mean_scale_score level_1_n level_1_pct level_2_n level_2_pct level_3_n level_3_pct level_4_n level_4_pct level_3_4_n level_3_4_pct test_year ay charter school_name test
0 01M015 3 All Students 27 277.777771 16.0 0.592593 11.0 0.407407 0.0 0.000000 0.0 0.000000 0.0 0.000000 2013 2012 0 NaN math
7 01M015 4 All Students 20 277.399994 12.0 0.600000 6.0 0.300000 1.0 0.050000 1.0 0.050000 2.0 0.100000 2013 2012 0 NaN math
14 01M015 5 All Students 24 274.000000 17.0 0.708333 6.0 0.250000 1.0 0.041667 0.0 0.000000 1.0 0.041667 2013 2012 0 NaN math
21 01M015 All Grades All Students 71 276.394379 45.0 0.633803 23.0 0.323944 2.0 0.028169 1.0 0.014085 3.0 0.042254 2013 2012 0 NaN math
32606 01M015 3 Not SWD 19 275.736847 11.0 0.578947 8.0 0.421053 0.0 0.000000 0.0 0.000000 0.0 0.000000 2013 2012 0 NaN math
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
430002 84X730 3 All Students 51 602.000000 2.0 0.039000 22.0 0.431000 23.0 0.451000 4.0 0.078000 27.0 0.529000 2019 2018 1 BRONX CHARTER SCHOOL FOR THE ARTS ela
430009 84X730 4 All Students 49 591.000000 18.0 0.367000 14.0 0.286000 13.0 0.265000 4.0 0.082000 17.0 0.347000 2019 2018 1 BRONX CHARTER SCHOOL FOR THE ARTS ela
430016 84X730 5 All Students 53 597.000000 18.0 0.340000 20.0 0.377000 8.0 0.151000 7.0 0.132000 15.0 0.283000 2019 2018 1 BRONX CHARTER SCHOOL FOR THE ARTS ela
430017 84X730 6 All Students 113 592.000000 45.0 0.398000 33.0 0.292000 23.0 0.204000 12.0 0.106000 35.0 0.310000 2019 2018 1 BRONX CHARTER SCHOOL FOR THE ARTS ela
430024 84X730 All Grades All Students 266 594.000000 83.0 0.312000 89.0 0.335000 67.0 0.252000 27.0 0.102000 94.0 0.353000 2019 2018 1 BRONX CHARTER SCHOOL FOR THE ARTS ela

856567 rows × 20 columns