Merges & Joins#

Here, finally, we combine data sets together. I say finally because this is one of the first things you will do when working with real world data. nycschools has already combined multiple data sets into its core DataFrames, but you will want to combine this data in new ways, make new DataFrames from your results, and pull in new data from the outside world.

import pandas as pd
from nycschools import schools, exams

# school demographics in one frame
demo = schools.load_school_demographics()

# exam results in another frame
ela = exams.load_ela()
math = exams.load_math()

Merging Data#

When two data sets have a shared key then combining them into a single dataframe is straightforward. Here we use the merge() function in DataFrame to combine the school demographcs and ela test results into a single data frame. We merge them “on” the dbn and ay columns because these cols represent a unique identifier for each row in both data sets. Note that the result has only the intersecton of both datasets. If our demographic data has a DBN that’s not in the test data set, that school’s data will be dropped from the results.

demo.merge(ela, on=["dbn", "ay"]).head()

	dbn	beds	district	geo_district	boro	school_name_x	short_name	ay	year	total_enrollment	...	level_2_pct	level_3_n	level_3_pct	level_4_n	level_4_pct	level_3_4_n	level_3_4_pct	test_year	school_name_y
0	01M015	310100010015	1	1	Manhattan	P.S. 015 Roberto Clemente	PS 15	2016	2016-17	178	...	0.296296	7.0	0.259259	2.0	0.074074	9.0	0.333333	2017	NaN
1	01M015	310100010015	1	1	Manhattan	P.S. 015 Roberto Clemente	PS 15	2016	2016-17	178	...	0.521739	9.0	0.391304	1.0	0.043478	10.0	0.434783	2017	NaN
2	01M015	310100010015	1	1	Manhattan	P.S. 015 Roberto Clemente	PS 15	2016	2016-17	178	...	0.647059	5.0	0.294118	0.0	0.000000	5.0	0.294118	2017	NaN
3	01M015	310100010015	1	1	Manhattan	P.S. 015 Roberto Clemente	PS 15	2016	2016-17	178	...	0.462687	21.0	0.313433	3.0	0.044776	24.0	0.358209	2017	NaN
4	01M015	310100010015	1	1	Manhattan	P.S. 015 Roberto Clemente	PS 15	2016	2016-17	178	...	0.333333	7.0	0.333333	2.0	0.095238	9.0	0.428571	2017	NaN

5 rows × 68 columns

Astute readers will see a new column called school_name_x in our results. You can’t see it in the truncated data above, but there is also a school_name_y. Both data sets had a school_name column, so pandas suffixed the one from demo with x and the one from ela with y.

Because both columns should contain the same data (because they have the save DBN for the same school year), in this case we can just drop one of the columns before the merge. We’ll drop school_name from ela because the demographics data set should have our canoninical school names.

ela2 = ela.drop(columns="school_name")
demo.merge(ela2, on=["dbn", "ay"]).head()

	dbn	beds	district	geo_district	boro	school_name	short_name	ay	year	total_enrollment	...	level_2_n	level_2_pct	level_3_n	level_3_pct	level_4_n	level_4_pct	level_3_4_n	level_3_4_pct	test_year
0	01M015	310100010015	1	1	Manhattan	P.S. 015 Roberto Clemente	PS 15	2016	2016-17	178	...	8.0	0.296296	7.0	0.259259	2.0	0.074074	9.0	0.333333	2017
1	01M015	310100010015	1	1	Manhattan	P.S. 015 Roberto Clemente	PS 15	2016	2016-17	178	...	12.0	0.521739	9.0	0.391304	1.0	0.043478	10.0	0.434783	2017
2	01M015	310100010015	1	1	Manhattan	P.S. 015 Roberto Clemente	PS 15	2016	2016-17	178	...	11.0	0.647059	5.0	0.294118	0.0	0.000000	5.0	0.294118	2017
3	01M015	310100010015	1	1	Manhattan	P.S. 015 Roberto Clemente	PS 15	2016	2016-17	178	...	31.0	0.462687	21.0	0.313433	3.0	0.044776	24.0	0.358209	2017
4	01M015	310100010015	1	1	Manhattan	P.S. 015 Roberto Clemente	PS 15	2016	2016-17	178	...	7.0	0.333333	7.0	0.333333	2.0	0.095238	9.0	0.428571	2017

5 rows × 67 columns

Merge with suffixes: wide data#

When we have a data set where we use columns to represent different categories of data, we can call this “wide data.” In this example, we’re going to make a test results data frame that has math and ela test scores in the same row.

At the start of this notebook we loaded both the math and ela exam data. Note that they have exactly the same columns. When we merge them, we’re going to suffix the ela data with _ela and the math data with _math. We will drop school_name from one of the data sets and use ["dbn", "ay", "test_year", "grade", "category"] to join the two.

ela2 = ela.drop(columns="school_name")
wide = ela2.merge(math, on=["dbn", "ay", "test_year", "grade", "category"], suffixes=["_ela", "_math"])
wide.head()

	dbn	grade	category	number_tested_ela	mean_scale_score_ela	level_1_n_ela	level_1_pct_ela	level_2_n_ela	level_2_pct_ela	level_3_n_ela	...	level_2_n_math	level_2_pct_math	level_3_n_math	level_3_pct_math	level_4_n_math	level_4_pct_math	level_3_4_n_math	level_3_4_pct_math	school_name
0	01M015	3	All Students	27	289.296295	14.0	0.518519	11.0	0.407407	2.0	...	11.0	0.407407	0.0	0.000000	0.0	0.000000	0.0	0.000000	NaN
1	01M015	4	All Students	20	277.649994	8.0	0.400000	11.0	0.550000	1.0	...	6.0	0.300000	1.0	0.050000	1.0	0.050000	2.0	0.100000	NaN
2	01M015	5	All Students	24	283.958344	12.0	0.500000	11.0	0.458333	1.0	...	6.0	0.250000	1.0	0.041667	0.0	0.000000	1.0	0.041667	NaN
3	01M015	All Grades	All Students	71	284.211273	34.0	0.478873	33.0	0.464789	4.0	...	23.0	0.323944	2.0	0.028169	1.0	0.014085	3.0	0.042254	NaN
4	01M015	3	Not SWD	19	287.157898	11.0	0.578947	8.0	0.421053	0.0	...	8.0	0.421053	0.0	0.000000	0.0	0.000000	0.0	0.000000	NaN

5 rows × 32 columns

Now that we have the wide data, we can easily compare math and ela results for the same set school/year/grade/category. Below we will calculate a new column called test_delta that shows the difference between math and ela results for the same cohort of students in the same school. This will let us see, for example, which if any schools are “unbalanced” in math and ELA. Later, we might correlate this to other demographic data such as high percentages of ELL or SWD students.

wide["test_delta"] = wide.mean_scale_score_math - wide.mean_scale_score_ela
wide["test_delta_abs"] = abs(wide.mean_scale_score_math - wide.mean_scale_score_ela)

wide = wide.sort_values(by="test_delta_abs", ascending=False)

wide[["dbn","grade","ay","category","mean_scale_score_math", "mean_scale_score_ela", "test_delta"]].head()

	dbn	grade	ay	category	mean_scale_score_math	mean_scale_score_ela	test_delta
28254	03M054	8	2014	Never ELL	243.764710	331.356811	-87.592102
82041	07X500	8	2014	Female	210.500000	296.057129	-85.557129
28230	03M054	8	2014	Female	244.666672	329.838715	-85.172043
82020	07X500	8	2014	Not SWD	222.750000	301.250000	-78.500000
240736	20K105	5	2015	Current ELL	330.679260	252.888885	77.790375

Concatenate DataFrames: long data#

If we have two DataFrames with the same columns (like our math and ela results) we can concatenate the data from one df onto the other to create a longer dataframe. Generally, this is a more flexible format for analysis than the wide data format. In our math/ela example, it allows us to, for example, easily average the math and ela test scores together (filtering for group, year, etc).

In the example here we will add a new column, test, which will have the value of “ela” or “math”. This column will let us know whether that row reports a math test or ela test result.

math["test"] = "math"
ela["test"] = "ela"
long = pd.concat([math, ela])
long

	dbn	grade	category	number_tested	mean_scale_score	level_1_n	level_1_pct	level_2_n	level_2_pct	level_3_n	level_3_pct	level_4_n	level_4_pct	level_3_4_n	level_3_4_pct	test_year	ay	charter	school_name	test
0	01M015	3	All Students	27	277.777771	16.0	0.592593	11.0	0.407407	0.0	0.000000	0.0	0.000000	0.0	0.000000	2013	2012	0	NaN	math
7	01M015	4	All Students	20	277.399994	12.0	0.600000	6.0	0.300000	1.0	0.050000	1.0	0.050000	2.0	0.100000	2013	2012	0	NaN	math
14	01M015	5	All Students	24	274.000000	17.0	0.708333	6.0	0.250000	1.0	0.041667	0.0	0.000000	1.0	0.041667	2013	2012	0	NaN	math
21	01M015	All Grades	All Students	71	276.394379	45.0	0.633803	23.0	0.323944	2.0	0.028169	1.0	0.014085	3.0	0.042254	2013	2012	0	NaN	math
32606	01M015	3	Not SWD	19	275.736847	11.0	0.578947	8.0	0.421053	0.0	0.000000	0.0	0.000000	0.0	0.000000	2013	2012	0	NaN	math
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
430002	84X730	3	All Students	51	602.000000	2.0	0.039000	22.0	0.431000	23.0	0.451000	4.0	0.078000	27.0	0.529000	2019	2018	1	BRONX CHARTER SCHOOL FOR THE ARTS	ela
430009	84X730	4	All Students	49	591.000000	18.0	0.367000	14.0	0.286000	13.0	0.265000	4.0	0.082000	17.0	0.347000	2019	2018	1	BRONX CHARTER SCHOOL FOR THE ARTS	ela
430016	84X730	5	All Students	53	597.000000	18.0	0.340000	20.0	0.377000	8.0	0.151000	7.0	0.132000	15.0	0.283000	2019	2018	1	BRONX CHARTER SCHOOL FOR THE ARTS	ela
430017	84X730	6	All Students	113	592.000000	45.0	0.398000	33.0	0.292000	23.0	0.204000	12.0	0.106000	35.0	0.310000	2019	2018	1	BRONX CHARTER SCHOOL FOR THE ARTS	ela
430024	84X730	All Grades	All Students	266	594.000000	83.0	0.312000	89.0	0.335000	67.0	0.252000	27.0	0.102000	94.0	0.353000	2019	2018	1	BRONX CHARTER SCHOOL FOR THE ARTS	ela

856567 rows × 20 columns

NYC Schools Open Data Portal

Merges & Joins

Contents

Merges & Joins#

Merging Data#

Merge with suffixes: wide data#

Concatenate DataFrames: long data#