Merges & Joins
Contents
Merges & Joins#
Here, finally, we combine data sets together. I say finally because this is one of the first things you will do when working with real world data. nycschools
has already combined multiple data sets into its core DataFrames, but you will want to combine this data in new ways, make new DataFrames from your results, and pull in new data from the outside world.
import pandas as pd
from nycschools import schools, exams
# school demographics in one frame
demo = schools.load_school_demographics()
# exam results in another frame
ela = exams.load_ela()
math = exams.load_math()
Merging Data#
When two data sets have a shared key then combining them into a single dataframe is straightforward. Here we use the merge()
function in DataFrame
to combine the school demographcs and ela test results into a single data frame. We merge them “on” the dbn
and ay
columns because these cols represent a unique identifier for each row in both data sets. Note that the result has only the intersecton of both datasets. If our demographic data has a DBN that’s not in the test data set, that school’s data will be dropped from the results.
demo.merge(ela, on=["dbn", "ay"]).head()
dbn | beds | district | geo_district | boro | school_name_x | short_name | ay | year | total_enrollment | ... | level_2_pct | level_3_n | level_3_pct | level_4_n | level_4_pct | level_3_4_n | level_3_4_pct | test_year | charter | school_name_y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2016 | 2016-17 | 178 | ... | 0.296296 | 7.0 | 0.259259 | 2.0 | 0.074074 | 9.0 | 0.333333 | 2017 | 0 | NaN |
1 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2016 | 2016-17 | 178 | ... | 0.521739 | 9.0 | 0.391304 | 1.0 | 0.043478 | 10.0 | 0.434783 | 2017 | 0 | NaN |
2 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2016 | 2016-17 | 178 | ... | 0.647059 | 5.0 | 0.294118 | 0.0 | 0.000000 | 5.0 | 0.294118 | 2017 | 0 | NaN |
3 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2016 | 2016-17 | 178 | ... | 0.462687 | 21.0 | 0.313433 | 3.0 | 0.044776 | 24.0 | 0.358209 | 2017 | 0 | NaN |
4 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2016 | 2016-17 | 178 | ... | 0.333333 | 7.0 | 0.333333 | 2.0 | 0.095238 | 9.0 | 0.428571 | 2017 | 0 | NaN |
5 rows × 68 columns
Astute readers will see a new column called school_name_x
in our results. You can’t see it in the truncated data above, but there is also a school_name_y
. Both data sets had a school_name
column, so pandas
suffixed the one from demo
with x
and the one from ela
with y
.
Because both columns should contain the same data (because they have the save DBN for the same school year), in this case we can just drop one of the columns before the merge. We’ll drop school_name
from ela
because the demographics data set should have our canoninical school names.
ela2 = ela.drop(columns="school_name")
demo.merge(ela2, on=["dbn", "ay"]).head()
dbn | beds | district | geo_district | boro | school_name | short_name | ay | year | total_enrollment | ... | level_2_n | level_2_pct | level_3_n | level_3_pct | level_4_n | level_4_pct | level_3_4_n | level_3_4_pct | test_year | charter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2016 | 2016-17 | 178 | ... | 8.0 | 0.296296 | 7.0 | 0.259259 | 2.0 | 0.074074 | 9.0 | 0.333333 | 2017 | 0 |
1 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2016 | 2016-17 | 178 | ... | 12.0 | 0.521739 | 9.0 | 0.391304 | 1.0 | 0.043478 | 10.0 | 0.434783 | 2017 | 0 |
2 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2016 | 2016-17 | 178 | ... | 11.0 | 0.647059 | 5.0 | 0.294118 | 0.0 | 0.000000 | 5.0 | 0.294118 | 2017 | 0 |
3 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2016 | 2016-17 | 178 | ... | 31.0 | 0.462687 | 21.0 | 0.313433 | 3.0 | 0.044776 | 24.0 | 0.358209 | 2017 | 0 |
4 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2016 | 2016-17 | 178 | ... | 7.0 | 0.333333 | 7.0 | 0.333333 | 2.0 | 0.095238 | 9.0 | 0.428571 | 2017 | 0 |
5 rows × 67 columns
Merge with suffixes: wide data#
When we have a data set where we use columns to represent different categories of data, we can call this “wide data.” In this example, we’re going to make a test results data frame that has math and ela test scores in the same row.
At the start of this notebook we loaded both the math and ela exam data. Note that they have exactly the same columns. When we merge them, we’re going to suffix the ela data with _ela
and the math data with _math
. We will drop school_name
from one of the data sets and use ["dbn", "ay", "test_year", "grade", "category"]
to join the two.
ela2 = ela.drop(columns="school_name")
wide = ela2.merge(math, on=["dbn", "ay", "test_year", "grade", "category"], suffixes=["_ela", "_math"])
wide.head()
dbn | grade | category | number_tested_ela | mean_scale_score_ela | level_1_n_ela | level_1_pct_ela | level_2_n_ela | level_2_pct_ela | level_3_n_ela | ... | level_2_n_math | level_2_pct_math | level_3_n_math | level_3_pct_math | level_4_n_math | level_4_pct_math | level_3_4_n_math | level_3_4_pct_math | charter_math | school_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01M015 | 3 | All Students | 27 | 289.296295 | 14.0 | 0.518519 | 11.0 | 0.407407 | 2.0 | ... | 11.0 | 0.407407 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0 | NaN |
1 | 01M015 | 4 | All Students | 20 | 277.649994 | 8.0 | 0.400000 | 11.0 | 0.550000 | 1.0 | ... | 6.0 | 0.300000 | 1.0 | 0.050000 | 1.0 | 0.050000 | 2.0 | 0.100000 | 0 | NaN |
2 | 01M015 | 5 | All Students | 24 | 283.958344 | 12.0 | 0.500000 | 11.0 | 0.458333 | 1.0 | ... | 6.0 | 0.250000 | 1.0 | 0.041667 | 0.0 | 0.000000 | 1.0 | 0.041667 | 0 | NaN |
3 | 01M015 | All Grades | All Students | 71 | 284.211273 | 34.0 | 0.478873 | 33.0 | 0.464789 | 4.0 | ... | 23.0 | 0.323944 | 2.0 | 0.028169 | 1.0 | 0.014085 | 3.0 | 0.042254 | 0 | NaN |
4 | 01M015 | 3 | Not SWD | 19 | 287.157898 | 11.0 | 0.578947 | 8.0 | 0.421053 | 0.0 | ... | 8.0 | 0.421053 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0 | NaN |
5 rows × 32 columns
Now that we have the wide data, we can easily compare math and ela results for the same set school/year/grade/category. Below we will calculate a new column called test_delta
that shows the difference between math and ela results for the same cohort of students in the same school. This will let us see, for example, which if any schools are “unbalanced” in math and ELA. Later, we might correlate this to other demographic data such as high percentages of ELL or SWD students.
wide["test_delta"] = wide.mean_scale_score_math - wide.mean_scale_score_ela
wide["test_delta_abs"] = abs(wide.mean_scale_score_math - wide.mean_scale_score_ela)
wide = wide.sort_values(by="test_delta_abs", ascending=False)
wide[["dbn","grade","ay","category","mean_scale_score_math", "mean_scale_score_ela", "test_delta"]].head()
dbn | grade | ay | category | mean_scale_score_math | mean_scale_score_ela | test_delta | |
---|---|---|---|---|---|---|---|
28254 | 03M054 | 8 | 2014 | Never ELL | 243.764710 | 331.356811 | -87.592102 |
82041 | 07X500 | 8 | 2014 | Female | 210.500000 | 296.057129 | -85.557129 |
28230 | 03M054 | 8 | 2014 | Female | 244.666672 | 329.838715 | -85.172043 |
82020 | 07X500 | 8 | 2014 | Not SWD | 222.750000 | 301.250000 | -78.500000 |
240736 | 20K105 | 5 | 2015 | Current ELL | 330.679260 | 252.888885 | 77.790375 |
Concatenate DataFrames: long data#
If we have two DataFrames with the same columns (like our math and ela results) we can concatenate the data from one df onto the other to create a longer dataframe. Generally, this is a more flexible format for analysis than the wide data format. In our math/ela example, it allows us to, for example, easily average the math and ela test scores together (filtering for group, year, etc).
In the example here we will add a new column, test
, which will have the value of “ela” or “math”. This column will let us know whether that row reports a math test or ela test result.
math["test"] = "math"
ela["test"] = "ela"
long = pd.concat([math, ela])
long
dbn | grade | category | number_tested | mean_scale_score | level_1_n | level_1_pct | level_2_n | level_2_pct | level_3_n | level_3_pct | level_4_n | level_4_pct | level_3_4_n | level_3_4_pct | test_year | ay | charter | school_name | test | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01M015 | 3 | All Students | 27 | 277.777771 | 16.0 | 0.592593 | 11.0 | 0.407407 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 2013 | 2012 | 0 | NaN | math |
7 | 01M015 | 4 | All Students | 20 | 277.399994 | 12.0 | 0.600000 | 6.0 | 0.300000 | 1.0 | 0.050000 | 1.0 | 0.050000 | 2.0 | 0.100000 | 2013 | 2012 | 0 | NaN | math |
14 | 01M015 | 5 | All Students | 24 | 274.000000 | 17.0 | 0.708333 | 6.0 | 0.250000 | 1.0 | 0.041667 | 0.0 | 0.000000 | 1.0 | 0.041667 | 2013 | 2012 | 0 | NaN | math |
21 | 01M015 | All Grades | All Students | 71 | 276.394379 | 45.0 | 0.633803 | 23.0 | 0.323944 | 2.0 | 0.028169 | 1.0 | 0.014085 | 3.0 | 0.042254 | 2013 | 2012 | 0 | NaN | math |
32606 | 01M015 | 3 | Not SWD | 19 | 275.736847 | 11.0 | 0.578947 | 8.0 | 0.421053 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 2013 | 2012 | 0 | NaN | math |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
430002 | 84X730 | 3 | All Students | 51 | 602.000000 | 2.0 | 0.039000 | 22.0 | 0.431000 | 23.0 | 0.451000 | 4.0 | 0.078000 | 27.0 | 0.529000 | 2019 | 2018 | 1 | BRONX CHARTER SCHOOL FOR THE ARTS | ela |
430009 | 84X730 | 4 | All Students | 49 | 591.000000 | 18.0 | 0.367000 | 14.0 | 0.286000 | 13.0 | 0.265000 | 4.0 | 0.082000 | 17.0 | 0.347000 | 2019 | 2018 | 1 | BRONX CHARTER SCHOOL FOR THE ARTS | ela |
430016 | 84X730 | 5 | All Students | 53 | 597.000000 | 18.0 | 0.340000 | 20.0 | 0.377000 | 8.0 | 0.151000 | 7.0 | 0.132000 | 15.0 | 0.283000 | 2019 | 2018 | 1 | BRONX CHARTER SCHOOL FOR THE ARTS | ela |
430017 | 84X730 | 6 | All Students | 113 | 592.000000 | 45.0 | 0.398000 | 33.0 | 0.292000 | 23.0 | 0.204000 | 12.0 | 0.106000 | 35.0 | 0.310000 | 2019 | 2018 | 1 | BRONX CHARTER SCHOOL FOR THE ARTS | ela |
430024 | 84X730 | All Grades | All Students | 266 | 594.000000 | 83.0 | 0.312000 | 89.0 | 0.335000 | 67.0 | 0.252000 | 27.0 | 0.102000 | 94.0 | 0.353000 | 2019 | 2018 | 1 | BRONX CHARTER SCHOOL FOR THE ARTS | ela |
856567 rows × 20 columns