Querying & Selecting#

In this Notebook we go over some of the ways of querying data in a dataframe.

[] slice notation for subsets of data
query() method of DataFrame to search data
using isin(), isnull()
drop_duplicates()

# import schools from the nycschool package
from nycschools import schools, exams

# load the demographic data into a `DataFrame` called df
df = schools.load_school_demographics()

# we already saw how to get a subset of columns
# let's get just these columns from out data
cols = ["dbn", "ay", "school_name", "district", "poverty_pct", "ell_pct", "swd_pct"]
df = df[cols]
df

	dbn	ay	school_name	district	poverty_pct	ell_pct	swd_pct
0	01M015	2016	P.S. 015 Roberto Clemente	1	0.854	0.067	0.287000
1	01M015	2017	P.S. 015 Roberto Clemente	1	0.847	0.042	0.258000
2	01M015	2018	P.S. 015 Roberto Clemente	1	0.845	0.046	0.224000
3	01M015	2019	P.S. 015 Roberto Clemente	1	0.816	0.089	0.242000
4	01M015	2020	P.S. 015 Roberto Clemente	1	0.819	0.109	0.223000
...	...	...	...	...	...	...	...
9996	84X730	2016	Bronx Charter School for the Arts	84	0.734	0.159	0.209375
9997	84X730	2017	Bronx Charter School for the Arts	84	0.822	0.182	0.216561
9998	84X730	2018	Bronx Charter School for the Arts	84	0.844	0.165	0.239535
9999	84X730	2019	Bronx Charter School for the Arts	84	0.866	0.132	0.223709
10000	84X730	2020	Bronx Charter School for the Arts	84	0.864	0.125	0.244409

10001 rows × 7 columns

Selecting just one year#

We can see that our data contains multiple years of data for each school. We can filter or query the data to just get a single year.

To do this, we will use the slice notation similary to above, however, instead of a list of columns, we have a Boolean expression using one of the columns in our data set.

df[df.ay == 2020] returns only the rows where ay equals 2020

ay_2020 = df[df.ay == 2020]
ay_2020.head()

	dbn	ay	school_name	district	poverty_pct	ell_pct	swd_pct
4	01M015	2020	P.S. 015 Roberto Clemente	1	0.819	0.109	0.223
9	01M019	2020	P.S. 019 Asher Levy	1	0.712	0.042	0.392
14	01M020	2020	P.S. 020 Anna Silver	1	0.709	0.119	0.218
19	01M034	2020	P.S. 034 Franklin D. Roosevelt	1	0.960	0.062	0.392
24	01M063	2020	The STAR Academy - P.S.63	1	0.769	0.014	0.279

Compound expressions#

We can use Boolean operators to make more complex queries. pandas uses the & operator for Boolean and and the | operator for Boolean or. Note that you should wrap your equality expressions inside of parenethesis.

To write bug-free code, we do not recommend mixing and and or clauses in the same query.

Below are two examples:

df[(df.district == 13) & (df.ay == 2020)] find data forschools in district 13 and academic year 2020-21
df[(df.poverty_pct > .9) | (df.ell_pct >= .4)] find school data where the school’s poverty percent is greater than 90% or the percent of studnts classified as English Language Learners is greater than or equal to 40%

df[(df.district == 13) & (df.ay == 2020)].head()

	dbn	ay	school_name	district	poverty_pct	ell_pct	swd_pct
1556	13K869	2020	District 13 PRE-K Center	13	0.076	0.000	0.015
2129	13K915	2020	I.S. 915	13	0.309	0.023	0.201
2130	13K915	2020	I.S. 915	13	0.309	0.023	0.201
2131	13K915	2020	I.S. 915	13	0.309	0.023	0.201
2132	13K915	2020	I.S. 915	13	0.309	0.023	0.201

df[(df.poverty_pct > .9) | (df.ell_pct >= .4)].head()

	dbn	ay	school_name	district	poverty_pct	ell_pct	swd_pct
15	01M034	2016	P.S. 034 Franklin D. Roosevelt	1	0.96	0.077	0.371
16	01M034	2017	P.S. 034 Franklin D. Roosevelt	1	0.96	0.075	0.372
17	01M034	2018	P.S. 034 Franklin D. Roosevelt	1	0.96	0.072	0.384
18	01M034	2019	P.S. 034 Franklin D. Roosevelt	1	0.96	0.057	0.395
19	01M034	2020	P.S. 034 Franklin D. Roosevelt	1	0.96	0.062	0.392

# this works, but it is confusing and error prone
df[(df.ay == 2020)  & ((df.poverty_pct > .9) | (df.ell_pct >= .4))].head()

	dbn	ay	school_name	district	poverty_pct	ell_pct	swd_pct
19	01M034	2020	P.S. 034 Franklin D. Roosevelt	1	0.960	0.062	0.392
29	01M064	2020	P.S. 064 Robert Simon	1	0.914	0.018	0.264
39	01M134	2020	P.S. 134 Henrietta Szold	1	0.949	0.055	0.419
44	01M140	2020	P.S. 140 Nathan Straus	1	0.950	0.073	0.372
49	01M142	2020	P.S. 142 Amalia Castro	1	0.946	0.066	0.284

# rather than mixing | and &, we can write our code more clearly in two lines
data = df[df.ay == 2020]
data = data[(data.poverty_pct > .9) | (data.ell_pct >= .4)]
data.head()

	dbn	ay	school_name	district	poverty_pct	ell_pct	swd_pct
19	01M034	2020	P.S. 034 Franklin D. Roosevelt	1	0.960	0.062	0.392
29	01M064	2020	P.S. 064 Robert Simon	1	0.914	0.018	0.264
39	01M134	2020	P.S. 134 Henrietta Szold	1	0.949	0.055	0.419
44	01M140	2020	P.S. 140 Nathan Straus	1	0.950	0.073	0.372
49	01M142	2020	P.S. 142 Amalia Castro	1	0.946	0.066	0.284

Finding a subset of data#

DataFrame has a usefule function called isin() that lets us filter data on a Series if the value is in a set of values. First, we’ll do a simple example where we find all of the schools in a district [3,8,15]

districts = [3, 8, 5]
data = df[df.district.isin(districts)]
data[["dbn", "district", "school_name"]]

	dbn	district	school_name
882	03M009	3	P.S. 009 Sarah Anderson
883	03M009	3	P.S. 009 Sarah Anderson
884	03M009	3	P.S. 009 Sarah Anderson
885	03M009	3	P.S. 009 Sarah Anderson
886	03M009	3	P.S. 009 Sarah Anderson
...	...	...	...
2142	08X562	8	Blueprint Middle School
2143	08X583	8	P.S. 583
2144	08X583	8	P.S. 583
2145	08X583	8	P.S. 583
2146	08X583	8	P.S. 583

675 rows × 3 columns

Using `drop_duplicates`#

In the display of the table above, since we chose not to show all of the columns, we have repeated columns. Sometimes we will want just the unique results. We saw that unique() works for a Series. It doesn’t work for a dataframe, however.

Luckily, we have drop_duplicates() which serves a similar function…see:

data[["dbn", "district", "school_name"]].drop_duplicates()

	dbn	district	school_name
882	03M009	3	P.S. 009 Sarah Anderson
887	03M054	3	J.H.S. 054 Booker T. Washington
892	03M075	3	P.S. 075 Emily Dickinson
897	03M076	3	P.S. 076 A. Philip Randolph
902	03M084	3	P.S. 084 Lillian Weber
...	...	...	...
2119	08X558	8	Westchester Square Academy
2124	08X559	8	School for Tourism and Hospitality
2133	08X561	8	Bronx Compass High School
2138	08X562	8	Blueprint Middle School
2143	08X583	8	P.S. 583

124 rows × 3 columns

Filtering missing data#

You will not always have complete data, and sometimes you need to find rows with missing records. You may want to update them with some default or derived values, or, you may simpley want to exclude them from your data.

In this example, we will use nycschools to load the grade 3-8 ELA NYS exams. We will then filter out all of the rows that don’t have a score. Some rows are null because not enough students took the exam. Others may be missing data because their students were exempt from the exam or other reasons.

from nycschools import exams
ela = exams.load_ela()
print(f"There are {len(ela):,} total records in ela")
ela.head()

There are 430,025 total records in ela

	dbn	grade	category	number_tested	mean_scale_score	level_1_n	level_1_pct	level_2_n	level_2_pct	level_3_n	level_3_pct	level_3_4_n	level_3_4_pct	test_year	ay	school_name
0	01M015	3	All Students	27	289.296295	14.0	0.518519	11.0	0.407407	2.0	0.074074	2.0	0.074074	2013	2012	NaN
7	01M015	4	All Students	20	277.649994	8.0	0.400000	11.0	0.550000	1.0	0.050000	1.0	0.050000	2013	2012	NaN
14	01M015	5	All Students	24	283.958344	12.0	0.500000	11.0	0.458333	1.0	0.041667	1.0	0.041667	2013	2012	NaN
21	01M015	All Grades	All Students	71	284.211273	34.0	0.478873	33.0	0.464789	4.0	0.056338	4.0	0.056338	2013	2012	NaN
32826	01M015	3	Not SWD	19	287.157898	11.0	0.578947	8.0	0.421053	0.0	0.000000	0.0	0.000000	2013	2012	NaN

missing = ela[ela.mean_scale_score.isnull()]
print(f"There are {len(missing):,} records in ela with null values for mean_scale_score. NaN means 'Not a Number'")

missing

There are 90,576 records in ela with null values for mean_scale_score. NaN means 'Not a Number'

	dbn	grade	category	number_tested	mean_scale_score	level_1_n	level_1_pct	level_2_n	level_2_pct	level_3_n	level_3_pct	level_4_n	level_4_pct	level_3_4_n	level_3_4_pct	test_year	ay	charter	school_name
97989	01M015	3	Asian	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2013	2012	0	NaN
97995	01M015	4	Asian	2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2013	2012	0	NaN
98001	01M015	5	Asian	2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2013	2012	0	NaN
98007	01M015	All Grades	Asian	5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2013	2012	0	NaN
98014	01M015	3	Black	11	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2013	2012	0	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
343977	32K562	6	Not Econ Disadv	3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2019	2018	0	NaN
427666	84M481	5	All Students	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2013	2012	1	DEMOCRACY PREP HARLEM CHARTER SCHOOL
427675	84M481	7	All Students	110	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2013	2012	1	DEMOCRACY PREP HARLEM CHARTER SCHOOL
429139	84X394	5	All Students	31	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2014	2013	1	MOTT HAVEN ACADEMY CHARTER SCHOOL
429145	84X394	6	All Students	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2014	2013	1	MOTT HAVEN ACADEMY CHARTER SCHOOL

90576 rows × 19 columns

NYC Schools Open Data Portal

Querying & Selecting

Contents

Querying & Selecting#

Selecting just one year#

Compound expressions#

Finding a subset of data#

Using `drop_duplicates`#

Filtering missing data#

NYC Schools Open Data Portal

Querying & Selecting

Contents

Querying & Selecting#

Selecting just one year#

Compound expressions#

Finding a subset of data#

Using drop_duplicates#

Filtering missing data#

Using `drop_duplicates`#