Querying & Selecting#

In this Notebook we go over some of the ways of querying data in a dataframe.

  • [] slice notation for subsets of data

  • query() method of DataFrame to search data

  • using isin(), isnull()

  • drop_duplicates()

# import schools from the nycschool package
from nycschools import schools, exams

# load the demographic data into a `DataFrame` called df
df = schools.load_school_demographics()
# we already saw how to get a subset of columns
# let's get just these columns from out data
cols = ["dbn", "ay", "school_name", "district", "poverty_pct", "ell_pct", "swd_pct"]
df = df[cols]
df
dbn ay school_name district poverty_pct ell_pct swd_pct
0 01M015 2016 P.S. 015 Roberto Clemente 1 0.854 0.067 0.287000
1 01M015 2017 P.S. 015 Roberto Clemente 1 0.847 0.042 0.258000
2 01M015 2018 P.S. 015 Roberto Clemente 1 0.845 0.046 0.224000
3 01M015 2019 P.S. 015 Roberto Clemente 1 0.816 0.089 0.242000
4 01M015 2020 P.S. 015 Roberto Clemente 1 0.819 0.109 0.223000
... ... ... ... ... ... ... ...
9996 84X730 2016 Bronx Charter School for the Arts 84 0.734 0.159 0.209375
9997 84X730 2017 Bronx Charter School for the Arts 84 0.822 0.182 0.216561
9998 84X730 2018 Bronx Charter School for the Arts 84 0.844 0.165 0.239535
9999 84X730 2019 Bronx Charter School for the Arts 84 0.866 0.132 0.223709
10000 84X730 2020 Bronx Charter School for the Arts 84 0.864 0.125 0.244409

10001 rows × 7 columns

Selecting just one year#

We can see that our data contains multiple years of data for each school. We can filter or query the data to just get a single year.

To do this, we will use the slice notation similary to above, however, instead of a list of columns, we have a Boolean expression using one of the columns in our data set.

df[df.ay == 2020] returns only the rows where ay equals 2020

ay_2020 = df[df.ay == 2020]
ay_2020.head()
dbn ay school_name district poverty_pct ell_pct swd_pct
4 01M015 2020 P.S. 015 Roberto Clemente 1 0.819 0.109 0.223
9 01M019 2020 P.S. 019 Asher Levy 1 0.712 0.042 0.392
14 01M020 2020 P.S. 020 Anna Silver 1 0.709 0.119 0.218
19 01M034 2020 P.S. 034 Franklin D. Roosevelt 1 0.960 0.062 0.392
24 01M063 2020 The STAR Academy - P.S.63 1 0.769 0.014 0.279

Compound expressions#

We can use Boolean operators to make more complex queries. pandas uses the & operator for Boolean and and the | operator for Boolean or. Note that you should wrap your equality expressions inside of parenethesis.

To write bug-free code, we do not recommend mixing and and or clauses in the same query.

Below are two examples:

  • df[(df.district == 13) & (df.ay == 2020)] find data forschools in district 13 and academic year 2020-21

  • df[(df.poverty_pct > .9) | (df.ell_pct >= .4)] find school data where the school’s poverty percent is greater than 90% or the percent of studnts classified as English Language Learners is greater than or equal to 40%

df[(df.district == 13) & (df.ay == 2020)].head()
dbn ay school_name district poverty_pct ell_pct swd_pct
1556 13K869 2020 District 13 PRE-K Center 13 0.076 0.000 0.015
2129 13K915 2020 I.S. 915 13 0.309 0.023 0.201
2130 13K915 2020 I.S. 915 13 0.309 0.023 0.201
2131 13K915 2020 I.S. 915 13 0.309 0.023 0.201
2132 13K915 2020 I.S. 915 13 0.309 0.023 0.201
df[(df.poverty_pct > .9) | (df.ell_pct >= .4)].head()
dbn ay school_name district poverty_pct ell_pct swd_pct
15 01M034 2016 P.S. 034 Franklin D. Roosevelt 1 0.96 0.077 0.371
16 01M034 2017 P.S. 034 Franklin D. Roosevelt 1 0.96 0.075 0.372
17 01M034 2018 P.S. 034 Franklin D. Roosevelt 1 0.96 0.072 0.384
18 01M034 2019 P.S. 034 Franklin D. Roosevelt 1 0.96 0.057 0.395
19 01M034 2020 P.S. 034 Franklin D. Roosevelt 1 0.96 0.062 0.392
# this works, but it is confusing and error prone
df[(df.ay == 2020)  & ((df.poverty_pct > .9) | (df.ell_pct >= .4))].head()
dbn ay school_name district poverty_pct ell_pct swd_pct
19 01M034 2020 P.S. 034 Franklin D. Roosevelt 1 0.960 0.062 0.392
29 01M064 2020 P.S. 064 Robert Simon 1 0.914 0.018 0.264
39 01M134 2020 P.S. 134 Henrietta Szold 1 0.949 0.055 0.419
44 01M140 2020 P.S. 140 Nathan Straus 1 0.950 0.073 0.372
49 01M142 2020 P.S. 142 Amalia Castro 1 0.946 0.066 0.284
# rather than mixing | and &, we can write our code more clearly in two lines
data = df[df.ay == 2020]
data = data[(data.poverty_pct > .9) | (data.ell_pct >= .4)]
data.head()
dbn ay school_name district poverty_pct ell_pct swd_pct
19 01M034 2020 P.S. 034 Franklin D. Roosevelt 1 0.960 0.062 0.392
29 01M064 2020 P.S. 064 Robert Simon 1 0.914 0.018 0.264
39 01M134 2020 P.S. 134 Henrietta Szold 1 0.949 0.055 0.419
44 01M140 2020 P.S. 140 Nathan Straus 1 0.950 0.073 0.372
49 01M142 2020 P.S. 142 Amalia Castro 1 0.946 0.066 0.284

Finding a subset of data#

DataFrame has a usefule function called isin() that lets us filter data on a Series if the value is in a set of values. First, we’ll do a simple example where we find all of the schools in a district [3,8,15]

districts = [3, 8, 5]
data = df[df.district.isin(districts)]
data[["dbn", "district", "school_name"]]
dbn district school_name
882 03M009 3 P.S. 009 Sarah Anderson
883 03M009 3 P.S. 009 Sarah Anderson
884 03M009 3 P.S. 009 Sarah Anderson
885 03M009 3 P.S. 009 Sarah Anderson
886 03M009 3 P.S. 009 Sarah Anderson
... ... ... ...
2142 08X562 8 Blueprint Middle School
2143 08X583 8 P.S. 583
2144 08X583 8 P.S. 583
2145 08X583 8 P.S. 583
2146 08X583 8 P.S. 583

675 rows × 3 columns

Using drop_duplicates#

In the display of the table above, since we chose not to show all of the columns, we have repeated columns. Sometimes we will want just the unique results. We saw that unique() works for a Series. It doesn’t work for a dataframe, however.

Luckily, we have drop_duplicates() which serves a similar function…see:

data[["dbn", "district", "school_name"]].drop_duplicates()
dbn district school_name
882 03M009 3 P.S. 009 Sarah Anderson
887 03M054 3 J.H.S. 054 Booker T. Washington
892 03M075 3 P.S. 075 Emily Dickinson
897 03M076 3 P.S. 076 A. Philip Randolph
902 03M084 3 P.S. 084 Lillian Weber
... ... ... ...
2119 08X558 8 Westchester Square Academy
2124 08X559 8 School for Tourism and Hospitality
2133 08X561 8 Bronx Compass High School
2138 08X562 8 Blueprint Middle School
2143 08X583 8 P.S. 583

124 rows × 3 columns

Filtering missing data#

You will not always have complete data, and sometimes you need to find rows with missing records. You may want to update them with some default or derived values, or, you may simpley want to exclude them from your data.

In this example, we will use nycschools to load the grade 3-8 ELA NYS exams. We will then filter out all of the rows that don’t have a score. Some rows are null because not enough students took the exam. Others may be missing data because their students were exempt from the exam or other reasons.

from nycschools import exams
ela = exams.load_ela()
print(f"There are {len(ela):,} total records in ela")
ela.head()
There are 430,025 total records in ela
dbn grade category number_tested mean_scale_score level_1_n level_1_pct level_2_n level_2_pct level_3_n level_3_pct level_4_n level_4_pct level_3_4_n level_3_4_pct test_year ay charter school_name
0 01M015 3 All Students 27 289.296295 14.0 0.518519 11.0 0.407407 2.0 0.074074 0.0 0.0 2.0 0.074074 2013 2012 0 NaN
7 01M015 4 All Students 20 277.649994 8.0 0.400000 11.0 0.550000 1.0 0.050000 0.0 0.0 1.0 0.050000 2013 2012 0 NaN
14 01M015 5 All Students 24 283.958344 12.0 0.500000 11.0 0.458333 1.0 0.041667 0.0 0.0 1.0 0.041667 2013 2012 0 NaN
21 01M015 All Grades All Students 71 284.211273 34.0 0.478873 33.0 0.464789 4.0 0.056338 0.0 0.0 4.0 0.056338 2013 2012 0 NaN
32826 01M015 3 Not SWD 19 287.157898 11.0 0.578947 8.0 0.421053 0.0 0.000000 0.0 0.0 0.0 0.000000 2013 2012 0 NaN
missing = ela[ela.mean_scale_score.isnull()]
print(f"There are {len(missing):,} records in ela with null values for mean_scale_score. NaN means 'Not a Number'")

missing
There are 90,576 records in ela with null values for mean_scale_score. NaN means 'Not a Number'
dbn grade category number_tested mean_scale_score level_1_n level_1_pct level_2_n level_2_pct level_3_n level_3_pct level_4_n level_4_pct level_3_4_n level_3_4_pct test_year ay charter school_name
97989 01M015 3 Asian 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013 2012 0 NaN
97995 01M015 4 Asian 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013 2012 0 NaN
98001 01M015 5 Asian 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013 2012 0 NaN
98007 01M015 All Grades Asian 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013 2012 0 NaN
98014 01M015 3 Black 11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013 2012 0 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
343977 32K562 6 Not Econ Disadv 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2019 2018 0 NaN
427666 84M481 5 All Students 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013 2012 1 DEMOCRACY PREP HARLEM CHARTER SCHOOL
427675 84M481 7 All Students 110 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2013 2012 1 DEMOCRACY PREP HARLEM CHARTER SCHOOL
429139 84X394 5 All Students 31 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2014 2013 1 MOTT HAVEN ACADEMY CHARTER SCHOOL
429145 84X394 6 All Students 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2014 2013 1 MOTT HAVEN ACADEMY CHARTER SCHOOL

90576 rows × 19 columns