Querying & Selecting
Contents
Querying & Selecting#
In this Notebook we go over some of the ways of querying data in a dataframe.
[]
slice notation for subsets of dataquery()
method ofDataFrame
to search datausing
isin()
,isnull()
drop_duplicates()
# import schools from the nycschool package
from nycschools import schools, exams
# load the demographic data into a `DataFrame` called df
df = schools.load_school_demographics()
# we already saw how to get a subset of columns
# let's get just these columns from out data
cols = ["dbn", "ay", "school_name", "district", "poverty_pct", "ell_pct", "swd_pct"]
df = df[cols]
df
dbn | ay | school_name | district | poverty_pct | ell_pct | swd_pct | |
---|---|---|---|---|---|---|---|
0 | 01M015 | 2016 | P.S. 015 Roberto Clemente | 1 | 0.854 | 0.067 | 0.287000 |
1 | 01M015 | 2017 | P.S. 015 Roberto Clemente | 1 | 0.847 | 0.042 | 0.258000 |
2 | 01M015 | 2018 | P.S. 015 Roberto Clemente | 1 | 0.845 | 0.046 | 0.224000 |
3 | 01M015 | 2019 | P.S. 015 Roberto Clemente | 1 | 0.816 | 0.089 | 0.242000 |
4 | 01M015 | 2020 | P.S. 015 Roberto Clemente | 1 | 0.819 | 0.109 | 0.223000 |
... | ... | ... | ... | ... | ... | ... | ... |
9996 | 84X730 | 2016 | Bronx Charter School for the Arts | 84 | 0.734 | 0.159 | 0.209375 |
9997 | 84X730 | 2017 | Bronx Charter School for the Arts | 84 | 0.822 | 0.182 | 0.216561 |
9998 | 84X730 | 2018 | Bronx Charter School for the Arts | 84 | 0.844 | 0.165 | 0.239535 |
9999 | 84X730 | 2019 | Bronx Charter School for the Arts | 84 | 0.866 | 0.132 | 0.223709 |
10000 | 84X730 | 2020 | Bronx Charter School for the Arts | 84 | 0.864 | 0.125 | 0.244409 |
10001 rows × 7 columns
Selecting just one year#
We can see that our data contains multiple years of data for each school. We can filter or query the data to just get a single year.
To do this, we will use the slice notation similary to above, however, instead of a list of columns, we have a Boolean expression using one of the columns in our data set.
df[df.ay == 2020]
returns only the rows where ay
equals 2020
ay_2020 = df[df.ay == 2020]
ay_2020.head()
dbn | ay | school_name | district | poverty_pct | ell_pct | swd_pct | |
---|---|---|---|---|---|---|---|
4 | 01M015 | 2020 | P.S. 015 Roberto Clemente | 1 | 0.819 | 0.109 | 0.223 |
9 | 01M019 | 2020 | P.S. 019 Asher Levy | 1 | 0.712 | 0.042 | 0.392 |
14 | 01M020 | 2020 | P.S. 020 Anna Silver | 1 | 0.709 | 0.119 | 0.218 |
19 | 01M034 | 2020 | P.S. 034 Franklin D. Roosevelt | 1 | 0.960 | 0.062 | 0.392 |
24 | 01M063 | 2020 | The STAR Academy - P.S.63 | 1 | 0.769 | 0.014 | 0.279 |
Compound expressions#
We can use Boolean operators to make more complex queries.
pandas
uses the &
operator for Boolean and
and the |
operator for Boolean or. Note that you should wrap
your equality expressions inside of parenethesis.
To write bug-free code, we do not recommend mixing and
and or
clauses in the same query.
Below are two examples:
df[(df.district == 13) & (df.ay == 2020)]
find data forschools in district 13 and academic year 2020-21df[(df.poverty_pct > .9) | (df.ell_pct >= .4)]
find school data where the school’s poverty percent is greater than 90% or the percent of studnts classified as English Language Learners is greater than or equal to 40%
df[(df.district == 13) & (df.ay == 2020)].head()
dbn | ay | school_name | district | poverty_pct | ell_pct | swd_pct | |
---|---|---|---|---|---|---|---|
1556 | 13K869 | 2020 | District 13 PRE-K Center | 13 | 0.076 | 0.000 | 0.015 |
2129 | 13K915 | 2020 | I.S. 915 | 13 | 0.309 | 0.023 | 0.201 |
2130 | 13K915 | 2020 | I.S. 915 | 13 | 0.309 | 0.023 | 0.201 |
2131 | 13K915 | 2020 | I.S. 915 | 13 | 0.309 | 0.023 | 0.201 |
2132 | 13K915 | 2020 | I.S. 915 | 13 | 0.309 | 0.023 | 0.201 |
df[(df.poverty_pct > .9) | (df.ell_pct >= .4)].head()
dbn | ay | school_name | district | poverty_pct | ell_pct | swd_pct | |
---|---|---|---|---|---|---|---|
15 | 01M034 | 2016 | P.S. 034 Franklin D. Roosevelt | 1 | 0.96 | 0.077 | 0.371 |
16 | 01M034 | 2017 | P.S. 034 Franklin D. Roosevelt | 1 | 0.96 | 0.075 | 0.372 |
17 | 01M034 | 2018 | P.S. 034 Franklin D. Roosevelt | 1 | 0.96 | 0.072 | 0.384 |
18 | 01M034 | 2019 | P.S. 034 Franklin D. Roosevelt | 1 | 0.96 | 0.057 | 0.395 |
19 | 01M034 | 2020 | P.S. 034 Franklin D. Roosevelt | 1 | 0.96 | 0.062 | 0.392 |
# this works, but it is confusing and error prone
df[(df.ay == 2020) & ((df.poverty_pct > .9) | (df.ell_pct >= .4))].head()
dbn | ay | school_name | district | poverty_pct | ell_pct | swd_pct | |
---|---|---|---|---|---|---|---|
19 | 01M034 | 2020 | P.S. 034 Franklin D. Roosevelt | 1 | 0.960 | 0.062 | 0.392 |
29 | 01M064 | 2020 | P.S. 064 Robert Simon | 1 | 0.914 | 0.018 | 0.264 |
39 | 01M134 | 2020 | P.S. 134 Henrietta Szold | 1 | 0.949 | 0.055 | 0.419 |
44 | 01M140 | 2020 | P.S. 140 Nathan Straus | 1 | 0.950 | 0.073 | 0.372 |
49 | 01M142 | 2020 | P.S. 142 Amalia Castro | 1 | 0.946 | 0.066 | 0.284 |
# rather than mixing | and &, we can write our code more clearly in two lines
data = df[df.ay == 2020]
data = data[(data.poverty_pct > .9) | (data.ell_pct >= .4)]
data.head()
dbn | ay | school_name | district | poverty_pct | ell_pct | swd_pct | |
---|---|---|---|---|---|---|---|
19 | 01M034 | 2020 | P.S. 034 Franklin D. Roosevelt | 1 | 0.960 | 0.062 | 0.392 |
29 | 01M064 | 2020 | P.S. 064 Robert Simon | 1 | 0.914 | 0.018 | 0.264 |
39 | 01M134 | 2020 | P.S. 134 Henrietta Szold | 1 | 0.949 | 0.055 | 0.419 |
44 | 01M140 | 2020 | P.S. 140 Nathan Straus | 1 | 0.950 | 0.073 | 0.372 |
49 | 01M142 | 2020 | P.S. 142 Amalia Castro | 1 | 0.946 | 0.066 | 0.284 |
Finding a subset of data#
DataFrame
has a usefule function called isin()
that lets us filter data on a Series
if the value is in a set of values. First, we’ll do a simple example where we find
all of the schools in a district [3,8,15]
districts = [3, 8, 5]
data = df[df.district.isin(districts)]
data[["dbn", "district", "school_name"]]
dbn | district | school_name | |
---|---|---|---|
882 | 03M009 | 3 | P.S. 009 Sarah Anderson |
883 | 03M009 | 3 | P.S. 009 Sarah Anderson |
884 | 03M009 | 3 | P.S. 009 Sarah Anderson |
885 | 03M009 | 3 | P.S. 009 Sarah Anderson |
886 | 03M009 | 3 | P.S. 009 Sarah Anderson |
... | ... | ... | ... |
2142 | 08X562 | 8 | Blueprint Middle School |
2143 | 08X583 | 8 | P.S. 583 |
2144 | 08X583 | 8 | P.S. 583 |
2145 | 08X583 | 8 | P.S. 583 |
2146 | 08X583 | 8 | P.S. 583 |
675 rows × 3 columns
Using drop_duplicates
#
In the display of the table above, since we chose not to show all of the columns, we have repeated columns. Sometimes we will want just the unique results. We saw that unique()
works for a Series
. It doesn’t work for a dataframe, however.
Luckily, we have drop_duplicates()
which serves a similar function…see:
data[["dbn", "district", "school_name"]].drop_duplicates()
dbn | district | school_name | |
---|---|---|---|
882 | 03M009 | 3 | P.S. 009 Sarah Anderson |
887 | 03M054 | 3 | J.H.S. 054 Booker T. Washington |
892 | 03M075 | 3 | P.S. 075 Emily Dickinson |
897 | 03M076 | 3 | P.S. 076 A. Philip Randolph |
902 | 03M084 | 3 | P.S. 084 Lillian Weber |
... | ... | ... | ... |
2119 | 08X558 | 8 | Westchester Square Academy |
2124 | 08X559 | 8 | School for Tourism and Hospitality |
2133 | 08X561 | 8 | Bronx Compass High School |
2138 | 08X562 | 8 | Blueprint Middle School |
2143 | 08X583 | 8 | P.S. 583 |
124 rows × 3 columns
Filtering missing data#
You will not always have complete data, and sometimes you need to find rows with missing records. You may want to update them with some default or derived values, or, you may simpley want to exclude them from your data.
In this example, we will use nycschools
to load the grade 3-8 ELA NYS exams. We will then filter out all of the rows that don’t have a score. Some rows are null because not enough students took the exam. Others may be missing data because their students were exempt from the exam or other reasons.
from nycschools import exams
ela = exams.load_ela()
print(f"There are {len(ela):,} total records in ela")
ela.head()
There are 430,025 total records in ela
dbn | grade | category | number_tested | mean_scale_score | level_1_n | level_1_pct | level_2_n | level_2_pct | level_3_n | level_3_pct | level_4_n | level_4_pct | level_3_4_n | level_3_4_pct | test_year | ay | charter | school_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01M015 | 3 | All Students | 27 | 289.296295 | 14.0 | 0.518519 | 11.0 | 0.407407 | 2.0 | 0.074074 | 0.0 | 0.0 | 2.0 | 0.074074 | 2013 | 2012 | 0 | NaN |
7 | 01M015 | 4 | All Students | 20 | 277.649994 | 8.0 | 0.400000 | 11.0 | 0.550000 | 1.0 | 0.050000 | 0.0 | 0.0 | 1.0 | 0.050000 | 2013 | 2012 | 0 | NaN |
14 | 01M015 | 5 | All Students | 24 | 283.958344 | 12.0 | 0.500000 | 11.0 | 0.458333 | 1.0 | 0.041667 | 0.0 | 0.0 | 1.0 | 0.041667 | 2013 | 2012 | 0 | NaN |
21 | 01M015 | All Grades | All Students | 71 | 284.211273 | 34.0 | 0.478873 | 33.0 | 0.464789 | 4.0 | 0.056338 | 0.0 | 0.0 | 4.0 | 0.056338 | 2013 | 2012 | 0 | NaN |
32826 | 01M015 | 3 | Not SWD | 19 | 287.157898 | 11.0 | 0.578947 | 8.0 | 0.421053 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 2013 | 2012 | 0 | NaN |
missing = ela[ela.mean_scale_score.isnull()]
print(f"There are {len(missing):,} records in ela with null values for mean_scale_score. NaN means 'Not a Number'")
missing
There are 90,576 records in ela with null values for mean_scale_score. NaN means 'Not a Number'
dbn | grade | category | number_tested | mean_scale_score | level_1_n | level_1_pct | level_2_n | level_2_pct | level_3_n | level_3_pct | level_4_n | level_4_pct | level_3_4_n | level_3_4_pct | test_year | ay | charter | school_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
97989 | 01M015 | 3 | Asian | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2013 | 2012 | 0 | NaN |
97995 | 01M015 | 4 | Asian | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2013 | 2012 | 0 | NaN |
98001 | 01M015 | 5 | Asian | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2013 | 2012 | 0 | NaN |
98007 | 01M015 | All Grades | Asian | 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2013 | 2012 | 0 | NaN |
98014 | 01M015 | 3 | Black | 11 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2013 | 2012 | 0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
343977 | 32K562 | 6 | Not Econ Disadv | 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2019 | 2018 | 0 | NaN |
427666 | 84M481 | 5 | All Students | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2013 | 2012 | 1 | DEMOCRACY PREP HARLEM CHARTER SCHOOL |
427675 | 84M481 | 7 | All Students | 110 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2013 | 2012 | 1 | DEMOCRACY PREP HARLEM CHARTER SCHOOL |
429139 | 84X394 | 5 | All Students | 31 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2014 | 2013 | 1 | MOTT HAVEN ACADEMY CHARTER SCHOOL |
429145 | 84X394 | 6 | All Students | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2014 | 2013 | 1 | MOTT HAVEN ACADEMY CHARTER SCHOOL |
90576 rows × 19 columns