{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "e81cdfdf",
"metadata": {},
"source": [
"Querying & Selecting\n",
"====================\n",
"\n",
"In this Notebook we go over some of the ways of querying data in a dataframe.\n",
"\n",
"- `[]` slice notation for subsets of data\n",
"- `query()` method of `DataFrame` to search data\n",
"- using `isin()`, `isnull()`\n",
"- drop_duplicates()\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "55795e7b",
"metadata": {},
"outputs": [],
"source": [
"# import schools from the nycschool package\n",
"from nycschools import schools, exams\n",
"\n",
"# load the demographic data into a `DataFrame` called df\n",
"df = schools.load_school_demographics()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "59e7529a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" dbn | \n",
" ay | \n",
" school_name | \n",
" district | \n",
" poverty_pct | \n",
" ell_pct | \n",
" swd_pct | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 01M015 | \n",
" 2016 | \n",
" P.S. 015 Roberto Clemente | \n",
" 1 | \n",
" 0.854 | \n",
" 0.067 | \n",
" 0.287000 | \n",
"
\n",
" \n",
" 1 | \n",
" 01M015 | \n",
" 2017 | \n",
" P.S. 015 Roberto Clemente | \n",
" 1 | \n",
" 0.847 | \n",
" 0.042 | \n",
" 0.258000 | \n",
"
\n",
" \n",
" 2 | \n",
" 01M015 | \n",
" 2018 | \n",
" P.S. 015 Roberto Clemente | \n",
" 1 | \n",
" 0.845 | \n",
" 0.046 | \n",
" 0.224000 | \n",
"
\n",
" \n",
" 3 | \n",
" 01M015 | \n",
" 2019 | \n",
" P.S. 015 Roberto Clemente | \n",
" 1 | \n",
" 0.816 | \n",
" 0.089 | \n",
" 0.242000 | \n",
"
\n",
" \n",
" 4 | \n",
" 01M015 | \n",
" 2020 | \n",
" P.S. 015 Roberto Clemente | \n",
" 1 | \n",
" 0.819 | \n",
" 0.109 | \n",
" 0.223000 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 9996 | \n",
" 84X730 | \n",
" 2016 | \n",
" Bronx Charter School for the Arts | \n",
" 84 | \n",
" 0.734 | \n",
" 0.159 | \n",
" 0.209375 | \n",
"
\n",
" \n",
" 9997 | \n",
" 84X730 | \n",
" 2017 | \n",
" Bronx Charter School for the Arts | \n",
" 84 | \n",
" 0.822 | \n",
" 0.182 | \n",
" 0.216561 | \n",
"
\n",
" \n",
" 9998 | \n",
" 84X730 | \n",
" 2018 | \n",
" Bronx Charter School for the Arts | \n",
" 84 | \n",
" 0.844 | \n",
" 0.165 | \n",
" 0.239535 | \n",
"
\n",
" \n",
" 9999 | \n",
" 84X730 | \n",
" 2019 | \n",
" Bronx Charter School for the Arts | \n",
" 84 | \n",
" 0.866 | \n",
" 0.132 | \n",
" 0.223709 | \n",
"
\n",
" \n",
" 10000 | \n",
" 84X730 | \n",
" 2020 | \n",
" Bronx Charter School for the Arts | \n",
" 84 | \n",
" 0.864 | \n",
" 0.125 | \n",
" 0.244409 | \n",
"
\n",
" \n",
"
\n",
"
10001 rows × 7 columns
\n",
"
"
],
"text/plain": [
" dbn ay school_name district poverty_pct \\\n",
"0 01M015 2016 P.S. 015 Roberto Clemente 1 0.854 \n",
"1 01M015 2017 P.S. 015 Roberto Clemente 1 0.847 \n",
"2 01M015 2018 P.S. 015 Roberto Clemente 1 0.845 \n",
"3 01M015 2019 P.S. 015 Roberto Clemente 1 0.816 \n",
"4 01M015 2020 P.S. 015 Roberto Clemente 1 0.819 \n",
"... ... ... ... ... ... \n",
"9996 84X730 2016 Bronx Charter School for the Arts 84 0.734 \n",
"9997 84X730 2017 Bronx Charter School for the Arts 84 0.822 \n",
"9998 84X730 2018 Bronx Charter School for the Arts 84 0.844 \n",
"9999 84X730 2019 Bronx Charter School for the Arts 84 0.866 \n",
"10000 84X730 2020 Bronx Charter School for the Arts 84 0.864 \n",
"\n",
" ell_pct swd_pct \n",
"0 0.067 0.287000 \n",
"1 0.042 0.258000 \n",
"2 0.046 0.224000 \n",
"3 0.089 0.242000 \n",
"4 0.109 0.223000 \n",
"... ... ... \n",
"9996 0.159 0.209375 \n",
"9997 0.182 0.216561 \n",
"9998 0.165 0.239535 \n",
"9999 0.132 0.223709 \n",
"10000 0.125 0.244409 \n",
"\n",
"[10001 rows x 7 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# we already saw how to get a subset of columns\n",
"# let's get just these columns from out data\n",
"cols = [\"dbn\", \"ay\", \"school_name\", \"district\", \"poverty_pct\", \"ell_pct\", \"swd_pct\"]\n",
"df = df[cols]\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "af71b571",
"metadata": {},
"source": [
"Selecting just one year\n",
"---------------------------------\n",
"We can see that our data contains multiple years of data for each school.\n",
"We can filter or query the data to just get a single year.\n",
"\n",
"To do this, we will use the slice notation similary to above, however,\n",
"instead of a list of columns, we have a Boolean expression using\n",
"one of the columns in our data set.\n",
"\n",
"`df[df.ay == 2020]` returns only the rows where `ay` equals `2020`\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "4fe49f51",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" dbn | \n",
" ay | \n",
" school_name | \n",
" district | \n",
" poverty_pct | \n",
" ell_pct | \n",
" swd_pct | \n",
"
\n",
" \n",
" \n",
" \n",
" 4 | \n",
" 01M015 | \n",
" 2020 | \n",
" P.S. 015 Roberto Clemente | \n",
" 1 | \n",
" 0.819 | \n",
" 0.109 | \n",
" 0.223 | \n",
"
\n",
" \n",
" 9 | \n",
" 01M019 | \n",
" 2020 | \n",
" P.S. 019 Asher Levy | \n",
" 1 | \n",
" 0.712 | \n",
" 0.042 | \n",
" 0.392 | \n",
"
\n",
" \n",
" 14 | \n",
" 01M020 | \n",
" 2020 | \n",
" P.S. 020 Anna Silver | \n",
" 1 | \n",
" 0.709 | \n",
" 0.119 | \n",
" 0.218 | \n",
"
\n",
" \n",
" 19 | \n",
" 01M034 | \n",
" 2020 | \n",
" P.S. 034 Franklin D. Roosevelt | \n",
" 1 | \n",
" 0.960 | \n",
" 0.062 | \n",
" 0.392 | \n",
"
\n",
" \n",
" 24 | \n",
" 01M063 | \n",
" 2020 | \n",
" The STAR Academy - P.S.63 | \n",
" 1 | \n",
" 0.769 | \n",
" 0.014 | \n",
" 0.279 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" dbn ay school_name district poverty_pct \\\n",
"4 01M015 2020 P.S. 015 Roberto Clemente 1 0.819 \n",
"9 01M019 2020 P.S. 019 Asher Levy 1 0.712 \n",
"14 01M020 2020 P.S. 020 Anna Silver 1 0.709 \n",
"19 01M034 2020 P.S. 034 Franklin D. Roosevelt 1 0.960 \n",
"24 01M063 2020 The STAR Academy - P.S.63 1 0.769 \n",
"\n",
" ell_pct swd_pct \n",
"4 0.109 0.223 \n",
"9 0.042 0.392 \n",
"14 0.119 0.218 \n",
"19 0.062 0.392 \n",
"24 0.014 0.279 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ay_2020 = df[df.ay == 2020]\n",
"ay_2020.head()"
]
},
{
"cell_type": "markdown",
"id": "46fac74e",
"metadata": {},
"source": [
"Compound expressions\n",
"----------------------------------\n",
"We can use Boolean operators to make more complex queries. \n",
"`pandas` uses the `&` operator for Boolean **and**\n",
"and the `|` operator for Boolean **or**. Note that you should wrap\n",
"your equality expressions inside of parenethesis.\n",
"\n",
"To write bug-free code, we do not recommend mixing `and` and `or`\n",
"clauses in the same query.\n",
"\n",
"Below are two examples:\n",
"\n",
"- `df[(df.district == 13) & (df.ay == 2020)]`\n",
" find data forschools in district 13 _and_ academic year 2020-21\n",
"- `df[(df.poverty_pct > .9) | (df.ell_pct >= .4)]`\n",
" find school data where the school's poverty percent is greater than 90% _or_\n",
" the percent of studnts classified as English Language Learners is greater\n",
" than or equal to 40%\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "df67708e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" dbn | \n",
" ay | \n",
" school_name | \n",
" district | \n",
" poverty_pct | \n",
" ell_pct | \n",
" swd_pct | \n",
"
\n",
" \n",
" \n",
" \n",
" 1556 | \n",
" 13K869 | \n",
" 2020 | \n",
" District 13 PRE-K Center | \n",
" 13 | \n",
" 0.076 | \n",
" 0.000 | \n",
" 0.015 | \n",
"
\n",
" \n",
" 2129 | \n",
" 13K915 | \n",
" 2020 | \n",
" I.S. 915 | \n",
" 13 | \n",
" 0.309 | \n",
" 0.023 | \n",
" 0.201 | \n",
"
\n",
" \n",
" 2130 | \n",
" 13K915 | \n",
" 2020 | \n",
" I.S. 915 | \n",
" 13 | \n",
" 0.309 | \n",
" 0.023 | \n",
" 0.201 | \n",
"
\n",
" \n",
" 2131 | \n",
" 13K915 | \n",
" 2020 | \n",
" I.S. 915 | \n",
" 13 | \n",
" 0.309 | \n",
" 0.023 | \n",
" 0.201 | \n",
"
\n",
" \n",
" 2132 | \n",
" 13K915 | \n",
" 2020 | \n",
" I.S. 915 | \n",
" 13 | \n",
" 0.309 | \n",
" 0.023 | \n",
" 0.201 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" dbn ay school_name district poverty_pct ell_pct \\\n",
"1556 13K869 2020 District 13 PRE-K Center 13 0.076 0.000 \n",
"2129 13K915 2020 I.S. 915 13 0.309 0.023 \n",
"2130 13K915 2020 I.S. 915 13 0.309 0.023 \n",
"2131 13K915 2020 I.S. 915 13 0.309 0.023 \n",
"2132 13K915 2020 I.S. 915 13 0.309 0.023 \n",
"\n",
" swd_pct \n",
"1556 0.015 \n",
"2129 0.201 \n",
"2130 0.201 \n",
"2131 0.201 \n",
"2132 0.201 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[(df.district == 13) & (df.ay == 2020)].head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "428977ec",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" dbn | \n",
" ay | \n",
" school_name | \n",
" district | \n",
" poverty_pct | \n",
" ell_pct | \n",
" swd_pct | \n",
"
\n",
" \n",
" \n",
" \n",
" 15 | \n",
" 01M034 | \n",
" 2016 | \n",
" P.S. 034 Franklin D. Roosevelt | \n",
" 1 | \n",
" 0.96 | \n",
" 0.077 | \n",
" 0.371 | \n",
"
\n",
" \n",
" 16 | \n",
" 01M034 | \n",
" 2017 | \n",
" P.S. 034 Franklin D. Roosevelt | \n",
" 1 | \n",
" 0.96 | \n",
" 0.075 | \n",
" 0.372 | \n",
"
\n",
" \n",
" 17 | \n",
" 01M034 | \n",
" 2018 | \n",
" P.S. 034 Franklin D. Roosevelt | \n",
" 1 | \n",
" 0.96 | \n",
" 0.072 | \n",
" 0.384 | \n",
"
\n",
" \n",
" 18 | \n",
" 01M034 | \n",
" 2019 | \n",
" P.S. 034 Franklin D. Roosevelt | \n",
" 1 | \n",
" 0.96 | \n",
" 0.057 | \n",
" 0.395 | \n",
"
\n",
" \n",
" 19 | \n",
" 01M034 | \n",
" 2020 | \n",
" P.S. 034 Franklin D. Roosevelt | \n",
" 1 | \n",
" 0.96 | \n",
" 0.062 | \n",
" 0.392 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" dbn ay school_name district poverty_pct \\\n",
"15 01M034 2016 P.S. 034 Franklin D. Roosevelt 1 0.96 \n",
"16 01M034 2017 P.S. 034 Franklin D. Roosevelt 1 0.96 \n",
"17 01M034 2018 P.S. 034 Franklin D. Roosevelt 1 0.96 \n",
"18 01M034 2019 P.S. 034 Franklin D. Roosevelt 1 0.96 \n",
"19 01M034 2020 P.S. 034 Franklin D. Roosevelt 1 0.96 \n",
"\n",
" ell_pct swd_pct \n",
"15 0.077 0.371 \n",
"16 0.075 0.372 \n",
"17 0.072 0.384 \n",
"18 0.057 0.395 \n",
"19 0.062 0.392 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[(df.poverty_pct > .9) | (df.ell_pct >= .4)].head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "8042c44a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" dbn | \n",
" ay | \n",
" school_name | \n",
" district | \n",
" poverty_pct | \n",
" ell_pct | \n",
" swd_pct | \n",
"
\n",
" \n",
" \n",
" \n",
" 19 | \n",
" 01M034 | \n",
" 2020 | \n",
" P.S. 034 Franklin D. Roosevelt | \n",
" 1 | \n",
" 0.960 | \n",
" 0.062 | \n",
" 0.392 | \n",
"
\n",
" \n",
" 29 | \n",
" 01M064 | \n",
" 2020 | \n",
" P.S. 064 Robert Simon | \n",
" 1 | \n",
" 0.914 | \n",
" 0.018 | \n",
" 0.264 | \n",
"
\n",
" \n",
" 39 | \n",
" 01M134 | \n",
" 2020 | \n",
" P.S. 134 Henrietta Szold | \n",
" 1 | \n",
" 0.949 | \n",
" 0.055 | \n",
" 0.419 | \n",
"
\n",
" \n",
" 44 | \n",
" 01M140 | \n",
" 2020 | \n",
" P.S. 140 Nathan Straus | \n",
" 1 | \n",
" 0.950 | \n",
" 0.073 | \n",
" 0.372 | \n",
"
\n",
" \n",
" 49 | \n",
" 01M142 | \n",
" 2020 | \n",
" P.S. 142 Amalia Castro | \n",
" 1 | \n",
" 0.946 | \n",
" 0.066 | \n",
" 0.284 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" dbn ay school_name district poverty_pct \\\n",
"19 01M034 2020 P.S. 034 Franklin D. Roosevelt 1 0.960 \n",
"29 01M064 2020 P.S. 064 Robert Simon 1 0.914 \n",
"39 01M134 2020 P.S. 134 Henrietta Szold 1 0.949 \n",
"44 01M140 2020 P.S. 140 Nathan Straus 1 0.950 \n",
"49 01M142 2020 P.S. 142 Amalia Castro 1 0.946 \n",
"\n",
" ell_pct swd_pct \n",
"19 0.062 0.392 \n",
"29 0.018 0.264 \n",
"39 0.055 0.419 \n",
"44 0.073 0.372 \n",
"49 0.066 0.284 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# this works, but it is confusing and error prone\n",
"df[(df.ay == 2020) & ((df.poverty_pct > .9) | (df.ell_pct >= .4))].head()\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "762115c0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" dbn | \n",
" ay | \n",
" school_name | \n",
" district | \n",
" poverty_pct | \n",
" ell_pct | \n",
" swd_pct | \n",
"
\n",
" \n",
" \n",
" \n",
" 19 | \n",
" 01M034 | \n",
" 2020 | \n",
" P.S. 034 Franklin D. Roosevelt | \n",
" 1 | \n",
" 0.960 | \n",
" 0.062 | \n",
" 0.392 | \n",
"
\n",
" \n",
" 29 | \n",
" 01M064 | \n",
" 2020 | \n",
" P.S. 064 Robert Simon | \n",
" 1 | \n",
" 0.914 | \n",
" 0.018 | \n",
" 0.264 | \n",
"
\n",
" \n",
" 39 | \n",
" 01M134 | \n",
" 2020 | \n",
" P.S. 134 Henrietta Szold | \n",
" 1 | \n",
" 0.949 | \n",
" 0.055 | \n",
" 0.419 | \n",
"
\n",
" \n",
" 44 | \n",
" 01M140 | \n",
" 2020 | \n",
" P.S. 140 Nathan Straus | \n",
" 1 | \n",
" 0.950 | \n",
" 0.073 | \n",
" 0.372 | \n",
"
\n",
" \n",
" 49 | \n",
" 01M142 | \n",
" 2020 | \n",
" P.S. 142 Amalia Castro | \n",
" 1 | \n",
" 0.946 | \n",
" 0.066 | \n",
" 0.284 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" dbn ay school_name district poverty_pct \\\n",
"19 01M034 2020 P.S. 034 Franklin D. Roosevelt 1 0.960 \n",
"29 01M064 2020 P.S. 064 Robert Simon 1 0.914 \n",
"39 01M134 2020 P.S. 134 Henrietta Szold 1 0.949 \n",
"44 01M140 2020 P.S. 140 Nathan Straus 1 0.950 \n",
"49 01M142 2020 P.S. 142 Amalia Castro 1 0.946 \n",
"\n",
" ell_pct swd_pct \n",
"19 0.062 0.392 \n",
"29 0.018 0.264 \n",
"39 0.055 0.419 \n",
"44 0.073 0.372 \n",
"49 0.066 0.284 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# rather than mixing | and &, we can write our code more clearly in two lines\n",
"data = df[df.ay == 2020]\n",
"data = data[(data.poverty_pct > .9) | (data.ell_pct >= .4)]\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"id": "194e8ede",
"metadata": {},
"source": [
"Finding a subset of data\n",
"-----------------------------------\n",
"`DataFrame` has a usefule function called `isin()` that lets us filter data on a `Series`\n",
"if the value is in a set of values. First, we'll do a simple example where we find\n",
"all of the schools in a district `[3,8,15]`"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "e695ed78",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" dbn | \n",
" district | \n",
" school_name | \n",
"
\n",
" \n",
" \n",
" \n",
" 882 | \n",
" 03M009 | \n",
" 3 | \n",
" P.S. 009 Sarah Anderson | \n",
"
\n",
" \n",
" 883 | \n",
" 03M009 | \n",
" 3 | \n",
" P.S. 009 Sarah Anderson | \n",
"
\n",
" \n",
" 884 | \n",
" 03M009 | \n",
" 3 | \n",
" P.S. 009 Sarah Anderson | \n",
"
\n",
" \n",
" 885 | \n",
" 03M009 | \n",
" 3 | \n",
" P.S. 009 Sarah Anderson | \n",
"
\n",
" \n",
" 886 | \n",
" 03M009 | \n",
" 3 | \n",
" P.S. 009 Sarah Anderson | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 2142 | \n",
" 08X562 | \n",
" 8 | \n",
" Blueprint Middle School | \n",
"
\n",
" \n",
" 2143 | \n",
" 08X583 | \n",
" 8 | \n",
" P.S. 583 | \n",
"
\n",
" \n",
" 2144 | \n",
" 08X583 | \n",
" 8 | \n",
" P.S. 583 | \n",
"
\n",
" \n",
" 2145 | \n",
" 08X583 | \n",
" 8 | \n",
" P.S. 583 | \n",
"
\n",
" \n",
" 2146 | \n",
" 08X583 | \n",
" 8 | \n",
" P.S. 583 | \n",
"
\n",
" \n",
"
\n",
"
675 rows × 3 columns
\n",
"
"
],
"text/plain": [
" dbn district school_name\n",
"882 03M009 3 P.S. 009 Sarah Anderson\n",
"883 03M009 3 P.S. 009 Sarah Anderson\n",
"884 03M009 3 P.S. 009 Sarah Anderson\n",
"885 03M009 3 P.S. 009 Sarah Anderson\n",
"886 03M009 3 P.S. 009 Sarah Anderson\n",
"... ... ... ...\n",
"2142 08X562 8 Blueprint Middle School\n",
"2143 08X583 8 P.S. 583\n",
"2144 08X583 8 P.S. 583\n",
"2145 08X583 8 P.S. 583\n",
"2146 08X583 8 P.S. 583\n",
"\n",
"[675 rows x 3 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"districts = [3, 8, 5]\n",
"data = df[df.district.isin(districts)]\n",
"data[[\"dbn\", \"district\", \"school_name\"]]"
]
},
{
"cell_type": "markdown",
"id": "5ac71eb3",
"metadata": {},
"source": [
"Using `drop_duplicates`\n",
"-----------------------------------------------------\n",
"In the display of the table above, since we chose not to show all of the columns, we have repeated columns. Sometimes we will want just the unique results. We saw that `unique()` works for a `Series`. It doesn't work for a dataframe, however.\n",
"\n",
"Luckily, we have `drop_duplicates()` which serves a similar function...see:\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "3af8ba90",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" dbn | \n",
" district | \n",
" school_name | \n",
"
\n",
" \n",
" \n",
" \n",
" 882 | \n",
" 03M009 | \n",
" 3 | \n",
" P.S. 009 Sarah Anderson | \n",
"
\n",
" \n",
" 887 | \n",
" 03M054 | \n",
" 3 | \n",
" J.H.S. 054 Booker T. Washington | \n",
"
\n",
" \n",
" 892 | \n",
" 03M075 | \n",
" 3 | \n",
" P.S. 075 Emily Dickinson | \n",
"
\n",
" \n",
" 897 | \n",
" 03M076 | \n",
" 3 | \n",
" P.S. 076 A. Philip Randolph | \n",
"
\n",
" \n",
" 902 | \n",
" 03M084 | \n",
" 3 | \n",
" P.S. 084 Lillian Weber | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 2119 | \n",
" 08X558 | \n",
" 8 | \n",
" Westchester Square Academy | \n",
"
\n",
" \n",
" 2124 | \n",
" 08X559 | \n",
" 8 | \n",
" School for Tourism and Hospitality | \n",
"
\n",
" \n",
" 2133 | \n",
" 08X561 | \n",
" 8 | \n",
" Bronx Compass High School | \n",
"
\n",
" \n",
" 2138 | \n",
" 08X562 | \n",
" 8 | \n",
" Blueprint Middle School | \n",
"
\n",
" \n",
" 2143 | \n",
" 08X583 | \n",
" 8 | \n",
" P.S. 583 | \n",
"
\n",
" \n",
"
\n",
"
124 rows × 3 columns
\n",
"
"
],
"text/plain": [
" dbn district school_name\n",
"882 03M009 3 P.S. 009 Sarah Anderson\n",
"887 03M054 3 J.H.S. 054 Booker T. Washington\n",
"892 03M075 3 P.S. 075 Emily Dickinson\n",
"897 03M076 3 P.S. 076 A. Philip Randolph\n",
"902 03M084 3 P.S. 084 Lillian Weber\n",
"... ... ... ...\n",
"2119 08X558 8 Westchester Square Academy\n",
"2124 08X559 8 School for Tourism and Hospitality\n",
"2133 08X561 8 Bronx Compass High School\n",
"2138 08X562 8 Blueprint Middle School\n",
"2143 08X583 8 P.S. 583\n",
"\n",
"[124 rows x 3 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[[\"dbn\", \"district\", \"school_name\"]].drop_duplicates()"
]
},
{
"cell_type": "markdown",
"id": "9f4ad0e5",
"metadata": {},
"source": [
"Filtering missing data\n",
"-------------------------------\n",
"You will not always have complete data, and sometimes you need to find rows with missing records. You may want to update them with some default or derived values, or, you may simpley want to exclude them from your data.\n",
"\n",
"In this example, we will use `nycschools` to load the grade 3-8 ELA NYS exams. We will then filter out all of the rows that don't have a score. Some rows are null because not enough students took the exam. Others may be missing data because their students were exempt from the exam or other reasons."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "a8270f1c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 430,025 total records in ela\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" dbn | \n",
" grade | \n",
" category | \n",
" number_tested | \n",
" mean_scale_score | \n",
" level_1_n | \n",
" level_1_pct | \n",
" level_2_n | \n",
" level_2_pct | \n",
" level_3_n | \n",
" level_3_pct | \n",
" level_4_n | \n",
" level_4_pct | \n",
" level_3_4_n | \n",
" level_3_4_pct | \n",
" test_year | \n",
" ay | \n",
" charter | \n",
" school_name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 01M015 | \n",
" 3 | \n",
" All Students | \n",
" 27 | \n",
" 289.296295 | \n",
" 14.0 | \n",
" 0.518519 | \n",
" 11.0 | \n",
" 0.407407 | \n",
" 2.0 | \n",
" 0.074074 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 2.0 | \n",
" 0.074074 | \n",
" 2013 | \n",
" 2012 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 7 | \n",
" 01M015 | \n",
" 4 | \n",
" All Students | \n",
" 20 | \n",
" 277.649994 | \n",
" 8.0 | \n",
" 0.400000 | \n",
" 11.0 | \n",
" 0.550000 | \n",
" 1.0 | \n",
" 0.050000 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.050000 | \n",
" 2013 | \n",
" 2012 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 14 | \n",
" 01M015 | \n",
" 5 | \n",
" All Students | \n",
" 24 | \n",
" 283.958344 | \n",
" 12.0 | \n",
" 0.500000 | \n",
" 11.0 | \n",
" 0.458333 | \n",
" 1.0 | \n",
" 0.041667 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.041667 | \n",
" 2013 | \n",
" 2012 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 21 | \n",
" 01M015 | \n",
" All Grades | \n",
" All Students | \n",
" 71 | \n",
" 284.211273 | \n",
" 34.0 | \n",
" 0.478873 | \n",
" 33.0 | \n",
" 0.464789 | \n",
" 4.0 | \n",
" 0.056338 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 4.0 | \n",
" 0.056338 | \n",
" 2013 | \n",
" 2012 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 32826 | \n",
" 01M015 | \n",
" 3 | \n",
" Not SWD | \n",
" 19 | \n",
" 287.157898 | \n",
" 11.0 | \n",
" 0.578947 | \n",
" 8.0 | \n",
" 0.421053 | \n",
" 0.0 | \n",
" 0.000000 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.000000 | \n",
" 2013 | \n",
" 2012 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" dbn grade category number_tested mean_scale_score \\\n",
"0 01M015 3 All Students 27 289.296295 \n",
"7 01M015 4 All Students 20 277.649994 \n",
"14 01M015 5 All Students 24 283.958344 \n",
"21 01M015 All Grades All Students 71 284.211273 \n",
"32826 01M015 3 Not SWD 19 287.157898 \n",
"\n",
" level_1_n level_1_pct level_2_n level_2_pct level_3_n level_3_pct \\\n",
"0 14.0 0.518519 11.0 0.407407 2.0 0.074074 \n",
"7 8.0 0.400000 11.0 0.550000 1.0 0.050000 \n",
"14 12.0 0.500000 11.0 0.458333 1.0 0.041667 \n",
"21 34.0 0.478873 33.0 0.464789 4.0 0.056338 \n",
"32826 11.0 0.578947 8.0 0.421053 0.0 0.000000 \n",
"\n",
" level_4_n level_4_pct level_3_4_n level_3_4_pct test_year ay \\\n",
"0 0.0 0.0 2.0 0.074074 2013 2012 \n",
"7 0.0 0.0 1.0 0.050000 2013 2012 \n",
"14 0.0 0.0 1.0 0.041667 2013 2012 \n",
"21 0.0 0.0 4.0 0.056338 2013 2012 \n",
"32826 0.0 0.0 0.0 0.000000 2013 2012 \n",
"\n",
" charter school_name \n",
"0 0 NaN \n",
"7 0 NaN \n",
"14 0 NaN \n",
"21 0 NaN \n",
"32826 0 NaN "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from nycschools import exams\n",
"ela = exams.load_ela()\n",
"print(f\"There are {len(ela):,} total records in ela\")\n",
"ela.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "639abb57",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 90,576 records in ela with null values for mean_scale_score. NaN means 'Not a Number'\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" dbn | \n",
" grade | \n",
" category | \n",
" number_tested | \n",
" mean_scale_score | \n",
" level_1_n | \n",
" level_1_pct | \n",
" level_2_n | \n",
" level_2_pct | \n",
" level_3_n | \n",
" level_3_pct | \n",
" level_4_n | \n",
" level_4_pct | \n",
" level_3_4_n | \n",
" level_3_4_pct | \n",
" test_year | \n",
" ay | \n",
" charter | \n",
" school_name | \n",
"
\n",
" \n",
" \n",
" \n",
" 97989 | \n",
" 01M015 | \n",
" 3 | \n",
" Asian | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2013 | \n",
" 2012 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 97995 | \n",
" 01M015 | \n",
" 4 | \n",
" Asian | \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2013 | \n",
" 2012 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 98001 | \n",
" 01M015 | \n",
" 5 | \n",
" Asian | \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2013 | \n",
" 2012 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 98007 | \n",
" 01M015 | \n",
" All Grades | \n",
" Asian | \n",
" 5 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2013 | \n",
" 2012 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 98014 | \n",
" 01M015 | \n",
" 3 | \n",
" Black | \n",
" 11 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2013 | \n",
" 2012 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 343977 | \n",
" 32K562 | \n",
" 6 | \n",
" Not Econ Disadv | \n",
" 3 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2019 | \n",
" 2018 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" 427666 | \n",
" 84M481 | \n",
" 5 | \n",
" All Students | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2013 | \n",
" 2012 | \n",
" 1 | \n",
" DEMOCRACY PREP HARLEM CHARTER SCHOOL | \n",
"
\n",
" \n",
" 427675 | \n",
" 84M481 | \n",
" 7 | \n",
" All Students | \n",
" 110 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2013 | \n",
" 2012 | \n",
" 1 | \n",
" DEMOCRACY PREP HARLEM CHARTER SCHOOL | \n",
"
\n",
" \n",
" 429139 | \n",
" 84X394 | \n",
" 5 | \n",
" All Students | \n",
" 31 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2014 | \n",
" 2013 | \n",
" 1 | \n",
" MOTT HAVEN ACADEMY CHARTER SCHOOL | \n",
"
\n",
" \n",
" 429145 | \n",
" 84X394 | \n",
" 6 | \n",
" All Students | \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 2014 | \n",
" 2013 | \n",
" 1 | \n",
" MOTT HAVEN ACADEMY CHARTER SCHOOL | \n",
"
\n",
" \n",
"
\n",
"
90576 rows × 19 columns
\n",
"
"
],
"text/plain": [
" dbn grade category number_tested mean_scale_score \\\n",
"97989 01M015 3 Asian 1 NaN \n",
"97995 01M015 4 Asian 2 NaN \n",
"98001 01M015 5 Asian 2 NaN \n",
"98007 01M015 All Grades Asian 5 NaN \n",
"98014 01M015 3 Black 11 NaN \n",
"... ... ... ... ... ... \n",
"343977 32K562 6 Not Econ Disadv 3 NaN \n",
"427666 84M481 5 All Students 1 NaN \n",
"427675 84M481 7 All Students 110 NaN \n",
"429139 84X394 5 All Students 31 NaN \n",
"429145 84X394 6 All Students 1 NaN \n",
"\n",
" level_1_n level_1_pct level_2_n level_2_pct level_3_n \\\n",
"97989 NaN NaN NaN NaN NaN \n",
"97995 NaN NaN NaN NaN NaN \n",
"98001 NaN NaN NaN NaN NaN \n",
"98007 NaN NaN NaN NaN NaN \n",
"98014 NaN NaN NaN NaN NaN \n",
"... ... ... ... ... ... \n",
"343977 NaN NaN NaN NaN NaN \n",
"427666 NaN NaN NaN NaN NaN \n",
"427675 NaN NaN NaN NaN NaN \n",
"429139 NaN NaN NaN NaN NaN \n",
"429145 NaN NaN NaN NaN NaN \n",
"\n",
" level_3_pct level_4_n level_4_pct level_3_4_n level_3_4_pct \\\n",
"97989 NaN NaN NaN NaN NaN \n",
"97995 NaN NaN NaN NaN NaN \n",
"98001 NaN NaN NaN NaN NaN \n",
"98007 NaN NaN NaN NaN NaN \n",
"98014 NaN NaN NaN NaN NaN \n",
"... ... ... ... ... ... \n",
"343977 NaN NaN NaN NaN NaN \n",
"427666 NaN NaN NaN NaN NaN \n",
"427675 NaN NaN NaN NaN NaN \n",
"429139 NaN NaN NaN NaN NaN \n",
"429145 NaN NaN NaN NaN NaN \n",
"\n",
" test_year ay charter school_name \n",
"97989 2013 2012 0 NaN \n",
"97995 2013 2012 0 NaN \n",
"98001 2013 2012 0 NaN \n",
"98007 2013 2012 0 NaN \n",
"98014 2013 2012 0 NaN \n",
"... ... ... ... ... \n",
"343977 2019 2018 0 NaN \n",
"427666 2013 2012 1 DEMOCRACY PREP HARLEM CHARTER SCHOOL \n",
"427675 2013 2012 1 DEMOCRACY PREP HARLEM CHARTER SCHOOL \n",
"429139 2014 2013 1 MOTT HAVEN ACADEMY CHARTER SCHOOL \n",
"429145 2014 2013 1 MOTT HAVEN ACADEMY CHARTER SCHOOL \n",
"\n",
"[90576 rows x 19 columns]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"missing = ela[ela.mean_scale_score.isnull()]\n",
"print(f\"There are {len(missing):,} records in ela with null values for mean_scale_score. NaN means 'Not a Number'\")\n",
"\n",
"missing"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "school-data",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
},
"vscode": {
"interpreter": {
"hash": "c853444e20c489e5b96d8e1a4533affead1d94f1ba40ff9ef08cffb9c8ee794e"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}