{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "e81cdfdf", "metadata": {}, "source": [ "Querying & Selecting\n", "====================\n", "\n", "In this Notebook we go over some of the ways of querying data in a dataframe.\n", "\n", "- `[]` slice notation for subsets of data\n", "- `query()` method of `DataFrame` to search data\n", "- using `isin()`, `isnull()`\n", "- drop_duplicates()\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "55795e7b", "metadata": {}, "outputs": [], "source": [ "# import schools from the nycschool package\n", "from nycschools import schools, exams\n", "\n", "# load the demographic data into a `DataFrame` called df\n", "df = schools.load_school_demographics()" ] }, { "cell_type": "code", "execution_count": 2, "id": "59e7529a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dbnayschool_namedistrictpoverty_pctell_pctswd_pct
001M0152016P.S. 015 Roberto Clemente10.8540.0670.287000
101M0152017P.S. 015 Roberto Clemente10.8470.0420.258000
201M0152018P.S. 015 Roberto Clemente10.8450.0460.224000
301M0152019P.S. 015 Roberto Clemente10.8160.0890.242000
401M0152020P.S. 015 Roberto Clemente10.8190.1090.223000
........................
999684X7302016Bronx Charter School for the Arts840.7340.1590.209375
999784X7302017Bronx Charter School for the Arts840.8220.1820.216561
999884X7302018Bronx Charter School for the Arts840.8440.1650.239535
999984X7302019Bronx Charter School for the Arts840.8660.1320.223709
1000084X7302020Bronx Charter School for the Arts840.8640.1250.244409
\n", "

10001 rows × 7 columns

\n", "
" ], "text/plain": [ " dbn ay school_name district poverty_pct \\\n", "0 01M015 2016 P.S. 015 Roberto Clemente 1 0.854 \n", "1 01M015 2017 P.S. 015 Roberto Clemente 1 0.847 \n", "2 01M015 2018 P.S. 015 Roberto Clemente 1 0.845 \n", "3 01M015 2019 P.S. 015 Roberto Clemente 1 0.816 \n", "4 01M015 2020 P.S. 015 Roberto Clemente 1 0.819 \n", "... ... ... ... ... ... \n", "9996 84X730 2016 Bronx Charter School for the Arts 84 0.734 \n", "9997 84X730 2017 Bronx Charter School for the Arts 84 0.822 \n", "9998 84X730 2018 Bronx Charter School for the Arts 84 0.844 \n", "9999 84X730 2019 Bronx Charter School for the Arts 84 0.866 \n", "10000 84X730 2020 Bronx Charter School for the Arts 84 0.864 \n", "\n", " ell_pct swd_pct \n", "0 0.067 0.287000 \n", "1 0.042 0.258000 \n", "2 0.046 0.224000 \n", "3 0.089 0.242000 \n", "4 0.109 0.223000 \n", "... ... ... \n", "9996 0.159 0.209375 \n", "9997 0.182 0.216561 \n", "9998 0.165 0.239535 \n", "9999 0.132 0.223709 \n", "10000 0.125 0.244409 \n", "\n", "[10001 rows x 7 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we already saw how to get a subset of columns\n", "# let's get just these columns from out data\n", "cols = [\"dbn\", \"ay\", \"school_name\", \"district\", \"poverty_pct\", \"ell_pct\", \"swd_pct\"]\n", "df = df[cols]\n", "df" ] }, { "cell_type": "markdown", "id": "af71b571", "metadata": {}, "source": [ "Selecting just one year\n", "---------------------------------\n", "We can see that our data contains multiple years of data for each school.\n", "We can filter or query the data to just get a single year.\n", "\n", "To do this, we will use the slice notation similary to above, however,\n", "instead of a list of columns, we have a Boolean expression using\n", "one of the columns in our data set.\n", "\n", "`df[df.ay == 2020]` returns only the rows where `ay` equals `2020`\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "4fe49f51", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dbnayschool_namedistrictpoverty_pctell_pctswd_pct
401M0152020P.S. 015 Roberto Clemente10.8190.1090.223
901M0192020P.S. 019 Asher Levy10.7120.0420.392
1401M0202020P.S. 020 Anna Silver10.7090.1190.218
1901M0342020P.S. 034 Franklin D. Roosevelt10.9600.0620.392
2401M0632020The STAR Academy - P.S.6310.7690.0140.279
\n", "
" ], "text/plain": [ " dbn ay school_name district poverty_pct \\\n", "4 01M015 2020 P.S. 015 Roberto Clemente 1 0.819 \n", "9 01M019 2020 P.S. 019 Asher Levy 1 0.712 \n", "14 01M020 2020 P.S. 020 Anna Silver 1 0.709 \n", "19 01M034 2020 P.S. 034 Franklin D. Roosevelt 1 0.960 \n", "24 01M063 2020 The STAR Academy - P.S.63 1 0.769 \n", "\n", " ell_pct swd_pct \n", "4 0.109 0.223 \n", "9 0.042 0.392 \n", "14 0.119 0.218 \n", "19 0.062 0.392 \n", "24 0.014 0.279 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ay_2020 = df[df.ay == 2020]\n", "ay_2020.head()" ] }, { "cell_type": "markdown", "id": "46fac74e", "metadata": {}, "source": [ "Compound expressions\n", "----------------------------------\n", "We can use Boolean operators to make more complex queries. \n", "`pandas` uses the `&` operator for Boolean **and**\n", "and the `|` operator for Boolean **or**. Note that you should wrap\n", "your equality expressions inside of parenethesis.\n", "\n", "To write bug-free code, we do not recommend mixing `and` and `or`\n", "clauses in the same query.\n", "\n", "Below are two examples:\n", "\n", "- `df[(df.district == 13) & (df.ay == 2020)]`\n", " find data forschools in district 13 _and_ academic year 2020-21\n", "- `df[(df.poverty_pct > .9) | (df.ell_pct >= .4)]`\n", " find school data where the school's poverty percent is greater than 90% _or_\n", " the percent of studnts classified as English Language Learners is greater\n", " than or equal to 40%\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "df67708e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dbnayschool_namedistrictpoverty_pctell_pctswd_pct
155613K8692020District 13 PRE-K Center130.0760.0000.015
212913K9152020I.S. 915130.3090.0230.201
213013K9152020I.S. 915130.3090.0230.201
213113K9152020I.S. 915130.3090.0230.201
213213K9152020I.S. 915130.3090.0230.201
\n", "
" ], "text/plain": [ " dbn ay school_name district poverty_pct ell_pct \\\n", "1556 13K869 2020 District 13 PRE-K Center 13 0.076 0.000 \n", "2129 13K915 2020 I.S. 915 13 0.309 0.023 \n", "2130 13K915 2020 I.S. 915 13 0.309 0.023 \n", "2131 13K915 2020 I.S. 915 13 0.309 0.023 \n", "2132 13K915 2020 I.S. 915 13 0.309 0.023 \n", "\n", " swd_pct \n", "1556 0.015 \n", "2129 0.201 \n", "2130 0.201 \n", "2131 0.201 \n", "2132 0.201 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[(df.district == 13) & (df.ay == 2020)].head()" ] }, { "cell_type": "code", "execution_count": 5, "id": "428977ec", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dbnayschool_namedistrictpoverty_pctell_pctswd_pct
1501M0342016P.S. 034 Franklin D. Roosevelt10.960.0770.371
1601M0342017P.S. 034 Franklin D. Roosevelt10.960.0750.372
1701M0342018P.S. 034 Franklin D. Roosevelt10.960.0720.384
1801M0342019P.S. 034 Franklin D. Roosevelt10.960.0570.395
1901M0342020P.S. 034 Franklin D. Roosevelt10.960.0620.392
\n", "
" ], "text/plain": [ " dbn ay school_name district poverty_pct \\\n", "15 01M034 2016 P.S. 034 Franklin D. Roosevelt 1 0.96 \n", "16 01M034 2017 P.S. 034 Franklin D. Roosevelt 1 0.96 \n", "17 01M034 2018 P.S. 034 Franklin D. Roosevelt 1 0.96 \n", "18 01M034 2019 P.S. 034 Franklin D. Roosevelt 1 0.96 \n", "19 01M034 2020 P.S. 034 Franklin D. Roosevelt 1 0.96 \n", "\n", " ell_pct swd_pct \n", "15 0.077 0.371 \n", "16 0.075 0.372 \n", "17 0.072 0.384 \n", "18 0.057 0.395 \n", "19 0.062 0.392 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[(df.poverty_pct > .9) | (df.ell_pct >= .4)].head()" ] }, { "cell_type": "code", "execution_count": 6, "id": "8042c44a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dbnayschool_namedistrictpoverty_pctell_pctswd_pct
1901M0342020P.S. 034 Franklin D. Roosevelt10.9600.0620.392
2901M0642020P.S. 064 Robert Simon10.9140.0180.264
3901M1342020P.S. 134 Henrietta Szold10.9490.0550.419
4401M1402020P.S. 140 Nathan Straus10.9500.0730.372
4901M1422020P.S. 142 Amalia Castro10.9460.0660.284
\n", "
" ], "text/plain": [ " dbn ay school_name district poverty_pct \\\n", "19 01M034 2020 P.S. 034 Franklin D. Roosevelt 1 0.960 \n", "29 01M064 2020 P.S. 064 Robert Simon 1 0.914 \n", "39 01M134 2020 P.S. 134 Henrietta Szold 1 0.949 \n", "44 01M140 2020 P.S. 140 Nathan Straus 1 0.950 \n", "49 01M142 2020 P.S. 142 Amalia Castro 1 0.946 \n", "\n", " ell_pct swd_pct \n", "19 0.062 0.392 \n", "29 0.018 0.264 \n", "39 0.055 0.419 \n", "44 0.073 0.372 \n", "49 0.066 0.284 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# this works, but it is confusing and error prone\n", "df[(df.ay == 2020) & ((df.poverty_pct > .9) | (df.ell_pct >= .4))].head()\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "762115c0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dbnayschool_namedistrictpoverty_pctell_pctswd_pct
1901M0342020P.S. 034 Franklin D. Roosevelt10.9600.0620.392
2901M0642020P.S. 064 Robert Simon10.9140.0180.264
3901M1342020P.S. 134 Henrietta Szold10.9490.0550.419
4401M1402020P.S. 140 Nathan Straus10.9500.0730.372
4901M1422020P.S. 142 Amalia Castro10.9460.0660.284
\n", "
" ], "text/plain": [ " dbn ay school_name district poverty_pct \\\n", "19 01M034 2020 P.S. 034 Franklin D. Roosevelt 1 0.960 \n", "29 01M064 2020 P.S. 064 Robert Simon 1 0.914 \n", "39 01M134 2020 P.S. 134 Henrietta Szold 1 0.949 \n", "44 01M140 2020 P.S. 140 Nathan Straus 1 0.950 \n", "49 01M142 2020 P.S. 142 Amalia Castro 1 0.946 \n", "\n", " ell_pct swd_pct \n", "19 0.062 0.392 \n", "29 0.018 0.264 \n", "39 0.055 0.419 \n", "44 0.073 0.372 \n", "49 0.066 0.284 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# rather than mixing | and &, we can write our code more clearly in two lines\n", "data = df[df.ay == 2020]\n", "data = data[(data.poverty_pct > .9) | (data.ell_pct >= .4)]\n", "data.head()" ] }, { "cell_type": "markdown", "id": "194e8ede", "metadata": {}, "source": [ "Finding a subset of data\n", "-----------------------------------\n", "`DataFrame` has a usefule function called `isin()` that lets us filter data on a `Series`\n", "if the value is in a set of values. First, we'll do a simple example where we find\n", "all of the schools in a district `[3,8,15]`" ] }, { "cell_type": "code", "execution_count": 8, "id": "e695ed78", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dbndistrictschool_name
88203M0093P.S. 009 Sarah Anderson
88303M0093P.S. 009 Sarah Anderson
88403M0093P.S. 009 Sarah Anderson
88503M0093P.S. 009 Sarah Anderson
88603M0093P.S. 009 Sarah Anderson
............
214208X5628Blueprint Middle School
214308X5838P.S. 583
214408X5838P.S. 583
214508X5838P.S. 583
214608X5838P.S. 583
\n", "

675 rows × 3 columns

\n", "
" ], "text/plain": [ " dbn district school_name\n", "882 03M009 3 P.S. 009 Sarah Anderson\n", "883 03M009 3 P.S. 009 Sarah Anderson\n", "884 03M009 3 P.S. 009 Sarah Anderson\n", "885 03M009 3 P.S. 009 Sarah Anderson\n", "886 03M009 3 P.S. 009 Sarah Anderson\n", "... ... ... ...\n", "2142 08X562 8 Blueprint Middle School\n", "2143 08X583 8 P.S. 583\n", "2144 08X583 8 P.S. 583\n", "2145 08X583 8 P.S. 583\n", "2146 08X583 8 P.S. 583\n", "\n", "[675 rows x 3 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "districts = [3, 8, 5]\n", "data = df[df.district.isin(districts)]\n", "data[[\"dbn\", \"district\", \"school_name\"]]" ] }, { "cell_type": "markdown", "id": "5ac71eb3", "metadata": {}, "source": [ "Using `drop_duplicates`\n", "-----------------------------------------------------\n", "In the display of the table above, since we chose not to show all of the columns, we have repeated columns. Sometimes we will want just the unique results. We saw that `unique()` works for a `Series`. It doesn't work for a dataframe, however.\n", "\n", "Luckily, we have `drop_duplicates()` which serves a similar function...see:\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "3af8ba90", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dbndistrictschool_name
88203M0093P.S. 009 Sarah Anderson
88703M0543J.H.S. 054 Booker T. Washington
89203M0753P.S. 075 Emily Dickinson
89703M0763P.S. 076 A. Philip Randolph
90203M0843P.S. 084 Lillian Weber
............
211908X5588Westchester Square Academy
212408X5598School for Tourism and Hospitality
213308X5618Bronx Compass High School
213808X5628Blueprint Middle School
214308X5838P.S. 583
\n", "

124 rows × 3 columns

\n", "
" ], "text/plain": [ " dbn district school_name\n", "882 03M009 3 P.S. 009 Sarah Anderson\n", "887 03M054 3 J.H.S. 054 Booker T. Washington\n", "892 03M075 3 P.S. 075 Emily Dickinson\n", "897 03M076 3 P.S. 076 A. Philip Randolph\n", "902 03M084 3 P.S. 084 Lillian Weber\n", "... ... ... ...\n", "2119 08X558 8 Westchester Square Academy\n", "2124 08X559 8 School for Tourism and Hospitality\n", "2133 08X561 8 Bronx Compass High School\n", "2138 08X562 8 Blueprint Middle School\n", "2143 08X583 8 P.S. 583\n", "\n", "[124 rows x 3 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[[\"dbn\", \"district\", \"school_name\"]].drop_duplicates()" ] }, { "cell_type": "markdown", "id": "9f4ad0e5", "metadata": {}, "source": [ "Filtering missing data\n", "-------------------------------\n", "You will not always have complete data, and sometimes you need to find rows with missing records. You may want to update them with some default or derived values, or, you may simpley want to exclude them from your data.\n", "\n", "In this example, we will use `nycschools` to load the grade 3-8 ELA NYS exams. We will then filter out all of the rows that don't have a score. Some rows are null because not enough students took the exam. Others may be missing data because their students were exempt from the exam or other reasons." ] }, { "cell_type": "code", "execution_count": 10, "id": "a8270f1c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 430,025 total records in ela\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dbngradecategorynumber_testedmean_scale_scorelevel_1_nlevel_1_pctlevel_2_nlevel_2_pctlevel_3_nlevel_3_pctlevel_4_nlevel_4_pctlevel_3_4_nlevel_3_4_pcttest_yearaycharterschool_name
001M0153All Students27289.29629514.00.51851911.00.4074072.00.0740740.00.02.00.074074201320120NaN
701M0154All Students20277.6499948.00.40000011.00.5500001.00.0500000.00.01.00.050000201320120NaN
1401M0155All Students24283.95834412.00.50000011.00.4583331.00.0416670.00.01.00.041667201320120NaN
2101M015All GradesAll Students71284.21127334.00.47887333.00.4647894.00.0563380.00.04.00.056338201320120NaN
3282601M0153Not SWD19287.15789811.00.5789478.00.4210530.00.0000000.00.00.00.000000201320120NaN
\n", "
" ], "text/plain": [ " dbn grade category number_tested mean_scale_score \\\n", "0 01M015 3 All Students 27 289.296295 \n", "7 01M015 4 All Students 20 277.649994 \n", "14 01M015 5 All Students 24 283.958344 \n", "21 01M015 All Grades All Students 71 284.211273 \n", "32826 01M015 3 Not SWD 19 287.157898 \n", "\n", " level_1_n level_1_pct level_2_n level_2_pct level_3_n level_3_pct \\\n", "0 14.0 0.518519 11.0 0.407407 2.0 0.074074 \n", "7 8.0 0.400000 11.0 0.550000 1.0 0.050000 \n", "14 12.0 0.500000 11.0 0.458333 1.0 0.041667 \n", "21 34.0 0.478873 33.0 0.464789 4.0 0.056338 \n", "32826 11.0 0.578947 8.0 0.421053 0.0 0.000000 \n", "\n", " level_4_n level_4_pct level_3_4_n level_3_4_pct test_year ay \\\n", "0 0.0 0.0 2.0 0.074074 2013 2012 \n", "7 0.0 0.0 1.0 0.050000 2013 2012 \n", "14 0.0 0.0 1.0 0.041667 2013 2012 \n", "21 0.0 0.0 4.0 0.056338 2013 2012 \n", "32826 0.0 0.0 0.0 0.000000 2013 2012 \n", "\n", " charter school_name \n", "0 0 NaN \n", "7 0 NaN \n", "14 0 NaN \n", "21 0 NaN \n", "32826 0 NaN " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from nycschools import exams\n", "ela = exams.load_ela()\n", "print(f\"There are {len(ela):,} total records in ela\")\n", "ela.head()" ] }, { "cell_type": "code", "execution_count": 11, "id": "639abb57", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 90,576 records in ela with null values for mean_scale_score. NaN means 'Not a Number'\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dbngradecategorynumber_testedmean_scale_scorelevel_1_nlevel_1_pctlevel_2_nlevel_2_pctlevel_3_nlevel_3_pctlevel_4_nlevel_4_pctlevel_3_4_nlevel_3_4_pcttest_yearaycharterschool_name
9798901M0153Asian1NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN201320120NaN
9799501M0154Asian2NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN201320120NaN
9800101M0155Asian2NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN201320120NaN
9800701M015All GradesAsian5NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN201320120NaN
9801401M0153Black11NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN201320120NaN
............................................................
34397732K5626Not Econ Disadv3NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN201920180NaN
42766684M4815All Students1NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN201320121DEMOCRACY PREP HARLEM CHARTER SCHOOL
42767584M4817All Students110NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN201320121DEMOCRACY PREP HARLEM CHARTER SCHOOL
42913984X3945All Students31NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN201420131MOTT HAVEN ACADEMY CHARTER SCHOOL
42914584X3946All Students1NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN201420131MOTT HAVEN ACADEMY CHARTER SCHOOL
\n", "

90576 rows × 19 columns

\n", "
" ], "text/plain": [ " dbn grade category number_tested mean_scale_score \\\n", "97989 01M015 3 Asian 1 NaN \n", "97995 01M015 4 Asian 2 NaN \n", "98001 01M015 5 Asian 2 NaN \n", "98007 01M015 All Grades Asian 5 NaN \n", "98014 01M015 3 Black 11 NaN \n", "... ... ... ... ... ... \n", "343977 32K562 6 Not Econ Disadv 3 NaN \n", "427666 84M481 5 All Students 1 NaN \n", "427675 84M481 7 All Students 110 NaN \n", "429139 84X394 5 All Students 31 NaN \n", "429145 84X394 6 All Students 1 NaN \n", "\n", " level_1_n level_1_pct level_2_n level_2_pct level_3_n \\\n", "97989 NaN NaN NaN NaN NaN \n", "97995 NaN NaN NaN NaN NaN \n", "98001 NaN NaN NaN NaN NaN \n", "98007 NaN NaN NaN NaN NaN \n", "98014 NaN NaN NaN NaN NaN \n", "... ... ... ... ... ... \n", "343977 NaN NaN NaN NaN NaN \n", "427666 NaN NaN NaN NaN NaN \n", "427675 NaN NaN NaN NaN NaN \n", "429139 NaN NaN NaN NaN NaN \n", "429145 NaN NaN NaN NaN NaN \n", "\n", " level_3_pct level_4_n level_4_pct level_3_4_n level_3_4_pct \\\n", "97989 NaN NaN NaN NaN NaN \n", "97995 NaN NaN NaN NaN NaN \n", "98001 NaN NaN NaN NaN NaN \n", "98007 NaN NaN NaN NaN NaN \n", "98014 NaN NaN NaN NaN NaN \n", "... ... ... ... ... ... \n", "343977 NaN NaN NaN NaN NaN \n", "427666 NaN NaN NaN NaN NaN \n", "427675 NaN NaN NaN NaN NaN \n", "429139 NaN NaN NaN NaN NaN \n", "429145 NaN NaN NaN NaN NaN \n", "\n", " test_year ay charter school_name \n", "97989 2013 2012 0 NaN \n", "97995 2013 2012 0 NaN \n", "98001 2013 2012 0 NaN \n", "98007 2013 2012 0 NaN \n", "98014 2013 2012 0 NaN \n", "... ... ... ... ... \n", "343977 2019 2018 0 NaN \n", "427666 2013 2012 1 DEMOCRACY PREP HARLEM CHARTER SCHOOL \n", "427675 2013 2012 1 DEMOCRACY PREP HARLEM CHARTER SCHOOL \n", "429139 2014 2013 1 MOTT HAVEN ACADEMY CHARTER SCHOOL \n", "429145 2014 2013 1 MOTT HAVEN ACADEMY CHARTER SCHOOL \n", "\n", "[90576 rows x 19 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "missing = ela[ela.mean_scale_score.isnull()]\n", "print(f\"There are {len(missing):,} records in ela with null values for mean_scale_score. NaN means 'Not a Number'\")\n", "\n", "missing" ] } ], "metadata": { "kernelspec": { "display_name": "school-data", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "vscode": { "interpreter": { "hash": "c853444e20c489e5b96d8e1a4533affead1d94f1ba40ff9ef08cffb9c8ee794e" } } }, "nbformat": 4, "nbformat_minor": 5 }