nycschools package#

Submodules#

nycschools.budgets module#

get_galaxy_budgets()[source]#: Scrapes the ‘galaxy summary’ budget for all schools from the DOE website.

get_galaxy_summary(dbn, ay, driver)[source]#

Gets the ‘galaxy summary’ budget for a school from the DOE website.

Parameters

dbn (str) – The school’s DBN.
ay (int) – The school year. Currently, only the most recent school year (2022-2023) is available.
driver (selenium.webdriver.chrome.webdriver.WebDriver) – The Selenium webdriver.

Returns

data – A single DataFrame that combines all of the budget data scraped from the web for the school specified by dbn and ay.

Return type

pandas.DataFrame

load_galaxy_budgets()[source]#

Loads the galaxy budgets from the local cache.

Parameters: None –
Returns: data – A single DataFrame that combines all of the budget data scraped from the web for all schools in the database.
Return type: pandas.DataFrame

open_webdriver()[source]#: Opens a Selenium webdriver using the Chrome/Chromium engine for Selenium. If the environment variables CHROME_PATH and CHROMEDRIVER_PATH are set, they will be used to initialize the webdriver. Otherwise, the webdriver will be initialized using the default installation.
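The environment-variable lookup described above can be sketched with the standard library alone. This is only an illustration, not the package's code: the helper names `resolve_chrome_paths` and `use_custom_install` are hypothetical, and the rule that both variables must be set before a custom installation is used is an assumption based on the wording of the docstring.

```python
import os


def resolve_chrome_paths(env=None):
    """Return (chrome_path, chromedriver_path); each is None when unset."""
    env = os.environ if env is None else env
    # Treat empty strings the same as unset variables.
    chrome = env.get("CHROME_PATH") or None
    driver = env.get("CHROMEDRIVER_PATH") or None
    return chrome, driver


def use_custom_install(env=None):
    """Assumption: the custom paths are used only when both variables are set."""
    chrome, driver = resolve_chrome_paths(env)
    return chrome is not None and driver is not None
```

When `use_custom_install()` is false, a caller would fall back to whatever default Chrome installation Selenium can find.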

nycschools.cep module#

get_ceps()[source]#

nycschools.class_size module#

get_class_22(url)[source]#: Read class size data for 2022 from the DOE InfoHub Excel file format. This data also contains pupil-teacher ratios, which are saved in a separate file.

get_class_size()[source]#

Gets class size data from the web and cleans it for each year that is available in the datasets. Currently, data is available for each school year from 2009 to 2021, excluding the 2020-2021 school year.

Returns: a pandas DataFrame holding class size data for all of the schools in the data portal
Return type: DataFrame

get_class_size_year(ay, url)[source]#

load_class_size()[source]#

load_ptr()[source]#

nycschools.dataloader module#

contains_data_files(path)[source]#

Checks to see if the specified path contains the data files required by this application.

Parameters

path (str) – the path to check

Returns

True if the path contains the data files required by this application, False otherwise

Return type

bool
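A check of this kind might look like the sketch below. The file names in `REQUIRED_FILES` and the function name `has_required_files` are hypothetical; the package defines its own set of required files, which this page does not list.

```python
from pathlib import Path

# Hypothetical file names; the real package defines its own required set.
REQUIRED_FILES = ("school-demographics.csv", "school-ela.csv")


def has_required_files(path, required=REQUIRED_FILES):
    """Return True if `path` is a directory holding every required file."""
    root = Path(path)
    if not root.is_dir():
        return False
    return all((root / name).is_file() for name in required)
```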

download_archive(data_dir=None)[source]#

Downloads the school data archive to the local drive and saves it into data_dir then extracts the .7z archive. data_dir now contains the cleaned and compiled school data files.

Parameters: data_dir (str) – the path to the directory where the data files should be saved. If not specified, the package configuration data_dir is used.
Returns: the path to the downloaded file
Return type: str

download_cache()[source]#: Downloads the data archive and saves it to the local drive. This interactive terminal program prompts the user for the location to save the data files. Once the files are downloaded and expanded, it attempts to write the path to the data files into the Python configuration environment.

download_data()[source]#

find_config_file()[source]#

Looks for virtual environment activation scripts or bash configuration files in known locations.

Returns: the path to the configuration file or None if not found
Return type: str

find_data_dir(config)[source]#

Tries to find an existing data directory populated with data. If the colab package is available, the search also covers Google Drive, provided the drive is mounted in the “standard” location of /content/gdrive.

Returns: the path to the data directory
Return type: str

get_data_dir()[source]#

get_venv_activate()[source]#: Finds the activation script for a running virtual environment or None if not running a venv.
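One common way to locate the activation script is through the `VIRTUAL_ENV` environment variable, which the standard venv activation scripts set. The sketch below uses that approach as an assumption; the function name `guess_venv_activate` is hypothetical, and the package may detect virtual environments differently.

```python
import os
import sys
from pathlib import Path


def guess_venv_activate(env=None, platform=None):
    """Return the activation script path, or None when no venv is active."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    venv = env.get("VIRTUAL_ENV")
    if not venv:
        return None
    # venv places its scripts in Scripts\ on Windows and bin/ elsewhere.
    if platform.startswith("win"):
        parts = ("Scripts", "activate.bat")
    else:
        parts = ("bin", "activate")
    script = Path(venv).joinpath(*parts)
    # Extra guard in this sketch: also return None if the script is missing.
    return str(script) if script.is_file() else None
```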

main()[source]#: Show the path to the data_dir where school data is stored. To use the interactive downloader, run python -m nycschools.dataloader -d.

mount_colab_data_dir()[source]#: Try to mount a Google Drive directory in Colab and then search for the data directory by looking for a directory with the known name ‘nyc-schools-data’.

set_env_var(data_dir)[source]#: Attempts to set the NYC_SCHOOLS_DATA_DIR environment variable based on the user’s platform.

nycschools.datasets module#

nycschools.exams module#

charter_cols(data)[source]#

load_charter_ela(url='https://data.cityofnewyork.us/resource/sgjd-xi99.csv?$limit=1000000')[source]#: Loads the charter school ELA exam results for the “All Students” category from the NYC Open Data Portal. Columns are re-named for consistency.

load_charter_math(url='https://data.cityofnewyork.us/resource/3xsw-bpuy.csv?$limit=1000000')[source]#: Loads the charter school Math exam results for the “All Students” category from the NYC Open Data Portal. Columns are re-named for consistency.

load_charter_test(url, filename)[source]#

load_ela()[source]#: Loads the New York State grades 3-8 ELA exam results for all categories. If a local .csv data file exists, it will return results from that file. If no local file is available, it loads the Excel data file from the NYC Data Portal and then combines the results with charter school data into a DataFrame. _This can be slow_.

load_ela_excel(url='https://data.cityofnewyork.us/api/views/hvdr-xc2s/files/4db8f0e7-0150-4302-bed7-529c89efa225?download=true&filename=school-ela-results-2013-2019-(public).xlsx')[source]#: Load NYS ELA test scores from the Excel workbook rather than the API. NYC DOE fails to release the demographic breakdowns for test results via its open data api.

load_math()[source]#: Loads the New York State grades 3-8 Math exam results for all categories. If a local .csv data file exists, it will return results from that file. If no local file is available, it loads the Excel data file from the NYC Data Portal and then combines the results with charter school data into a DataFrame. _This can be slow_.

load_math_ela_long()[source]#: Load a combined DataFrame with both math and ela test results in a “long” data format, with math and ELA results in separate rows rather than in suffixed columns.

load_math_ela_wide()[source]#: Load a combined DataFrame with both math and ela test results in a “wide” data format. All of the math result columns have the suffix _math and the ELA columns have the suffix _ela.
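The “wide” layout with `_math` and `_ela` suffixes is the shape that a pandas merge with `suffixes` produces. The sketch below is not the package's implementation: the column name `mean_scale_score` and all values are made up for illustration.

```python
import pandas as pd

# Toy frames with made-up values; the column names are illustrative only.
math = pd.DataFrame({
    "dbn": ["01M015", "01M019"],
    "ay": [2019, 2019],
    "mean_scale_score": [598.0, 603.0],
})
ela = pd.DataFrame({
    "dbn": ["01M015", "01M019"],
    "ay": [2019, 2019],
    "mean_scale_score": [600.0, 605.0],
})

# Overlapping non-key columns receive the suffixes, giving one row per school and year.
wide = math.merge(ela, on=["dbn", "ay"], suffixes=("_math", "_ela"))
print(list(wide.columns))
```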

load_math_excel(url='https://data.cityofnewyork.us/api/views/365g-7jtb/files/17910cb0-8a62-4037-84b5-f0b4c2b3f71f?download=true&filename=school-math-results-2013-2019-(public).xlsx')[source]#: Load NYS Math test scores from the Excel workbook rather than the API. NYC DOE fails to release the demographic breakdowns for test results via its open data api.

load_regents()[source]#: Loads the New York State Regents exam scores for all categories. Returns a DataFrame.

load_regents_excel()[source]#

read_nys_exam_excel(url)[source]#

nycschools.geo module#

add_labels(ax, df, col, fontsize=14)[source]#

get_and_save_locations(filename='/opt/nycschools/school_locations.geojson')[source]#

get_locations(url='https://data.cityofnewyork.us/resource/wg9x-4ke6.csv?$limit=1000000')[source]#: Read school level data with many location-related columns: school x,y coords, and data about the school locations including NYS BEDS ids, census tract, and police precinct.

get_points(geojsonurl='https://data.cityofnewyork.us/resource/a3nt-yts4.geojson?$limit=1000000')[source]#: Read the school location points and zipcodes from an Open Data Portal GeoJSON URL

load_districts(url='https://data.cityofnewyork.us/api/geospatial/r8nu-ymqj?method=export&format=GeoJSON')[source]#: Get geo shape file for NYC school districts, indexed by district number.

load_school_geo_points()[source]#: Load only the school location points as a GeoDataFrame

load_school_locations()[source]#: Returns a GeoDataFrame with the school locations and location meta-data

load_zipcodes()[source]#: Load the NYC zip code boundaries as a GeoDataFrame from data_dir. Zip codes are compiled from the NYC Data Portal via the US Post Office

nycschools.nysed module#

calc_all_grades(nysed)[source]#: NYSED data doesn’t include All Grades like NYC schools, so this function adds an “All Grades” category with aggregate data for each school.

download_and_extract(url, tmp)[source]#: Download and extract the .zip archives from NYSED

fix_cols(df)[source]#: Fixes the columns in the dataframe so that each NYSED year is consistent with the other years’ columns and column name/formats match other NYC DOE columns in the data portal.

fix_data(df)[source]#: Cleans data in the dataframe so that row-level data is consistent across test years and matches NYC math/ela data sets:

- student categories are consistent
- counts and percents are consistent
- exam category is consistent

load_nyc_nysed()[source]#: Load the subset of the load_nys_nysed data for schools in the New York City Department of Education school demographics data set.

load_nys_nysed()[source]#

Load the grades 3-8 math and ela exam results for all schools and districts in New York State, in a long data format. This is the only data set that has demographic test results for charter schools in NYC. This data has all of the same columns as the NYC load_math_ela_long. Note that NYS does not report the same demographic categories as New York City. In particular, the two groups handle ENL students differently. NYC has 3 categories (current ell, ever ell, never ell), whereas NYS only has ELL and Non-ELL.

The NYSED data includes categories that are not part of the NYC data such as homelessness, foster care, and parents in armed services.

load_nysed_ela_math_archives(urls=['https://data.nysed.gov/files/assessment/20-21/3-8-2020-21.zip', 'https://data.nysed.gov/files/assessment/18-19/3-8-2018-19.zip', 'https://data.nysed.gov/files/assessment/17-18/3-8-2017-18.zip', 'https://data.nysed.gov/files/assessment/16-17/3-8-2016-17.zip', 'https://data.nysed.gov/files/assessment/15-16/3-8-2015-16.zip'])[source]#: Downloads all of the zip archives listed in the urls configuration, extracts them into a temp folder, and reads them as dataframes. Concatenates the dataframes into a single df and returns it.

map_nysed_nyc(nysed)[source]#

read_nysed_exam(filename)[source]#: Read the Excel file that has all test scores for all schools and districts in New York State.

nycschools.schools module#

clean_name(sn)[source]#: Creates a simplified school name that is easier to search/index

class demo[source]#

Bases: object

The demo class bundles some common sets of column names to make it easier to work with the school demographic DataFrame

    from nycschools import schools

    df = schools.load_school_demographics()
    basic = df[schools.demo.short_cols]
    basic.head()

core_cols = ['dbn', 'beds', 'district', 'geo_district', 'boro', 'school_name', 'ay', 'total_enrollment', 'female_n', 'female_pct', 'male_n', 'male_pct', 'asian_n', 'asian_pct', 'black_n', 'black_pct', 'hispanic_n', 'hispanic_pct', 'white_n', 'white_pct', 'swd_n', 'swd_pct', 'ell_n', 'ell_pct', 'poverty_n', 'poverty_pct', 'eni_pct']#

default_cols = ['dbn', 'beds', 'district', 'geo_district', 'boro', 'school_name', 'short_name', 'ay', 'year', 'school_type', 'total_enrollment', 'grade_3k_pk_half_day_full', 'grade_k', 'grade_1', 'grade_2', 'grade_3', 'grade_4', 'grade_5', 'grade_6', 'grade_7', 'grade_8', 'grade_9', 'grade_10', 'grade_11', 'grade_12', 'female_n', 'female_pct', 'male_n', 'male_pct', 'asian_n', 'asian_pct', 'black_n', 'black_pct', 'hispanic_n', 'hispanic_pct', 'multi_racial_n', 'multi_racial_pct', 'native_american_n', 'native_american_pct', 'white_n', 'white_pct', 'missing_race_ethnicity_data_n', 'missing_race_ethnicity_data_pct', 'swd_n', 'swd_pct', 'ell_n', 'ell_pct', 'poverty_n', 'poverty_pct', 'eni_pct', 'clean_name', 'zip']#

default_map = {'asian': 'asian_n', 'asian_1': 'asian_pct', 'black': 'black_n', 'black_1': 'black_pct', 'economic_need_index': 'eni_pct', 'english_language_learners': 'ell_n', 'english_language_learners_1': 'ell_pct', 'female': 'female_n', 'female_1': 'female_pct', 'hispanic': 'hispanic_n', 'hispanic_1': 'hispanic_pct', 'male': 'male_n', 'male_1': 'male_pct', 'missing_race_ethnicity_data': 'missing_race_ethnicity_data_n', 'missing_race_ethnicity_data_1': 'missing_race_ethnicity_data_pct', 'multi_racial': 'multi_racial_n', 'multi_racial_1': 'multi_racial_pct', 'native_american': 'native_american_n', 'native_american_1': 'native_american_pct', 'poverty': 'poverty_n', 'poverty_1': 'poverty_pct', 'students_with_disabilities': 'swd_n', 'students_with_disabilities_1': 'swd_pct', 'white': 'white_n', 'white_1': 'white_pct'}#

raw_cols = ['dbn', 'school_name', 'year', 'total_enrollment', 'grade_3k_pk_half_day_full', 'grade_k', 'grade_1', 'grade_2', 'grade_3', 'grade_4', 'grade_5', 'grade_6', 'grade_7', 'grade_8', 'grade_9', 'grade_10', 'grade_11', 'grade_12', 'female', 'female_1', 'male', 'male_1', 'asian', 'asian_1', 'black', 'black_1', 'hispanic', 'hispanic_1', 'multi_racial', 'multi_racial_1', 'native_american', 'native_american_1', 'white', 'white_1', 'missing_race_ethnicity_data', 'missing_race_ethnicity_data_1', 'students_with_disabilities', 'students_with_disabilities_1', 'english_language_learners', 'english_language_learners_1', 'poverty', 'poverty_1', 'economic_need_index']#

short_cols = ['dbn', 'school_name', 'short_name', 'clean_name', 'ay']#

join_loc_data(df)[source]#: Join NYS BEDS id, zip code, and other location data.

load_hs_directory(ay=2021)[source]#

Loads the NYC High School Directory data from the NYC Open Data Portal. This is a thin wrapper around pd.read_csv() and requires an internet connection. By default, the most recent directory is returned. For a different academic year, pass the ay parameter. Data is available for academic years 2013-2021. The data varies greatly from year to year, so they are not compiled into a single DataFrame.

Parameters: ay (int, default 2021) – the academic year for the directory data
Returns: a pandas DataFrame holding the high school directory data
Return type: DataFrame

load_school_demographics()[source]#

Loads the NYC school-level demographic data from the open data portal and creates a DataFrame.

Adds new columns to the data:

short_name: the best guess for the common name of the school (e.g. PS 9), or “” if none exists
district: the school district number [1..32, 75, 79, 84]
boro: borough code string [“M”, “B”, “X”, “Q”, “R”]
boro_name: string of the full borough name
year: the academic year as an integer representing the calendar year in the fall of the school year
white_asian: combined number of white and asian students
non_white: total number of non-white students
non_white_asian: total number of non-white or asian students
black_hispanic: total number of black and hispanic students
charter: 1 for charter schools, 0 for community public schools
beds: the New York State BEDS code for the school
zip: school zip code
geo_district: the geographic school district where the school is located; will differ for schools where district is > 32

Returns: a pandas DataFrame holding school demographic data for all of the schools in the data portal
Return type: DataFrame
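The combined race and ethnicity columns listed above could be derived from the count columns roughly as follows. The formulas are assumptions inferred from the column descriptions, not the package's code, and the enrollment numbers are made up.

```python
import pandas as pd

# Made-up counts for two schools; real data comes from load_school_demographics().
df = pd.DataFrame({
    "dbn": ["01M015", "01M019"],
    "total_enrollment": [100, 200],
    "white_n": [10, 50],
    "asian_n": [20, 30],
    "black_n": [30, 60],
    "hispanic_n": [35, 50],
})

# Assumed formulas, inferred from the documented column descriptions.
df["white_asian"] = df["white_n"] + df["asian_n"]
df["non_white"] = df["total_enrollment"] - df["white_n"]
df["non_white_asian"] = df["total_enrollment"] - df["white_asian"]
df["black_hispanic"] = df["black_n"] + df["hispanic_n"]
```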

save_demographics(url='https://data.cityofnewyork.us/resource/vmmu-wj3w.csv?$limit=1000000')[source]#

Loads and cleans school demographic data from a NYC Open Data Portal URL. The data is joined with some location data to make it more easily merged with other data sets. A local copy of the data is saved as a .csv in the nycschools data directory.

Parameters: url (str, default loads the url from config.urls with the 'demographics' key) – the URL to the most recent NYC school demographics
Returns: a pandas DataFrame holding school demographic data for all of the schools in the data portal
Return type: DataFrame

school_type(school)[source]#: Any school that serves middle school kids is considered a middle school here.

search(df, qry)[source]#

Search a DataFrame for a school using fuzzy logic.

Parameters

df (DataFrame) – a pandas DataFrame containing school data, such as the data returned from load_school_demographics()
qry (str) – the school name or search term to look for in the data set

Returns

the schools that match the qry or an empty DataFrame if no matches were found

Return type

DataFrame
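This page does not say which fuzzy-matching method search() uses. As a rough illustration of the idea only, the sketch below scores names with the standard library's `difflib` and works on a list of dictionaries rather than a DataFrame. The function name `fuzzy_search`, the `cutoff` parameter, and the substring rule are all hypothetical.

```python
import difflib


def fuzzy_search(records, qry, cutoff=0.6):
    """Return the records whose school_name loosely matches qry."""
    query = qry.lower()
    hits = []
    for record in records:
        name = record["school_name"].lower()
        score = difflib.SequenceMatcher(None, query, name).ratio()
        # Substring matches always count; otherwise require a similar spelling.
        if query in name or score >= cutoff:
            hits.append(record)
    return hits


schools = [
    {"school_name": "P.S. 015 Roberto Clemente"},
    {"school_name": "Stuyvesant High School"},
]
```

With this sketch, a misspelled query such as "Stuyvesent High School" still finds the second record, and a query with no plausible match returns an empty list.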

short_name(row)[source]#: Attempts to guess the common “short name” for a school. For example, the full name might be “P.S. 015 Roberto Clemente”. This school is probably commonly referred to as PS 15.
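A guess like this can be sketched with a regular expression. The pattern below is an assumption, not the package's logic: it only recognizes a few prefixes (P.S., I.S., M.S., J.H.S.), and the function name `guess_short_name` is hypothetical. Unlike short_name(row), it takes the name string directly.

```python
import re

# Assumed prefixes; the real function may recognize more naming patterns.
_PATTERN = re.compile(
    r"^\s*(P\.?S\.?|I\.?S\.?|M\.?S\.?|J\.?H\.?S\.?)\s*0*(\d+)",
    re.IGNORECASE,
)


def guess_short_name(school_name):
    """Return a name such as 'PS 15', or '' when no pattern matches."""
    match = _PATTERN.match(school_name)
    if match is None:
        return ""
    prefix = match.group(1).replace(".", "").upper()
    # int() drops any leading zeros that the pattern did not consume.
    return f"{prefix} {int(match.group(2))}"
```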

str_count(row, col, enroll_col)[source]#: Generic function that takes whole-number counts represented as strings and converts them to integers. Expects data to look like ‘246’, ‘Above 95%’, ‘Below 5%’.
Example: df["poverty"] = df.apply(lambda row: str_count(row, "poverty", "total_enrollment"), axis=1)

str_pct(row, pct_col, enroll_col)[source]#: Generic function that takes percentages represented as strings and converts them to real numbers. Expects data to look like ‘84.33%’, ‘Above 95%’, ‘Below 5%’.
Example: df["economic_need_index"] = df.apply(lambda row: str_pct(row, "economic_need_index", "total_enrollment"), axis=1)
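The string formats handled by str_count and str_pct can be parsed along the following lines. This is a sketch under stated assumptions, not the package's code. The helpers `parse_pct` and `parse_count` are hypothetical and take plain values instead of a row. Censored values such as ‘Above 95%’ are assumed to collapse to their stated bound, percentages are returned as fractions between 0 and 1, and a censored count is estimated from total enrollment.

```python
def parse_pct(value):
    """Convert '84.33%', 'Above 95%', or 'Below 5%' to a fraction."""
    text = str(value).strip()
    # Assumption: censored values collapse to their stated bound.
    for prefix in ("Above", "Below"):
        if text.startswith(prefix):
            text = text[len(prefix):].strip()
    return float(text.rstrip("%")) / 100


def parse_count(value, enrollment):
    """Convert '246' to 246; estimate censored counts from enrollment."""
    text = str(value).strip()
    if text.endswith("%"):
        # Assumption: a censored count is the bound applied to enrollment.
        return round(parse_pct(text) * enrollment)
    return int(text)
```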

nycschools.shsat module#

load_admission_offers()[source]#

save_administration_offers()[source]#

nycschools.tools module#

quick_merge(a, b, on='dbn')[source]#

nycschools.ui module#

commas(n)[source]#

counter()[source]#

draw_model(nodes, pnodes, node_size, edges, labels, edge_labels, colors, cmap, node_dict)[source]#

edge_label(p, r)[source]#

fmt_num(col, n)[source]#

fmt_pearson(r)[source]#: Formats the Pearson’s R correlation table returned from pingouin.corr in the format r(df)={r}, p={p}. The r is rounded to 2 decimals, and p is rounded to 3 decimals.
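The output format can be illustrated with plain numbers instead of a pingouin result table. In this sketch the function name `format_pearson` is hypothetical, and computing the degrees of freedom as n - 2 follows the usual convention for reporting Pearson's r; whether fmt_pearson derives df the same way is an assumption.

```python
def format_pearson(n, r, p):
    """Format a correlation as 'r(df)=..., p=...'."""
    dof = n - 2  # conventional degrees of freedom for Pearson's r
    return f"r({dof})={r:.2f}, p={p:.3f}"


print(format_pearson(100, 0.4567, 0.00123))
```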

fmt_table(df, col_map=None, pct_cols=[], num_cols=[])[source]#

hexmap(cmap)[source]#

infinite()[source]#

label_shapes(m, df, col, style={})[source]#: Create a function that will add the string of col to the center of each shape specified by

network_map(dv, params, coefs, pvalues)[source]#

nice_name(n)[source]#

pct(n)[source]#

plot_model(model)[source]#

popup(cols, style={'min-width': '200px'})[source]#

round_f(f, places)[source]#

ul(t)[source]#

Module contents#

get_config()[source]#

Initialize the configuration settings.

Parameters

None –

Returns

A namespace object with the following attributes:

data_dir (str) – The path to the data directory.
urls (dict) – A dictionary of URLs to download data if the local cache should be re-built.

Return type

SimpleNamespace

Notes

The location for local data files is determined by first looking for an environment variable called NYC_SCHOOLS_DATA_DIR. If this environment variable is not set, the data files are stored in a directory called school-data in the current directory. If this directory does not exist, it will be created.

To see and change these settings for your installation, run python -m nycschools.dataloader.
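The lookup order described in the notes can be sketched as follows. The function name `resolve_data_dir` and its `env` and `cwd` parameters are hypothetical and exist only to make the sketch testable. Only the fallback directory is created, which is one reading of the note above.

```python
import os
from pathlib import Path


def resolve_data_dir(env=None, cwd=None):
    """Return the data directory: NYC_SCHOOLS_DATA_DIR, else ./school-data."""
    env = os.environ if env is None else env
    base = Path.cwd() if cwd is None else Path(cwd)
    configured = env.get("NYC_SCHOOLS_DATA_DIR")
    if configured:
        return str(Path(configured))
    # No environment variable: fall back to school-data and create it if needed.
    fallback = base / "school-data"
    fallback.mkdir(parents=True, exist_ok=True)
    return str(fallback)
```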

get_version()[source]#: Returns the version of the nycschools package
