nycschools package
Submodules#
nycschools.budgets module#
- get_galaxy_budgets()[source]#
Scrapes the ‘galaxy summary’ budget for all schools from the DOE website.
- get_galaxy_summary(dbn, ay, driver)[source]#
Gets the ‘galaxy summary’ budget for a school from the DOE website.
- Parameters
dbn (str) – The school’s DBN.
ay (int) – The school year. Currently, only the most recent school year (2022-2023) is available.
driver (selenium.webdriver.chrome.webdriver.WebDriver) – The Selenium webdriver.
- Returns
data – A single DataFrame that combines all of the budget data scraped from the web for the school specified by dbn and ay.
- Return type
pandas.DataFrame
nycschools.cep module#
nycschools.class_size module#
- get_class_22(url)[source]#
Read class size data for 2022 from the DOE InfoHub Excel file format. This data also contains pupil teacher ratios which are saved in a separate file.
- get_class_size()[source]#
Gets class size data from the web and cleans it for each year that it is available in the datasets. Currently, data is available for each year from 2009-2021, excluding the 2020-2021 school year.
- Returns
a pandas DataFrame holding class size data for all of the schools in the data portal
- Return type
DataFrame
nycschools.dataloader module#
- contains_data_files(path)[source]#
Checks to see if the specified path contains the data files required by this application.
- Parameters
path (str) – the path to check
- Returns
True if the path contains the data files required by this application, False otherwise
- Return type
bool
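A check like this can be sketched with pathlib. The file names in REQUIRED_FILES and the function name has_required_files are hypothetical placeholders; the files that the package actually looks for may differ.

```python
from pathlib import Path

# Hypothetical file names; the files the package actually requires may differ.
REQUIRED_FILES = ("school-demographics.csv", "nyc-ela-math.csv")


def has_required_files(path, required=REQUIRED_FILES):
    """Return True only if every required file exists directly under path."""
    root = Path(path)
    return root.is_dir() and all((root / name).is_file() for name in required)
```

A path that does not exist, or that is missing any one of the files, yields False.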
- download_archive(data_dir=None)[source]#
Downloads the school data archive to the local drive, saves it into data_dir, and then extracts the .7z archive. Afterwards, data_dir contains the cleaned and compiled school data files.
- Parameters
data_dir (str) – the path to the directory where the data files should be saved. If not specified, the package configuration data_dir is used.
- Returns
the path to the downloaded file
- Return type
str
- download_cache()[source]#
Downloads the data archive and saves it to the local drive. This interactive terminal program prompts the user for the location to save the data files. Once the files are downloaded and expanded, it attempts to write the path to the data files into the Python configuration environment.
- find_config_file()[source]#
Looks for virtual environment activation scripts or bash configuration files in known locations.
- Returns
the path to the configuration file or None if not found
- Return type
str
- find_data_dir(config)[source]#
Tries to find an existing data directory populated with data. The search includes a mounted Google Drive if the colab package is available and the drive is mounted in the “standard” location of /content/gdrive.
- Returns
the path to the data directory
- Return type
str
- get_venv_activate()[source]#
Finds the activation script for a running virtual environment, or returns None if not running in a venv.
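One way to locate an activation script is through the VIRTUAL_ENV environment variable, which the standard venv activate scripts set. The sketch below is an assumption about the approach, not the package's code, and venv_activate_path is a hypothetical name.

```python
import os
from pathlib import Path


def venv_activate_path(environ=os.environ):
    """Return the activate script of the active venv, or None if there is none."""
    venv = environ.get("VIRTUAL_ENV")
    if not venv:
        return None
    # POSIX venvs keep the script in bin/, Windows venvs in Scripts/.
    for sub in ("bin", "Scripts"):
        candidate = Path(venv) / sub / "activate"
        if candidate.is_file():
            return str(candidate)
    return None
```

Passing the environment as a parameter keeps the function easy to test without activating a real virtual environment.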
- main()[source]#
Show the path to the data_dir where school data is stored. To use the interactive downloader, run python -m nycschools.dataloader -d.
nycschools.datasets module#
nycschools.exams module#
- load_charter_ela(url='https://data.cityofnewyork.us/resource/sgjd-xi99.csv?$limit=1000000')[source]#
Loads the charter school ELA exam results for the “All Students” category from the NYC Open Data Portal. Columns are re-named for consistency.
- load_charter_math(url='https://data.cityofnewyork.us/resource/3xsw-bpuy.csv?$limit=1000000')[source]#
Loads the charter school Math exam results for the “All Students” category from the NYC Open Data Portal. Columns are re-named for consistency.
- load_ela()[source]#
Loads the New York State grades 3-8 ELA exam results for all categories. If a local .csv data file exists, it will return results from that file. If no local file is available, it loads the Excel data file from the NYC Data Portal and then combines the results with charter school data into a DataFrame. This can be slow.
- load_ela_excel(url='https://data.cityofnewyork.us/api/views/hvdr-xc2s/files/4db8f0e7-0150-4302-bed7-529c89efa225?download=true&filename=school-ela-results-2013-2019-(public).xlsx')[source]#
Loads NYS ELA test scores from the Excel workbook rather than the API. NYC DOE does not release the demographic breakdowns for test results via its Open Data API.
- load_math()[source]#
Loads the New York State grades 3-8 Math exam results for all categories. If a local .csv data file exists, it will return results from that file. If no local file is available, it loads the Excel data file from the NYC Data Portal and then combines the results with charter school data into a DataFrame. This can be slow.
- load_math_ela_long()[source]#
Loads a combined DataFrame with both math and ELA test results in a “long” data format, where math and ELA results appear in separate rows rather than in suffixed columns.
- load_math_ela_wide()[source]#
Loads a combined DataFrame with both math and ELA test results in a “wide” data format. All of the math result columns have the suffix _math and the ELA columns have the suffix _ela.
- load_math_excel(url='https://data.cityofnewyork.us/api/views/365g-7jtb/files/17910cb0-8a62-4037-84b5-f0b4c2b3f71f?download=true&filename=school-math-results-2013-2019-(public).xlsx')[source]#
Loads NYS Math test scores from the Excel workbook rather than the API. NYC DOE does not release the demographic breakdowns for test results via its Open Data API.
nycschools.geo module#
- get_locations(url='https://data.cityofnewyork.us/resource/wg9x-4ke6.csv?$limit=1000000')[source]#
Read school level data with many location-related columns: school x,y coords, and data about the school locations including NYS BEDS ids, census tract, and police precinct.
- get_points(geojsonurl='https://data.cityofnewyork.us/resource/a3nt-yts4.geojson?$limit=1000000')[source]#
Reads the school location points and zip codes from an Open Data Portal GeoJSON URL.
- load_districts(url='https://data.cityofnewyork.us/api/geospatial/r8nu-ymqj?method=export&format=GeoJSON')[source]#
Get geo shape file for NYC school districts, indexed by district number.
nycschools.nysed module#
- calc_all_grades(nysed)[source]#
Unlike the NYC schools data, NYSED data doesn’t include an “All Grades” category, so this function adds one with aggregate data for each school.
- fix_cols(df)[source]#
Fixes the columns in the dataframe so that each NYSED year is consistent with the other years’ columns and column name/formats match other NYC DOE columns in the data portal.
- fix_data(df)[source]#
Cleans data in the dataframe so that row-level data is consistent across test years and matches NYC math/ela data sets:
- student categories are consistent
- counts and percents are consistent
- exam category is consistent
- load_nyc_nysed()[source]#
Loads the subset of the load_nys_nysed data for schools in the New York City Department of Education school demographics data set.
- load_nys_nysed()[source]#
Loads the grades 3-8 math and ELA exam results for all schools and districts in New York State, in a long data format. This is the only data set that has demographic test results for charter schools in NYC. This data has all of the same columns as the NYC load_math_ela_long. Note that NYS does not report the same demographic categories as New York City. In particular, the two groups handle ENL students differently: NYC has 3 categories (current ell, ever ell, never ell), whereas NYS only has ELL and Non-ELL.
The NYSED data includes categories that are not part of the NYC data such as homelessness, foster care, and parents in armed services.
- load_nysed_ela_math_archives(urls=['https://data.nysed.gov/files/assessment/20-21/3-8-2020-21.zip', 'https://data.nysed.gov/files/assessment/18-19/3-8-2018-19.zip', 'https://data.nysed.gov/files/assessment/17-18/3-8-2017-18.zip', 'https://data.nysed.gov/files/assessment/16-17/3-8-2016-17.zip', 'https://data.nysed.gov/files/assessment/15-16/3-8-2015-16.zip'])[source]#
Downloads all of the zip archives listed in the urls configuration. Extracts the archives into a temp folder and reads them as dataframes. Concatenates the dataframes into a single DataFrame and returns it.
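The extract, read, and concatenate steps can be sketched with the standard library alone. This sketch assumes the archives are already on disk and contain CSV files, and it returns plain row dictionaries instead of DataFrames; read_zip_archives is a hypothetical name, and the real archives may hold other file formats.

```python
import csv
import tempfile
import zipfile
from pathlib import Path


def read_zip_archives(zip_paths):
    """Extract each archive into a temp folder and concatenate all CSV rows."""
    rows = []
    for zip_path in zip_paths:
        with tempfile.TemporaryDirectory() as tmp:
            with zipfile.ZipFile(zip_path) as archive:
                archive.extractall(tmp)
            # sorted() keeps the row order deterministic across platforms.
            for csv_path in sorted(Path(tmp).rglob("*.csv")):
                with open(csv_path, newline="", encoding="utf-8") as f:
                    rows.extend(csv.DictReader(f))
    return rows
```

Each temp folder is removed as soon as its archive has been read, so only the combined rows stay in memory.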
nycschools.schools module#
- class demo[source]#
Bases: object
The demo class bundles some common sets of column names to make it easier to work with the school demographic DataFrame
from nycschools import schools
df = schools.load_school_demographics()
basic = df[schools.demo.short_cols]
basic.head()
- core_cols = ['dbn', 'beds', 'district', 'geo_district', 'boro', 'school_name', 'ay', 'total_enrollment', 'female_n', 'female_pct', 'male_n', 'male_pct', 'asian_n', 'asian_pct', 'black_n', 'black_pct', 'hispanic_n', 'hispanic_pct', 'white_n', 'white_pct', 'swd_n', 'swd_pct', 'ell_n', 'ell_pct', 'poverty_n', 'poverty_pct', 'eni_pct']#
- default_cols = ['dbn', 'beds', 'district', 'geo_district', 'boro', 'school_name', 'short_name', 'ay', 'year', 'school_type', 'total_enrollment', 'grade_3k_pk_half_day_full', 'grade_k', 'grade_1', 'grade_2', 'grade_3', 'grade_4', 'grade_5', 'grade_6', 'grade_7', 'grade_8', 'grade_9', 'grade_10', 'grade_11', 'grade_12', 'female_n', 'female_pct', 'male_n', 'male_pct', 'asian_n', 'asian_pct', 'black_n', 'black_pct', 'hispanic_n', 'hispanic_pct', 'multi_racial_n', 'multi_racial_pct', 'native_american_n', 'native_american_pct', 'white_n', 'white_pct', 'missing_race_ethnicity_data_n', 'missing_race_ethnicity_data_pct', 'swd_n', 'swd_pct', 'ell_n', 'ell_pct', 'poverty_n', 'poverty_pct', 'eni_pct', 'clean_name', 'zip']#
- default_map = {'asian': 'asian_n', 'asian_1': 'asian_pct', 'black': 'black_n', 'black_1': 'black_pct', 'economic_need_index': 'eni_pct', 'english_language_learners': 'ell_n', 'english_language_learners_1': 'ell_pct', 'female': 'female_n', 'female_1': 'female_pct', 'hispanic': 'hispanic_n', 'hispanic_1': 'hispanic_pct', 'male': 'male_n', 'male_1': 'male_pct', 'missing_race_ethnicity_data': 'missing_race_ethnicity_data_n', 'missing_race_ethnicity_data_1': 'missing_race_ethnicity_data_pct', 'multi_racial': 'multi_racial_n', 'multi_racial_1': 'multi_racial_pct', 'native_american': 'native_american_n', 'native_american_1': 'native_american_pct', 'poverty': 'poverty_n', 'poverty_1': 'poverty_pct', 'students_with_disabilities': 'swd_n', 'students_with_disabilities_1': 'swd_pct', 'white': 'white_n', 'white_1': 'white_pct'}#
- raw_cols = ['dbn', 'school_name', 'year', 'total_enrollment', 'grade_3k_pk_half_day_full', 'grade_k', 'grade_1', 'grade_2', 'grade_3', 'grade_4', 'grade_5', 'grade_6', 'grade_7', 'grade_8', 'grade_9', 'grade_10', 'grade_11', 'grade_12', 'female', 'female_1', 'male', 'male_1', 'asian', 'asian_1', 'black', 'black_1', 'hispanic', 'hispanic_1', 'multi_racial', 'multi_racial_1', 'native_american', 'native_american_1', 'white', 'white_1', 'missing_race_ethnicity_data', 'missing_race_ethnicity_data_1', 'students_with_disabilities', 'students_with_disabilities_1', 'english_language_learners', 'english_language_learners_1', 'poverty', 'poverty_1', 'economic_need_index']#
- short_cols = ['dbn', 'school_name', 'short_name', 'clean_name', 'ay']#
- load_hs_directory(ay=2021)[source]#
Loads the NYC High School Directory data from the NYC Open Data Portal. This is a thin wrapper around pd.read_csv() and requires an internet connection. By default, the most recent directory is returned. For a different academic year, pass the ay parameter. Data is available for academic years 2013-2021. The data varies greatly from year to year, so they are not compiled into a single DataFrame.
- Parameters
ay (int, default 2021) – the academic year for the directory data
- Returns
a pandas DataFrame holding the high school directory data
- Return type
DataFrame
- load_school_demographics()[source]#
Loads the NYC school-level demographic data from the open data portal and creates a DataFrame.
Adds new columns to the data:
- short_name: the best guess for the common name of the school (e.g. PS 9) or "" if none exists
- district: the school district number [1..32, 75, 79, 84]
- boro: borough code string ["M","B","X","Q","R"]
- boro_name: string of the full borough name
- year: the academic year as an integer representing the calendar year in the fall of the school year
- white_asian: combined number of white and asian students
- non_white: total number of non-white students
- non_white_asian: total number of non-white or asian students
- black_hispanic: total number of black and hispanic students
- charter: 1 for charter schools, 0 for community public schools
- beds: the New York State BEDS code for the school
- zip: school zip code
- geo_district: the geographic school district where the school is located; will differ for schools where district is > 32
- Returns
a pandas DataFrame holding school demographic data for all of the schools in the data portal
- Return type
DataFrame
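As a rough illustration of how some of the added columns relate to the raw data: a DBN begins with a two-digit district number followed by a borough letter, and the year column is the fall calendar year. The helper names parse_dbn and fall_year are hypothetical, and the "2019-20" input format is an assumption about the raw data, not something this page confirms.

```python
def parse_dbn(dbn):
    """Split a DBN such as '01M015' into its district number and borough code."""
    return {"district": int(dbn[:2]), "boro": dbn[2]}


def fall_year(ay):
    """Return the fall calendar year of an academic year string like '2019-20'."""
    # Assumption: the first four characters hold the fall calendar year.
    return int(str(ay)[:4])
```

With a DataFrame, helpers like these could be applied to the dbn and year columns through Series.apply.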
- save_demographics(url='https://data.cityofnewyork.us/resource/vmmu-wj3w.csv?$limit=1000000')[source]#
Loads and cleans school demographic data from a NYC Open Data Portal URL. The data is joined with some location data to make it more easily merged with other data sets. A local copy of the data is saved as a .csv in the nycschools data directory.
- Parameters
url (str, default loads the url from config.urls with the 'demographics' key) – the URL to the most recent NYC school demographics
- Returns
a pandas DataFrame holding school demographic data for all of the schools in the data portal
- Return type
DataFrame
- school_type(school)[source]#
Any school that serves middle school kids is considered a middle school here.
- search(df, qry)[source]#
Searches a DataFrame for a school using fuzzy matching.
- Parameters
df (DataFrame) – a pandas DataFrame containing school data, such as the data returned from load_school_demographics()
qry – the school name or search term to look for in the data set
- Returns
the schools that match the qry or an empty DataFrame if no matches were found
- Return type
DataFrame
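The general idea of a forgiving name search can be sketched with difflib from the standard library. This works on a plain list of names rather than a DataFrame, and the matching strategy (substring first, then closest matches) is an assumption, not necessarily what search() does; search_names is a hypothetical name.

```python
import difflib


def search_names(names, qry, cutoff=0.6):
    """Return names containing qry, or else the closest fuzzy matches."""
    q = qry.strip().lower()
    exact = [n for n in names if q in n.lower()]
    if exact:
        return exact
    # Fall back to similarity scoring when no name contains the query.
    by_lower = {n.lower(): n for n in names}
    close = difflib.get_close_matches(q, list(by_lower), n=5, cutoff=cutoff)
    return [by_lower[c] for c in close]
```

An empty list plays the role of the empty DataFrame that is returned when nothing matches.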
- short_name(row)[source]#
Attempts to guess the common “short name” for a school. For example, the full name might be “P.S. 015 Roberto Clemente”. This school is probably commonly referred to as PS 15.
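A regular expression is one way to make this kind of guess. The pattern below handles only names that start with P.S., M.S., or I.S. (with or without periods) followed by a number; it is an illustrative sketch, and both the pattern and the name guess_short_name are assumptions, not the package's implementation.

```python
import re

# Leading school-type abbreviation, optional periods, then a zero-padded number.
_SHORT_NAME = re.compile(r"\s*([PMI])\.?\s?S\.?\s*0*(\d+)", re.IGNORECASE)


def guess_short_name(school_name):
    """Return e.g. 'PS 15' for 'P.S. 015 Roberto Clemente', or '' if no match."""
    match = _SHORT_NAME.match(school_name)
    if not match:
        return ""
    return f"{match.group(1).upper()}S {match.group(2)}"
```

Names without a recognizable prefix produce an empty string.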
- str_count(row, col, enroll_col)[source]#
Generic function that takes whole-number counts represented as strings and converts them to integers. Expects data to look like ‘246’, ‘Above 95%’, or ‘Below 5%’.
Example: df["poverty"] = df.apply(lambda row: str_count(row, "poverty", "total_enrollment"), axis=1)
- str_pct(row, pct_col, enroll_col)[source]#
Generic function that takes percentages represented as strings and converts them to real numbers. Expects data to look like ‘84.33%’, ‘Above 95%’, or ‘Below 5%’.
Example: df["economic_need_index"] = df.apply(lambda row: str_pct(row, "economic_need_index", "total_enrollment"), axis=1)
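The string formats handled by str_count and str_pct can be illustrated with two standalone helpers. This sketch assumes that censored values such as ‘Above 95%’ collapse to their stated bound and that percentages become fractions between 0 and 1; the package may choose different values. The names parse_pct and parse_count are hypothetical.

```python
def parse_pct(value):
    """Convert '84.33%', 'Above 95%', or 'Below 5%' to a fraction in [0, 1]."""
    text = str(value).strip()
    # Assumption: a censored value collapses to its stated bound.
    for prefix in ("Above", "Below"):
        if text.startswith(prefix):
            text = text[len(prefix):]
    return float(text.strip().rstrip("%")) / 100


def parse_count(value, enrollment):
    """Convert '246' to 246, or estimate a count from a censored percentage."""
    text = str(value).strip()
    if "%" in text:
        # The count is hidden, so estimate it from total enrollment.
        return round(parse_pct(text) * enrollment)
    return int(text)
```

The enrollment argument matters only for censored values, which mirrors why the documented functions take an enrollment column.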
nycschools.shsat module#
nycschools.tools module#
nycschools.ui module#
- fmt_pearson(r)[source]#
Formats the Pearson’s R correlation table returned from pingouin.corr in the format r(df)={r}, p={p}. The r is rounded to 2 decimals, and p is rounded to 3 decimals.
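The output format can be illustrated with plain numbers in place of a correlation table. The helper below is a hypothetical stand-in that takes r, p, and the sample size n directly, using the standard n - 2 degrees of freedom for a Pearson correlation.

```python
def format_pearson(r, p, n):
    """Format a Pearson correlation as 'r(df)=r, p=p' with df = n - 2."""
    return f"r({n - 2})={r:.2f}, p={p:.3f}"
```

For example, format_pearson(0.5234, 0.00312, 30) gives "r(28)=0.52, p=0.003".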
Module contents#
- get_config()[source]#
Initialize the configuration settings.
- Parameters
None –
- Returns
A namespace object with the following attributes:
- data_dir (str) – The path to the data directory.
- urls (dict) – A dictionary of URLs to download data if the local cache should be re-built.
- Return type
SimpleNamespace
Notes
The location for local data files is determined by first looking for an environment variable called NYC_SCHOOLS_DATA_DIR. If this environment variable is not set, the data files are stored in a directory called school-data in the current directory. If this directory does not exist, it will be created.
To see and change these settings for your installation, run python -m nycschools.dataloader.
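The lookup order described in the notes can be sketched as follows. This is a simplified illustration, not the package's get_config(): the function name sketch_config and its parameters are hypothetical, and the urls dictionary is left empty as a placeholder.

```python
import os
from pathlib import Path
from types import SimpleNamespace


def sketch_config(environ=os.environ, cwd=None):
    """Resolve the data directory from the environment or a local default."""
    data_dir = environ.get("NYC_SCHOOLS_DATA_DIR")
    if not data_dir:
        # Fall back to ./school-data and create it if it is missing.
        data_dir = Path(cwd or Path.cwd()) / "school-data"
        data_dir.mkdir(parents=True, exist_ok=True)
    # Placeholder: the real configuration fills urls with download locations.
    return SimpleNamespace(data_dir=str(data_dir), urls={})
```

Setting NYC_SCHOOLS_DATA_DIR before starting Python is therefore enough to point the sketch at an existing data directory.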