ANOVA and OLS Regression#

This notebook combines school demographic data and New York State ELA test scores to examine the factors that predict test scores. In it we run a t-test to check for a statistically significant difference between the scores of White students and Black students.

We then run an analysis of variance (ANOVA) across the four main ethnic/racial groups to see whether the differences in outcomes are statistically significant. We run an OLS regression on these groups with “All Students” as the baseline reference category and display the ANOVA table and regression summary.

Finally, we run a different OLS regression to examine the school demographic factors and their effect on test scores.

In this notebook:

t-test (review)
ANOVA (analysis of variance)
OLS (ordinary least squares) regression

# import the libraries used in this notebook
import pandas as pd
import numpy as np
import scipy.stats

import pingouin as pg

import statsmodels.api as sm

from IPython.display import Markdown as md

from nycschools import schools, exams

# load the school demographic data
df = schools.load_school_demographics()

# load the ELA exam results
ela = exams.load_ela()


# drop the rows with NaN scores (where the population is too small to report),
# then merge the ELA results with the demographics by school (dbn) and academic year (ay)
ela = ela[ela["mean_scale_score"].notnull()]
df = df.merge(ela, how="inner", on=["dbn", "ay"])

# for this analysis we only look at grade 8 scores for the 2018-19 school year,
# the last pre-covid year
df = df[df.grade == '8']
df = df[df.ay == 2018]
df = df[df.mean_scale_score.notnull()]

# create 5 groups as independent data frames

all_students = df[df["category"] == "All Students"][["dbn", "mean_scale_score"]]
black = df[df["category"] == "Black"][["dbn", "mean_scale_score"]]
white = df[df["category"] == "White"][["dbn", "mean_scale_score"]]
hispanic = df[df["category"] == "Hispanic"][["dbn", "mean_scale_score"]]
asian = df[df["category"] == "Asian"][["dbn", "mean_scale_score"]]

t-test: white and black students#

Before running the ANOVA and further analysis, we will run a t-test between Black and White students to determine if there is a significant difference between their average test score.

# calculate the mean test score and standard deviation for each group
mean_std = df.groupby('category').agg(Mean=('mean_scale_score', 'mean'), STD=('mean_scale_score', 'std'))
display(md("**Mean average and standard deviation of test scores for each group.**"))
display(mean_std)


# run a t-test to see if there is a statistical difference between white and black student scores
t = scipy.stats.ttest_ind(white["mean_scale_score"],black["mean_scale_score"])

display(md(f"""
**T-Test results** comparing school averages of
White (`n={white["dbn"].count()}`) and Black (`n={black["dbn"].count()}`)
students for 8th grade ELA scores in the 2018-19 academic year.

- White students: M={white["mean_scale_score"].mean():.04f}, SD={white["mean_scale_score"].std():.04f}
- Black students: M={black["mean_scale_score"].mean():.04f}, SD={black["mean_scale_score"].std():.04f}
- T-score: {round(t.statistic, 4)}, p-val: {round(t.pvalue, 4)}

`n` values report the number of schools observed, not the number of test takers. Further analysis
will report on the t-test for weighted means that account for school size.

We see that there is a statistically significant difference in test scores between the groups.
"""))

Mean average and standard deviation of test scores for each group.

	Mean	STD
category
All Students	601.561093	9.266628
Asian	610.011383	10.234169
Black	598.277534	7.550302
Current ELL	577.753650	6.179677
Econ Disadv	601.216605	8.690372
Ever ELL	604.167463	6.260445
Female	603.935732	9.382392
Hispanic	598.379582	8.639729
Male	597.740034	9.912043
Never ELL	602.036189	8.229933
Not Econ Disadv	605.578228	9.724134
Not SWD	603.212482	9.386994
SWD	587.826847	7.421250
White	607.967339	11.660535

T-Test results comparing school averages of White (n=191) and Black (n=348) students for 8th grade ELA scores in the 2018-19 academic year.

White students: M=607.9673, SD=11.6605
Black students: M=598.2775, SD=7.5503
T-score: 11.675, p-val: 0.0

n values report the number of schools observed, not the number of test takers. Further analysis will report on the t-test for weighted means that account for school size.

We see that there is a statistically significant difference in test scores between the groups.
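
As a check on the result above, the pooled-variance t-statistic can be reproduced from the reported summary statistics alone. The sketch below uses only the means, standard deviations, and school counts printed in the output; the helper names `pooled_t` and `welch_t` are illustrative and not part of the notebook. Because the two groups have noticeably different standard deviations, it also computes Welch's statistic, which does not assume equal variances (in scipy this corresponds to passing `equal_var=False` to `ttest_ind`).

```python
import math

# summary statistics copied from the output above
white_m, white_sd, white_n = 607.9673, 11.6605, 191
black_m, black_sd, black_n = 598.2775, 7.5503, 348


def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Student's t-statistic, assuming both groups share one variance."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / std_err


def welch_t(m1, sd1, n1, m2, sd2, n2):
    """Welch's t-statistic, which allows the group variances to differ."""
    std_err = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (m1 - m2) / std_err


t_pooled = pooled_t(white_m, white_sd, white_n, black_m, black_sd, black_n)
t_welch = welch_t(white_m, white_sd, white_n, black_m, black_sd, black_n)

print(f"pooled t: {t_pooled:.3f}")  # should agree with the T-score reported above
print(f"Welch t:  {t_welch:.3f}")
```

The Welch statistic is smaller than the pooled one here because the smaller group (White, by number of schools) has the larger variance, but both remain far from zero.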

ANOVA & f-values#

In the example below we calculate the F-statistic to see if there are significant differences in test scores based on the racial/ethnic group of the test takers. We compare the four main groups in NYC Schools: Asian, Black, Hispanic, and White.

# run a one way anova to test if there is significant difference between the
# average test scores at the school level of 4 different racial/ethnic groups

fvalue, pvalue = scipy.stats.f_oneway(
    asian["mean_scale_score"], 
    black["mean_scale_score"],
    hispanic["mean_scale_score"],
    white["mean_scale_score"])


results = f"""
A **one-way between subjects ANOVA** was conducted to compare the effect of
racial/ethnic group on the test score for 8th grade NYS ELA exams
during the 2018-2019 academic year.

The four groups in the test are: Asian (n={len(asian)}), Black (n={len(black)}), Latinx (n={len(hispanic)}),
and White (n={len(white)}) students.

There was a significant effect of racial/ethnic group on test score at
the p<.001 level for the four conditions, [p={pvalue:.04f}, F={fvalue:.04f}].

_`n` values are the number of schools reported, not the number of test takers in each group_
"""
md(results)

A one-way between subjects ANOVA was conducted to compare the effect of racial/ethnic group on the test score for 8th grade NYS ELA exams during the 2018-2019 academic year.

The four groups in the test are: Asian (n=187), Black (n=348), Latinx (n=404), and White (n=191) students.

There was a significant effect of racial/ethnic group on test score at the p<.001 level for the four conditions, [p=0.0000, F=113.9984].

n values are the number of schools reported, not the number of test takers in each group

Pingouin results#

When we ran the t-test we saw that pingouin offers some useful additional features. The pingouin API (the syntax for calling it) differs from scipy's. Before we run the test, we create a single new dataframe with just the columns we care about. We then tell the function which column is the dependent variable and which column specifies the groups.

Below we show the pingouin ANOVA.

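# pg.anova() takes one long-format dataframe rather than separate arrays
data = df.copy()
data = data[data.category.isin(["Asian", "Black", "Hispanic", "White"])]
# keep only the grouping column and the dependent variable
data = data[["category", "mean_scale_score"]]

# dv is the dependent variable; between is the column that defines the groups
pg.anova(dv='mean_scale_score', between='category', data=data, detailed=True)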

	Source	SS	DF	MS	F	p-unc	np2
0	category	28908.194891	3	9636.064964	113.998413	1.832508e-64	0.232968
1	Within	95178.598593	1126	84.528063	NaN	NaN	NaN
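
The columns of this table are related by simple arithmetic, which the sketch below reproduces from the two sums of squares and degrees of freedom shown above. The variable names are illustrative; `np2` is partial eta squared, which in a one-way design is the share of total variance attributable to `category`.

```python
# values copied from the pingouin ANOVA table above
ss_between, df_between = 28908.194891, 3
ss_within, df_within = 95178.598593, 1126

# mean square = sum of squares / degrees of freedom
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F is the ratio of between-group to within-group mean squares
f_stat = ms_between / ms_within

# partial eta squared: proportion of variance explained by the grouping
np2 = ss_between / (ss_between + ss_within)

print(f"MS between: {ms_between:.4f}")
print(f"MS within:  {ms_within:.4f}")
print(f"F:          {f_stat:.4f}")
print(f"np2:        {np2:.4f}")
```

The degrees of freedom follow from the group sizes reported earlier: 187 + 348 + 404 + 191 = 1130 school-level observations in four groups gives 4 - 1 = 3 between groups and 1130 - 4 = 1126 within groups.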

OLS Linear Regression#

The ANOVA tells us that there is a significant difference in test results based on racial/ethnic group. We can run a regression analysis to help us isolate the impact of different factors on mean_scale_score, our dependent variable.

For this analysis we will look at the school demographics to analyze the mean ELA test score for All Students at the school.

# first choose the "factors" from our data fields that we believe impact mean_scale_score
data = df.copy()
data = data[data.category == "All Students"]
factors = [
       'total_enrollment',
       'female_pct', 
       'asian_pct',  
       'black_pct',
       'hispanic_pct',  
       'white_pct',
       'swd_pct',  
       'ell_pct',  
       'poverty_pct',
       'eni_pct',
       'charter']

y = data['mean_scale_score']
X = data[factors]
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

model.summary()

OLS Regression Results
Dep. Variable:	mean_scale_score	R-squared:	0.699
Model:	OLS	Adj. R-squared:	0.694
Method:	Least Squares	F-statistic:	131.4
Date:	Tue, 28 Feb 2023	Prob (F-statistic):	3.87e-154
Time:	17:41:20	Log-Likelihood:	-1929.9
No. Observations:	634	AIC:	3884.
Df Residuals:	622	BIC:	3937.
Df Model:	11
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
const	615.1489	10.519	58.479	0.000	594.492	635.806
total_enrollment	-0.0002	0.001	-0.283	0.777	-0.002	0.001
female_pct	8.4008	2.380	3.530	0.000	3.727	13.074
asian_pct	12.3692	11.826	1.046	0.296	-10.855	35.593
black_pct	-1.9667	11.588	-0.170	0.865	-24.723	20.790
hispanic_pct	4.0233	11.462	0.351	0.726	-18.485	26.532
white_pct	6.6385	11.235	0.591	0.555	-15.425	28.702
swd_pct	-38.2322	3.837	-9.964	0.000	-45.767	-30.697
ell_pct	-35.6202	2.406	-14.807	0.000	-40.344	-30.896
poverty_pct	-8.5422	3.808	-2.243	0.025	-16.020	-1.065
eni_pct	-2.6267	3.664	-0.717	0.474	-9.821	4.568
charter	2.1596	0.598	3.614	0.000	0.986	3.333

Omnibus:	34.045	Durbin-Watson:	1.670
Prob(Omnibus):	0.000	Jarque-Bera (JB):	69.095
Skew:	0.326	Prob(JB):	9.92e-16
Kurtosis:	4.480	Cond. No.	9.05e+04

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.05e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
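
Note [2] flags possible multicollinearity. One plausible contributor, stated here as an assumption rather than a diagnosis of this model, is that the racial/ethnic percentage columns sum to nearly 1 for every school, so together they almost duplicate the constant term. That would also be consistent with the large standard errors on those four coefficients. The sketch below illustrates the effect on synthetic data (not the school data) using numpy's `np.linalg.cond`; all names and the Dirichlet parameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_schools = 500

# synthetic group shares: four large groups plus a small "other" remainder,
# so the four columns sum to slightly less than 1 in every row
shares = rng.dirichlet([2, 2, 2, 2, 0.05], size=n_schools)
four_groups = shares[:, :4]

const = np.ones((n_schools, 1))

# design matrix with a constant and all four share columns
X_all = np.hstack([const, four_groups])

# same design with one share column dropped as the reference group
X_drop_one = np.hstack([const, four_groups[:, 1:]])

cond_all = np.linalg.cond(X_all)
cond_drop_one = np.linalg.cond(X_drop_one)

print(f"condition number, all four shares:   {cond_all:.1f}")
print(f"condition number, one share dropped: {cond_drop_one:.1f}")
```

If the same pattern holds in the school data, dropping one group as a reference category would be a reasonable thing to try; the coefficients on the remaining groups would then be read relative to the dropped group.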

# we can also pull specific data from the model
# here we create our own table with the factors, coefficients and p-values
# (the [1:] slices skip the constant term)
params = list(model.params.index[1:])
coefs = list(model.params.values[1:])
pvalues = list(model.pvalues.values[1:])

table = pd.DataFrame({"factor":params,"coef":coefs,"p-values":pvalues})
table.sort_values(by="coef")

	factor	coef	p-values
6	swd_pct	-38.232201	8.505317e-22
7	ell_pct	-35.620239	1.030998e-42
8	poverty_pct	-8.542162	2.522178e-02
9	eni_pct	-2.626731	4.736688e-01
3	black_pct	-1.966710	8.652873e-01
0	total_enrollment	-0.000191	7.772975e-01
10	charter	2.159607	3.258558e-04
4	hispanic_pct	4.023272	7.256952e-01
5	white_pct	6.638455	5.548231e-01
1	female_pct	8.400780	4.460913e-04
2	asian_pct	12.369227	2.960056e-01
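
To make the coefficients easier to read, the sketch below keeps only the factors with p < 0.05 from the table above and converts each coefficient into the change in mean scale score associated with a 10 percentage point change in that factor. This assumes the `_pct` columns are stored as fractions between 0 and 1, which the notebook does not state; `charter` is left unscaled because it appears to be a 0/1 indicator. The threshold and variable names are illustrative.

```python
# (factor, coefficient, p-value) rows copied from the table above
rows = [
    ("swd_pct", -38.232201, 8.505317e-22),
    ("ell_pct", -35.620239, 1.030998e-42),
    ("poverty_pct", -8.542162, 2.522178e-02),
    ("eni_pct", -2.626731, 4.736688e-01),
    ("black_pct", -1.966710, 8.652873e-01),
    ("total_enrollment", -0.000191, 7.772975e-01),
    ("charter", 2.159607, 3.258558e-04),
    ("hispanic_pct", 4.023272, 7.256952e-01),
    ("white_pct", 6.638455, 5.548231e-01),
    ("female_pct", 8.400780, 4.460913e-04),
    ("asian_pct", 12.369227, 2.960056e-01),
]

ALPHA = 0.05  # conventional significance threshold (an assumption)

# keep only factors whose p-value falls below the threshold
significant = [(name, coef) for name, coef, p in rows if p < ALPHA]

for name, coef in significant:
    if name.endswith("_pct"):
        # assumes fractions in [0, 1]: 10 percentage points = 0.10
        print(f"{name}: {coef * 0.10:+.2f} points per 10 percentage points")
    else:
        print(f"{name}: {coef:+.2f} points (indicator variable)")
```

These are associations within this model, holding the other listed factors fixed, and the racial/ethnic percentage coefficients do not pass the threshold in this specification.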
