ANOVA and OLS Regression#

This notebook combines school demographic data and New York State ELA test scores to examine the factors that predict test scores. We first run a t-test to check for a statistically significant difference between the scores of White students and Black students.

We then run an analysis of variance (ANOVA) across the 4 main ethnic/racial groups to see whether their outcomes differ significantly. We run an OLS regression on these groups with “All Students” as a baseline reference category and display the ANOVA table and regression summary.

Finally, we run a different OLS regression to examine the school demographic factors and their effect on test scores.

In this notebook:

  • t-test (review)

  • ANOVA (analysis of variance)

  • OLS (ordinary least squares) regression

# import the libraries used in this notebook
import pandas as pd
import numpy as np
import scipy.stats  # explicit submodule import so scipy.stats.ttest_ind / f_oneway resolve

import pingouin as pg

import statsmodels.api as sm

from IPython.display import Markdown as md

from nycschools import schools, exams
# load the demographic data and merge it with the ELA data
df = schools.load_school_demographics()

# load the data from the csv file
ela = exams.load_ela()


#drop the rows with NaN (where the pop is too small to report)
ela = ela[ela["mean_scale_score"].notnull()]
df = df.merge(ela, how="inner", on=["dbn", "ay"])

# for this analysis we only look at grade 8 scores for the 2018-19 school year,
# the last pre-covid year (rows with missing scores were already dropped above)
df = df[(df.grade == '8') & (df.ay == 2018)]

# create 5 groups as independent data frames

all_students = df[df["category"] == "All Students"][["dbn", "mean_scale_score"]]
black = df[df["category"] == "Black"][["dbn", "mean_scale_score"]]
white = df[df["category"] == "White"][["dbn", "mean_scale_score"]]
hispanic = df[df["category"] == "Hispanic"][["dbn", "mean_scale_score"]]
asian = df[df["category"] == "Asian"][["dbn", "mean_scale_score"]]

t-test: White and Black students#

Before running the ANOVA and further analysis, we run a t-test between Black and White students to determine whether there is a significant difference between their average test scores.

# calculate the mean test score and standard deviation for each group
# string names select pandas' own mean and sample standard deviation (ddof=1)
mean_std = df.groupby('category').agg(Mean=('mean_scale_score', 'mean'), STD=('mean_scale_score', 'std'))
display(md("**Mean average and standard deviation of test scores for each group.**"))
display(mean_std)


# run a t-test to see if there is a statistical difference between white and black student scores
t = scipy.stats.ttest_ind(white["mean_scale_score"],black["mean_scale_score"])

display(md(f"""
**T-Test results** comparing school averages of 
White (`n={white["dbn"].count()}`) and Black (`n={black["dbn"].count()}`)
students in 8th grade ELA scores for the 2018-19 academic year.

- White students: M={white["mean_scale_score"].mean():.04f}, SD={white["mean_scale_score"].std():.04f}
- Black students: M={black["mean_scale_score"].mean():.04f}, SD={black["mean_scale_score"].std():.04f}
- T-score: {round(t.statistic, 4)}, p-val: {round(t.pvalue, 4)}

`n` values report the number of schools observed, not the number of test takers. Further analysis
will report on the t-test for weighted means that account for school size.

We see that there is a statistically significant difference in test scores between the groups.
"""))

Mean average and standard deviation of test scores for each group.

category               Mean        STD
All Students     601.561093   9.266628
Asian            610.011383  10.234169
Black            598.277534   7.550302
Current ELL      577.753650   6.179677
Econ Disadv      601.216605   8.690372
Ever ELL         604.167463   6.260445
Female           603.935732   9.382392
Hispanic         598.379582   8.639729
Male             597.740034   9.912043
Never ELL        602.036189   8.229933
Not Econ Disadv  605.578228   9.724134
Not SWD          603.212482   9.386994
SWD              587.826847   7.421250
White            607.967339  11.660535

T-Test results comparing school averages of White (n=191) and Black (n=348) students in 8th grade ELA scores for the 2018-19 academic year.

  • White students: M=607.9673, SD=11.6605

  • Black students: M=598.2775, SD=7.5503

  • T-score: 11.675, p-val: 0.0

n values report the number of schools observed, not the number of test takers. Further analysis will report on the t-test for weighted means that account for school size.

We see that there is a statistically significant difference in test scores between the groups.
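
The two groups have noticeably different standard deviations (11.66 vs. 7.55) and sample sizes, which is the situation Welch's t-test is designed for. As a rough check, the sketch below recomputes the test from the rounded summary statistics reported above using `scipy.stats.ttest_ind_from_stats`, first with pooled variances (the default behaviour of `ttest_ind`) and then with `equal_var=False`. It works from the printed values rather than the school-level data, so the figures are approximate.

```python
from scipy import stats

# summary statistics copied from the t-test results above (rounded)
white_mean, white_sd, white_n = 607.9673, 11.6605, 191
black_mean, black_sd, black_n = 598.2775, 7.5503, 348

# pooled-variance (Student) t-test: the default behaviour of scipy.stats.ttest_ind
pooled = stats.ttest_ind_from_stats(
    white_mean, white_sd, white_n,
    black_mean, black_sd, black_n,
    equal_var=True,
)

# Welch's t-test: does not assume the two groups share a variance
welch = stats.ttest_ind_from_stats(
    white_mean, white_sd, white_n,
    black_mean, black_sd, black_n,
    equal_var=False,
)

print(f"pooled: t={pooled.statistic:.3f}, p={pooled.pvalue:.2e}")
print(f"welch:  t={welch.statistic:.3f}, p={welch.pvalue:.2e}")
```

The pooled statistic should land close to the 11.675 reported above. The Welch statistic comes out somewhat smaller here, because the smaller group has the larger variance, but the difference between the groups remains far from zero.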

ANOVA & f-values#

In the example below we calculate the F-statistic to test whether school-level test scores differ significantly across the racial/ethnic groups of the test takers. We compare the four main groups in NYC Schools: Asian, Black, Hispanic, and White.

# run a one way anova to test if there is significant difference between the
# average test scores at the school level of 4 different racial/ethnic groups

fvalue, pvalue = scipy.stats.f_oneway(
    asian["mean_scale_score"], 
    black["mean_scale_score"],
    hispanic["mean_scale_score"],
    white["mean_scale_score"])


results = f"""
A **one-way between subjects ANOVA** was conducted to compare the effect of 
racial/ethnic group on the test score for 8th grade NYS ELA exams
during the 2018-2019 academic year.

The four groups in the test are: Asian (n={len(asian)}), Black (n={len(black)}), Latinx (n={len(hispanic)}),
and White (n={len(white)}) students.

There was a significant effect of racial/ethnic group on test score at
the p<.001 level for the four conditions, [p={pvalue:.04f}, F={fvalue:.04f}].

_`n` values are the number of schools reported, not the number of test takers in each group_
"""
md(results)

A one-way between subjects ANOVA was conducted to compare the effect of racial/ethnic group on the test score for 8th grade NYS ELA exams during the 2018-2019 academic year.

The four groups in the test are: Asian (n=187), Black (n=348), Latinx (n=404), and White (n=191) students.

There was a significant effect of racial/ethnic group on test score at the p<.001 level for the four conditions, [p=0.0000, F=113.9984].

n values are the number of schools reported, not the number of test takers in each group
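
The F-statistic is the ratio of between-group variance to within-group variance. To make that concrete, the sketch below computes it by hand for three small made-up groups (the numbers are illustrative, not taken from the school data) and compares the result with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# illustrative scores for three hypothetical groups (not real data)
groups = [
    np.array([600.0, 605.0, 610.0, 598.0]),
    np.array([590.0, 595.0, 601.0, 597.0, 593.0]),
    np.array([612.0, 608.0, 615.0]),
]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
k = len(groups)        # number of groups
n = len(all_scores)    # total number of observations

# between-group sum of squares: distance of each group mean from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n - k)
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n - k)  # upper tail of the F distribution

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F={f_manual:.4f}, p={p_manual:.4f}")
print(f"scipy:  F={f_scipy:.4f}, p={p_scipy:.4f}")
```

A large F means the group means are spread out relative to the variation inside each group, which is what the school-level result above reports.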

Pingouin results#

When we ran the t-test we saw that pingouin offers some useful additional features. The pingouin API (its calling syntax) differs from scipy's. Before we run the test, we create a new single dataframe with just the columns we care about. We tell the function which column is the dependent variable and which specifies the groups.

Below we show the pingouin ANOVA.

# pg.anova() takes one dataframe with a dependent-variable column and a grouping column
data = df.copy()
data = data[data.category.isin(["Asian", "Black", "Hispanic", "White"])]
data = data[["category", "mean_scale_score"]]

pg.anova(dv='mean_scale_score', between='category', data=data, detailed=True)
     Source            SS    DF           MS           F         p-unc       np2
0  category  28908.194891     3  9636.064964  113.998413  1.832508e-64  0.232968
1    Within  95178.598593  1126    84.528063         NaN           NaN       NaN
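
Each column of the detailed table can be derived from the sums of squares (`SS`) and degrees of freedom (`DF`). The short sketch below recomputes `MS`, `F`, `p-unc` and `np2` from the values printed above; for a one-way design, `np2` (partial eta-squared) is the between-group sum of squares divided by the total.

```python
from scipy import stats

# sums of squares and degrees of freedom copied from the pingouin table above
ss_between, df_between = 28908.194891, 3
ss_within, df_within = 95178.598593, 1126

ms_between = ss_between / df_between   # mean square for category
ms_within = ss_within / df_within      # mean square within groups
f_value = ms_between / ms_within
p_value = stats.f.sf(f_value, df_between, df_within)
eta_sq = ss_between / (ss_between + ss_within)  # share of variance explained by group

print(f"MS: {ms_between:.4f} / {ms_within:.4f}")
print(f"F={f_value:.4f}, p={p_value:.3e}, np2={eta_sq:.6f}")
```

An `np2` of about 0.23 means racial/ethnic group accounts for roughly 23% of the variance in school-level mean scores. The degrees of freedom are also consistent with the earlier cell: 3 + 1 = 4 groups, and 1126 + 4 = 1130 observations, which matches 187 + 348 + 404 + 191.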

OLS Linear Regression#

The ANOVA tells us that there is a significant difference in test results based on racial/ethnic group. We can run a regression analysis to help us isolate the impact of different factors on mean_scale_score, our dependent variable.

For this analysis we will look at the school demographics to analyze the mean ELA test score for All Students at the school.

# first choose the "factors" from our data fields that we believe impact mean_scale_score
data = df.copy()
data = data[data.category == "All Students"]
factors = [
       'total_enrollment',
       'female_pct', 
       'asian_pct',  
       'black_pct',
       'hispanic_pct',  
       'white_pct',
       'swd_pct',  
       'ell_pct',  
       'poverty_pct',
       'eni_pct',
       'charter']

y = data['mean_scale_score']
X = data[factors]
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

model.summary()
OLS Regression Results

Dep. Variable:     mean_scale_score    R-squared:            0.699
Model:             OLS                 Adj. R-squared:       0.694
Method:            Least Squares       F-statistic:          131.4
Date:              Tue, 28 Feb 2023    Prob (F-statistic):   3.87e-154
Time:              17:41:20            Log-Likelihood:       -1929.9
No. Observations:  634                 AIC:                  3884.
Df Residuals:      622                 BIC:                  3937.
Df Model:          11
Covariance Type:   nonrobust

                        coef   std err         t   P>|t|    [0.025    0.975]
const               615.1489    10.519    58.479   0.000   594.492   635.806
total_enrollment     -0.0002     0.001    -0.283   0.777    -0.002     0.001
female_pct            8.4008     2.380     3.530   0.000     3.727    13.074
asian_pct            12.3692    11.826     1.046   0.296   -10.855    35.593
black_pct            -1.9667    11.588    -0.170   0.865   -24.723    20.790
hispanic_pct          4.0233    11.462     0.351   0.726   -18.485    26.532
white_pct             6.6385    11.235     0.591   0.555   -15.425    28.702
swd_pct             -38.2322     3.837    -9.964   0.000   -45.767   -30.697
ell_pct             -35.6202     2.406   -14.807   0.000   -40.344   -30.896
poverty_pct          -8.5422     3.808    -2.243   0.025   -16.020    -1.065
eni_pct              -2.6267     3.664    -0.717   0.474    -9.821     4.568
charter               2.1596     0.598     3.614   0.000     0.986     3.333

Omnibus:         34.045    Durbin-Watson:      1.670
Prob(Omnibus):   0.000     Jarque-Bera (JB):   69.095
Skew:            0.326     Prob(JB):           9.92e-16
Kurtosis:        4.480     Cond. No.           9.05e+04

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.05e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

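
Note [2] flags possible multicollinearity. One plausible contributor (an assumption, not something verified against the school data here) is that the four racial/ethnic share columns sum to nearly 1 for most schools, which makes them almost a linear combination of the constant term. The sketch below illustrates the effect with synthetic shares drawn from a Dirichlet distribution: dropping one share column lowers the condition number of the design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic enrollment shares for 500 hypothetical schools: four main groups
# plus a small "other" share, so the first four columns sum to almost 1
shares = rng.dirichlet([2.0, 2.0, 2.0, 2.0, 0.1], size=500)
main_shares = shares[:, :4]

const = np.ones((len(shares), 1))
X_all = np.hstack([const, main_shares])             # constant + all four shares
X_dropped = np.hstack([const, main_shares[:, :3]])  # one share left out as reference

cond_all = np.linalg.cond(X_all)
cond_dropped = np.linalg.cond(X_dropped)

print(f"condition number with all four shares: {cond_all:.1f}")
print(f"condition number with one share dropped: {cond_dropped:.1f}")
```

In the regression above this would correspond to leaving one of `asian_pct`, `black_pct`, `hispanic_pct` or `white_pct` out of `factors`, with the remaining coefficients read relative to the omitted group. The large standard errors (about 11) on those four coefficients are consistent with this explanation, although the condition number is likely also inflated by the very different scales of `total_enrollment` and the share columns.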
# we can also pull specific data from the model
# here we create our own table with the factors, coefficients and p-values
# (the [1:] slices skip the constant term added by sm.add_constant)
params = list(model.params.index[1:])
coefs = list(model.params.iloc[1:])
pvalues = list(model.pvalues.iloc[1:])

table = pd.DataFrame({"factor":params,"coef":coefs,"p-values":pvalues})
table.sort_values(by="coef")
    factor                  coef      p-values
 6  swd_pct           -38.232201  8.505317e-22
 7  ell_pct           -35.620239  1.030998e-42
 8  poverty_pct        -8.542162  2.522178e-02
 9  eni_pct            -2.626731  4.736688e-01
 3  black_pct          -1.966710  8.652873e-01
 0  total_enrollment   -0.000191  7.772975e-01
10  charter             2.159607  3.258558e-04
 4  hispanic_pct        4.023272  7.256952e-01
 5  white_pct           6.638455  5.548231e-01
 1  female_pct          8.400780  4.460913e-04
 2  asian_pct          12.369227  2.960056e-01
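
Coefficients on the `_pct` columns are easiest to read as the change in predicted score for a given change in school composition. The sketch below assumes the `_pct` columns are stored as fractions between 0 and 1 (an assumption suggested by the size of the coefficients, not confirmed here), so a 10 percentage point change corresponds to 0.10. It uses a subset of the coefficients and p-values printed above and keeps only factors with p < 0.05.

```python
# coefficients and p-values copied from the table above
results = {
    "swd_pct": (-38.232201, 8.505317e-22),
    "ell_pct": (-35.620239, 1.030998e-42),
    "poverty_pct": (-8.542162, 2.522178e-02),
    "eni_pct": (-2.626731, 4.736688e-01),
    "female_pct": (8.400780, 4.460913e-04),
    "asian_pct": (12.369227, 2.960056e-01),
}

ALPHA = 0.05
STEP = 0.10  # assumed: shares are fractions, so 0.10 = 10 percentage points

# predicted change in mean scale score for a 10-point increase in each share
effects = {
    factor: coef * STEP
    for factor, (coef, pvalue) in results.items()
    if pvalue < ALPHA
}

for factor, change in sorted(effects.items(), key=lambda item: item[1]):
    print(f"{factor}: {change:+.2f} points per 10 percentage points")
```

Under that assumption, a school with 10 percentage points more students with disabilities is predicted to score about 3.8 points lower, holding the other factors constant. `charter` is left out of the sketch because it appears to be a 0/1 indicator, so its coefficient (about 2.16) would be read directly as the difference between charter and non-charter schools.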