# ANOVA and OLS Regression

This notebook combines school demographic data and New York State ELA test scores to examine the factors that predict test scores. In it we run a t-test to check for a statistically significant difference between the scores of White students and Black students.

We then run an analysis of variance across the 4 main ethnic/racial groups to see whether the differences in outcomes are statistically significant. We run an OLS regression on these groups with “All Students” as the baseline reference category and display the ANOVA table and regression summary.

Finally, we run a different OLS regression to examine the school demographic factors and their effect on test scores.

In this notebook:

• t-test (review)

• ANOVA (analysis of variance)

• OLS (ordinary least squares) regression

# import the libraries used in this notebook
import pandas as pd
import numpy as np
import scipy.stats

import pingouin as pg

import statsmodels.api as sm

from IPython.display import Markdown as md

from nycschools import schools, exams

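# load the demographic data and merge it with the ELA data
# (the loader names below are assumed from the nycschools package; adjust if your version differs)
df = schools.load_school_demographics()
ela = exams.load_ela()

# drop the rows with NaN (where the population is too small to report)
ela = ela[ela["mean_scale_score"].notnull()]
df = df.merge(ela, how="inner", on=["dbn", "ay"])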

# for this analysis we will only look at grade 8 scores for the 2018-19 school year,
# the last pre-covid year
df = df[df.ay == 2018]
df = df[df.mean_scale_score.notnull()]

# create 5 groups as independent data frames

all_students = df[df["category"] == "All Students"][["dbn", "mean_scale_score"]]
black = df[df["category"] == "Black"][["dbn", "mean_scale_score"]]
white = df[df["category"] == "White"][["dbn", "mean_scale_score"]]
hispanic = df[df["category"] == "Hispanic"][["dbn", "mean_scale_score"]]
asian = df[df["category"] == "Asian"][["dbn", "mean_scale_score"]]


## t-test: White and Black students

Before running the ANOVA and further analysis, we run a t-test between Black and White students to determine whether there is a significant difference between their average test scores.

# calculate the mean test score and standard deviation for each group
mean_std = df.groupby("category").agg(
    Mean=("mean_scale_score", "mean"),
    STD=("mean_scale_score", "std"),
)
display(md("**Mean and standard deviation of test scores for each group.**"))
display(mean_std)

# run a t-test to see if there is a statistical difference between white and black student scores
t = scipy.stats.ttest_ind(white["mean_scale_score"], black["mean_scale_score"])

display(md(f"""
**T-test results** comparing school averages of
White (n={white["dbn"].count()}) and Black (n={black["dbn"].count()})
students' 8th grade ELA scores for the 2018-19 academic year.

- White students: M={white["mean_scale_score"].mean():.04f}, SD={white["mean_scale_score"].std():.04f}
- Black students: M={black["mean_scale_score"].mean():.04f}, SD={black["mean_scale_score"].std():.04f}
- T-score: {round(t.statistic, 4)}, p-val: {round(t.pvalue, 4)}

n values report the number of schools observed, not the number of test takers. Further analysis
will report on the t-test for weighted means that account for school size.

We see that there is a statistically significant difference in test scores between the groups.
"""))


Mean and standard deviation of test scores for each group.

| category | Mean | STD |
|---|---|---|
| All Students | 601.561093 | 9.266628 |
| Asian | 610.011383 | 10.234169 |
| Black | 598.277534 | 7.550302 |
| Current ELL | 577.753650 | 6.179677 |
| Ever ELL | 604.167463 | 6.260445 |
| Female | 603.935732 | 9.382392 |
| Hispanic | 598.379582 | 8.639729 |
| Male | 597.740034 | 9.912043 |
| Never ELL | 602.036189 | 8.229933 |
| Not SWD | 603.212482 | 9.386994 |
| SWD | 587.826847 | 7.421250 |
| White | 607.967339 | 11.660535 |

T-test results comparing school averages of White (n=191) and Black (n=348) students' 8th grade ELA scores for the 2018-19 academic year.

• White students: M=607.9673, SD=11.6605

• Black students: M=598.2775, SD=7.5503

• T-score: 11.675, p-val: 0.0

n values report the number of schools observed, not the number of test takers. Further analysis will report on the t-test for weighted means that account for school size.

We see that there is a statistically significant difference in test scores between the groups.
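The reported t-score can be cross-checked from the summary statistics alone. The sketch below is an illustration rather than part of the original analysis: it feeds the means, standard deviations, and school counts printed above into `scipy.stats.ttest_ind_from_stats`, once with the pooled-variance (Student) test that `ttest_ind` uses by default and once with Welch's test, which does not assume equal variances. It also computes Cohen's d as a rough effect size. The variable names are made up for this example.

```python
import math
from scipy import stats

# Summary statistics copied from the results above (school-level means).
white_mean, white_sd, white_n = 607.9673, 11.6605, 191
black_mean, black_sd, black_n = 598.2775, 7.5503, 348

# Student's t-test (pooled variance), the default behaviour of ttest_ind.
student = stats.ttest_ind_from_stats(
    white_mean, white_sd, white_n,
    black_mean, black_sd, black_n,
    equal_var=True,
)

# Welch's t-test, which does not assume the two groups share a variance.
welch = stats.ttest_ind_from_stats(
    white_mean, white_sd, white_n,
    black_mean, black_sd, black_n,
    equal_var=False,
)

# Cohen's d using the pooled standard deviation.
pooled_var = (
    (white_n - 1) * white_sd**2 + (black_n - 1) * black_sd**2
) / (white_n + black_n - 2)
cohens_d = (white_mean - black_mean) / math.sqrt(pooled_var)

print(f"Student t = {student.statistic:.3f}, p = {student.pvalue:.3g}")
print(f"Welch   t = {welch.statistic:.3f}, p = {welch.pvalue:.3g}")
print(f"Cohen's d = {cohens_d:.3f}")
```

Because the two groups have noticeably different standard deviations, comparing the Student and Welch results is a reasonable sanity check on the conclusion.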

## ANOVA & F-values

In the example below we calculate the F-statistic to see if there are significant differences in test scores based on the racial/ethnic group of the test takers. We compare the four main groups in NYC Schools: Asian, Black, Hispanic, and White.

# run a one way anova to test if there is significant difference between the
# average test scores at the school level of 4 different racial/ethnic groups

fvalue, pvalue = scipy.stats.f_oneway(
    asian["mean_scale_score"],
    black["mean_scale_score"],
    hispanic["mean_scale_score"],
    white["mean_scale_score"],
)

results = f"""
A **one-way between subjects ANOVA** was conducted to compare the effect of
racial/ethnic group on the test score for 8th grade NYS ELA exams
during the 2018-2019 academic year.

The four groups in the test are: Asian (n={len(asian)}), Black (n={len(black)}), Latinx (n={len(hispanic)}),
and White (n={len(white)}) students.

There was a significant effect of racial/ethnic group on test score at
the p<.001 level for the four conditions, [p={pvalue:.04f}, F={fvalue:.04f}].

_n values are the number of schools reported, not the number of test takers in each group_
"""
md(results)


A one-way between subjects ANOVA was conducted to compare the effect of racial/ethnic group on the test score for 8th grade NYS ELA exams during the 2018-2019 academic year.

The four groups in the test are: Asian (n=187), Black (n=348), Latinx (n=404), and White (n=191) students.

There was a significant effect of racial/ethnic group on test score at the p<.001 level for the four conditions, [p=0.0000, F=113.9984].

n values are the number of schools reported, not the number of test takers in each group
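To make the F-statistic less of a black box, here is a minimal sketch that computes it by hand and compares the result with `scipy.stats.f_oneway`. It runs on synthetic scores, not the notebook's data: the group means, standard deviations, and sizes are invented for illustration. F is the between-group mean square divided by the within-group mean square.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic school-level scores; means, SDs, and sizes are made up.
groups = [
    rng.normal(610, 10, size=40),
    rng.normal(598, 8, size=60),
    rng.normal(598, 9, size=70),
    rng.normal(608, 12, size=45),
]

k = len(groups)                        # number of groups
n_total = sum(len(g) for g in groups)  # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares divide each sum of squares by its degrees of freedom.
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)  # upper tail of F

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F={f_manual:.4f}, p={p_manual:.3g}")
print(f"scipy:  F={f_scipy:.4f}, p={p_scipy:.3g}")
```

The two lines should agree, which confirms that `f_oneway` performs this same decomposition.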

### Pingouin results

When we ran the t-test we saw that pingouin offers some useful additional features. Pingouin's API (the syntax for calling it) differs from scipy's. Before we run the test, we create a single new dataframe with just the columns we care about. We then tell the function which column is the dependent variable and which column specifies the groups.

Below we show the pingouin ANOVA.

# pg.anova()
data = df.copy()
data = data[data.category.isin(["Asian", "Black", "Hispanic", "White"])]
data = data[["category", "mean_scale_score"]]

pg.anova(dv="mean_scale_score", between="category", data=data, detailed=True)

| | Source | SS | DF | MS | F | p-unc | np2 |
|---|---|---|---|---|---|---|---|
| 0 | category | 28908.194891 | 3 | 9636.064964 | 113.998413 | 1.832508e-64 | 0.232968 |
| 1 | Within | 95178.598593 | 1126 | 84.528063 | NaN | NaN | NaN |
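The columns of this table are related by simple arithmetic. As a hedged worked example, the sketch below recomputes MS, F, the p-value, and the effect size from the SS and DF values printed above. For a one-way design, partial eta squared (`np2`) is the between-group sum of squares divided by the total sum of squares.

```python
from scipy import stats

# Values copied from the pingouin ANOVA table above.
ss_between, df_between = 28908.194891, 3
ss_within, df_within = 95178.598593, 1126

ms_between = ss_between / df_between  # MS for "category"
ms_within = ss_within / df_within     # MS for "Within"

f_value = ms_between / ms_within
p_value = stats.f.sf(f_value, df_between, df_within)  # upper tail of F
eta_sq = ss_between / (ss_between + ss_within)        # np2 in a one-way design

print(f"F({df_between}, {df_within}) = {f_value:.4f}")
print(f"p = {p_value:.3g}")
print(f"np2 = {eta_sq:.6f}")
```

An `np2` of about 0.23 means that group membership accounts for roughly 23% of the variance in school-level mean scores in this sample.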

## OLS Linear Regression

The ANOVA tells us that there is a significant difference in test results based on racial/ethnic group. We can run a regression analysis to help us isolate the impact of different factors on our dependent variable, mean_scale_score.

For this analysis we will look at the school demographics to analyze the mean ELA test score for All Students at the school.

# first choose the "factors" from our data fields that we believe impact mean_scale_score
data = df.copy()
data = data[data.category == "All Students"]
factors = [
    "total_enrollment",
    "female_pct",
    "asian_pct",
    "black_pct",
    "hispanic_pct",
    "white_pct",
    "swd_pct",
    "ell_pct",
    "poverty_pct",
    "eni_pct",
    "charter",
]

y = data["mean_scale_score"]
# add the intercept term; sm.OLS does not include one by default
X = sm.add_constant(data[factors])
model = sm.OLS(y, X).fit()

model.summary()

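OLS Regression Results

| Field | Value | Field | Value |
|---|---|---|---|
| Dep. Variable: | mean_scale_score | R-squared: | 0.699 |
| Model: | OLS | Adj. R-squared: | 0.694 |
| Method: | Least Squares | F-statistic: | 131.4 |
| Date: | Tue, 28 Feb 2023 | Prob (F-statistic): | 3.87e-154 |
| Time: | 17:41:20 | Log-Likelihood: | -1929.9 |
| No. Observations: | 634 | AIC: | 3884. |
| Df Residuals: | 622 | BIC: | 3937. |
| Df Model: | 11 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 615.1489 | 10.519 | 58.479 | 0.000 | 594.492 | 635.806 |
| total_enrollment | -0.0002 | 0.001 | -0.283 | 0.777 | -0.002 | 0.001 |
| female_pct | 8.4008 | 2.380 | 3.530 | 0.000 | 3.727 | 13.074 |
| asian_pct | 12.3692 | 11.826 | 1.046 | 0.296 | -10.855 | 35.593 |
| black_pct | -1.9667 | 11.588 | -0.170 | 0.865 | -24.723 | 20.790 |
| hispanic_pct | 4.0233 | 11.462 | 0.351 | 0.726 | -18.485 | 26.532 |
| white_pct | 6.6385 | 11.235 | 0.591 | 0.555 | -15.425 | 28.702 |
| swd_pct | -38.2322 | 3.837 | -9.964 | 0.000 | -45.767 | -30.697 |
| ell_pct | -35.6202 | 2.406 | -14.807 | 0.000 | -40.344 | -30.896 |
| poverty_pct | -8.5422 | 3.808 | -2.243 | 0.025 | -16.020 | -1.065 |
| eni_pct | -2.6267 | 3.664 | -0.717 | 0.474 | -9.821 | 4.568 |
| charter | 2.1596 | 0.598 | 3.614 | 0.000 | 0.986 | 3.333 |

| Field | Value | Field | Value |
|---|---|---|---|
| Omnibus: | 34.045 | Durbin-Watson: | 1.67 |
| Prob(Omnibus): | 0 | Jarque-Bera (JB): | 69.095 |
| Skew: | 0.326 | Prob(JB): | 9.92e-16 |
| Kurtosis: | 4.48 | Cond. No. | 90500 |

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.05e+04. This might indicate that there are strong multicollinearity or other numerical problems.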

# we can also pull specific data from the model
# here we create our own table with the factors, coefficients and p-values
# (the first entry is the intercept, so we skip it)
params = list(model.params.index[1:])
coefs = list(model.params.iloc[1:])
pvalues = list(model.pvalues.iloc[1:])

table = pd.DataFrame({"factor": params, "coef": coefs, "p-values": pvalues})
table.sort_values(by="coef")

| | factor | coef | p-values |
|---|---|---|---|
| 6 | swd_pct | -38.232201 | 8.505317e-22 |
| 7 | ell_pct | -35.620239 | 1.030998e-42 |
| 8 | poverty_pct | -8.542162 | 2.522178e-02 |
| 9 | eni_pct | -2.626731 | 4.736688e-01 |
| 3 | black_pct | -1.966710 | 8.652873e-01 |
| 0 | total_enrollment | -0.000191 | 7.772975e-01 |
| 10 | charter | 2.159607 | 3.258558e-04 |
| 4 | hispanic_pct | 4.023272 | 7.256952e-01 |
| 5 | white_pct | 6.638455 | 5.548231e-01 |
| 1 | female_pct | 8.400780 | 4.460913e-04 |
| 2 | asian_pct | 12.369227 | 2.960056e-01 |
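To read this table, it helps to separate the factors whose p-values fall below a chosen threshold and to translate their coefficients into score changes. The sketch below rebuilds the table from the values printed above and filters at alpha = 0.05. It assumes the `*_pct` columns are stored as fractions between 0 and 1 (the size of the coefficients suggests this, but the source does not state it), so a 10-percentage-point change corresponds to 0.10.

```python
import pandas as pd

# Coefficients and p-values copied from the table above.
coef_table = pd.DataFrame(
    {
        "factor": [
            "swd_pct", "ell_pct", "poverty_pct", "eni_pct", "black_pct",
            "total_enrollment", "charter", "hispanic_pct", "white_pct",
            "female_pct", "asian_pct",
        ],
        "coef": [
            -38.232201, -35.620239, -8.542162, -2.626731, -1.966710,
            -0.000191, 2.159607, 4.023272, 6.638455, 8.400780, 12.369227,
        ],
        "p_value": [
            8.505317e-22, 1.030998e-42, 2.522178e-02, 4.736688e-01,
            8.652873e-01, 7.772975e-01, 3.258558e-04, 7.256952e-01,
            5.548231e-01, 4.460913e-04, 2.960056e-01,
        ],
    }
)

alpha = 0.05
significant = coef_table[coef_table["p_value"] < alpha].sort_values("coef")

# Assumption: *_pct predictors are fractions (0 to 1), so a change of
# 10 percentage points is a change of 0.10 in the predictor.
is_pct = significant["factor"].str.endswith("_pct")
significant = significant.assign(
    change_per_10pp=(significant["coef"] * 0.10).where(is_pct)
)
print(significant)
```

Under that assumption, a school with 10 percentage points more students with disabilities would be predicted to score about 3.8 points lower on average, holding the other factors fixed. These are associations in school-level data, not causal effects.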