ANOVA and OLS Regression#
This notebook combines school demographic data and New York State ELA test scores to examine the factors that predict test scores. In it we run a t-test to check for a statistically significant difference between the scores of White students and Black students.
We then run an analysis of variance (ANOVA) across the four main ethnic/racial groups to see whether the differences in outcomes are statistically significant. We run an OLS regression on these groups with “All Students” as a baseline reference category and display the ANOVA table and regression summary.
Finally, we run a different OLS regression to examine the school demographic factors and their effect on test scores.
In this notebook:
t-test (review)
ANOVA (analysis of variance)
OLS (ordinary least squares) regression
# import the libraries used in this notebook
import pandas as pd
import numpy as np
import scipy.stats
import pingouin as pg
import statsmodels.api as sm
from IPython.display import Markdown as md
from nycschools import schools, exams
# load the demographic data and merge it with the ELA data
df = schools.load_school_demographics()
# load the data from the csv file
ela = exams.load_ela()
# drop the rows with NaN (where the population is too small to report)
ela = ela[ela["mean_scale_score"].notnull()]
df = df.merge(ela, how="inner", on=["dbn", "ay"])
# for this analysis we will only look at grade 8 scores for the 2018-19 school year,
# the last pre-covid year
df = df[df.grade == '8']
df = df[df.ay == 2018]
df = df[df.mean_scale_score.notnull()]
# create 5 groups as independent data frames
all_students = df[df["category"] == "All Students"][["dbn", "mean_scale_score"]]
black = df[df["category"] == "Black"][["dbn", "mean_scale_score"]]
white = df[df["category"] == "White"][["dbn", "mean_scale_score"]]
hispanic = df[df["category"] == "Hispanic"][["dbn", "mean_scale_score"]]
asian = df[df["category"] == "Asian"][["dbn", "mean_scale_score"]]
t-test: white and black students#
Before running the ANOVA and further analysis, we run a t-test between Black and White students to determine whether there is a significant difference between their average test scores.
# calculate the mean test score and standard deviation for each group
mean_std = df.groupby('category').agg(Mean=('mean_scale_score', 'mean'), STD=('mean_scale_score', 'std'))
display(md("**Mean and standard deviation of test scores for each group.**"))
display(mean_std)
# run a t-test to see if there is a statistical difference between white and black student scores
# (ttest_ind assumes equal variances by default)
t = scipy.stats.ttest_ind(white["mean_scale_score"], black["mean_scale_score"])
display(md(f"""
**T-Test results** comparing school averages of
White (`n={white["dbn"].count()}`) and Black (`n={black["dbn"].count()}`)
students' 8th grade ELA scores for the 2018-19 academic year.
- White students: M={white["mean_scale_score"].mean():.04f}, SD={white["mean_scale_score"].std():.04f}
- Black students: M={black["mean_scale_score"].mean():.04f}, SD={black["mean_scale_score"].std():.04f}
- T-score: {round(t.statistic, 4)}, p-val: {round(t.pvalue, 4)}
`n` values report the number of schools observed, not the number of test takers. Further analysis
will report on the t-test for weighted means that account for school size.
We see that there is a statistically significant difference in test scores between the groups.
"""))
Mean and standard deviation of test scores for each group.
category | Mean | STD |
---|---|---|
All Students | 601.561093 | 9.266628 |
Asian | 610.011383 | 10.234169 |
Black | 598.277534 | 7.550302 |
Current ELL | 577.753650 | 6.179677 |
Econ Disadv | 601.216605 | 8.690372 |
Ever ELL | 604.167463 | 6.260445 |
Female | 603.935732 | 9.382392 |
Hispanic | 598.379582 | 8.639729 |
Male | 597.740034 | 9.912043 |
Never ELL | 602.036189 | 8.229933 |
Not Econ Disadv | 605.578228 | 9.724134 |
Not SWD | 603.212482 | 9.386994 |
SWD | 587.826847 | 7.421250 |
White | 607.967339 | 11.660535 |
T-Test results comparing school averages of White (n=191) and Black (n=348) students' 8th grade ELA scores for the 2018-19 academic year.
White students: M=607.9673, SD=11.6605
Black students: M=598.2775, SD=7.5503
T-score: 11.675, p-val: 0.0
n values report the number of schools observed, not the number of test takers. Further analysis will report on the t-test for weighted means that account for school size.
We see that there is a statistically significant difference in test scores between the groups.
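The two groups above have noticeably different standard deviations (about 11.66 vs. 7.55) and different sizes, while `scipy.stats.ttest_ind` assumes equal variances by default. The sketch below compares the default pooled test with Welch's unequal-variance test. It uses synthetic scores drawn to roughly match the reported group sizes, means, and standard deviations; the arrays `white_scores` and `black_scores` are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two groups (assumed values, not the real data)
rng = np.random.default_rng(42)
white_scores = rng.normal(loc=608.0, scale=11.7, size=191)
black_scores = rng.normal(loc=598.3, scale=7.6, size=348)

# Student's t-test (the default) pools the variances of both groups
pooled = stats.ttest_ind(white_scores, black_scores)

# Welch's t-test drops the equal-variance assumption
welch = stats.ttest_ind(white_scores, black_scores, equal_var=False)

print(f"pooled: t={pooled.statistic:.3f}, p={pooled.pvalue:.3g}")
print(f"welch:  t={welch.statistic:.3f}, p={welch.pvalue:.3g}")
```

When the smaller group has the larger variance, as here, the pooled test tends to overstate the t-statistic, so Welch's version is often the more cautious choice.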
ANOVA & f-values#
In the example below we calculate the F-statistic to see if there are significant differences in test scores based on the racial/ethnic group of the test takers. We compare the four main groups in NYC Schools: Asian, Black, Hispanic, and White.
# run a one way anova to test if there is significant difference between the
# average test scores at the school level of 4 different racial/ethnic groups
fvalue, pvalue = scipy.stats.f_oneway(
asian["mean_scale_score"],
black["mean_scale_score"],
hispanic["mean_scale_score"],
white["mean_scale_score"])
results = f"""
A **one-way between subjects ANOVA** was conducted to compare the effect of
racial/ethnic group on the test score for 8th grade NYS ELA exams
during the 2018-2019 academic year.
The four groups in the test are: Asian (n={len(asian)}), Black (n={len(black)}), Latinx (n={len(hispanic)}),
and White (n={len(white)}) students.
There was a significant effect of racial/ethnic group on test score at
the p<.001 level for the four conditions, [p={pvalue:.04f}, F={fvalue:.04f}].
_`n` values are the number of schools reported, not the number of test takers in each group_
"""
md(results)
A one-way between subjects ANOVA was conducted to compare the effect of racial/ethnic group on the test score for 8th grade NYS ELA exams during the 2018-2019 academic year.
The four groups in the test are: Asian (n=187), Black (n=348), Latinx (n=404), and White (n=191) students.
There was a significant effect of racial/ethnic group on test score at the p<.001 level for the four conditions, [p=0.0000, F=113.9984].
n values are the number of schools reported, not the number of test takers in each group
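To make the F-statistic less of a black box, the sketch below computes it by hand as the ratio of the between-group mean square to the within-group mean square, then compares the result with `scipy.stats.f_oneway`. The three groups are synthetic (assumed means and sizes), and `one_way_f` is a hypothetical helper written only for this illustration.

```python
import numpy as np
from scipy import stats

def one_way_f(*groups):
    """Return the one-way ANOVA F-statistic and p-value for the given samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                              # number of groups
    n_total = sum(len(g) for g in groups)        # total observations
    grand_mean = np.concatenate(groups).mean()

    # variation of the group means around the grand mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    f = ms_between / ms_within
    p = stats.f.sf(f, k - 1, n_total - k)        # upper tail of the F distribution
    return f, p

# Synthetic groups (assumed values, not the notebook's data)
rng = np.random.default_rng(0)
a = rng.normal(600, 9, 50)
b = rng.normal(604, 9, 60)
c = rng.normal(598, 9, 40)

f_manual, p_manual = one_way_f(a, b, c)
f_scipy, p_scipy = stats.f_oneway(a, b, c)
print(f"manual: F={f_manual:.4f}, p={p_manual:.4g}")
print(f"scipy:  F={f_scipy:.4f}, p={p_scipy:.4g}")
```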
Pingouin results#
When we ran the t-test we saw that `pingouin` offers some useful additional features. The API (the syntax for using it) of `pingouin` differs from that of `scipy`. Before we run the test, we create a new single dataframe with just the columns we care about. We tell the function which column is the dependent variable and which specifies the groups.
Below we show the pingouin ANOVA.
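# build a dataframe with only the four groups and the columns pingouin needs
data = df.copy()
data = data[data.category.isin(["Asian", "Black", "Hispanic", "White"])]
data = data[["category", "mean_scale_score"]]
# dv is the dependent variable; between is the column that defines the groups
pg.anova(dv='mean_scale_score', between='category', data=data, detailed=True)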
Source | SS | DF | MS | F | p-unc | np2 | |
---|---|---|---|---|---|---|---|
0 | category | 28908.194891 | 3 | 9636.064964 | 113.998413 | 1.832508e-64 | 0.232968 |
1 | Within | 95178.598593 | 1126 | 84.528063 | NaN | NaN | NaN |
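The columns of this table are related by simple arithmetic. Each mean square (MS) is the sum of squares (SS) divided by its degrees of freedom, F is the ratio of the two mean squares, and `np2` (partial eta-squared, which in a one-way design equals the between-group SS divided by the total SS) is the share of variance associated with group. The within-group degrees of freedom, 1126, is the 1130 school-level observations minus the 4 groups. The check below plugs in the values reported in the table.

```python
# Values copied from the pingouin ANOVA table above
ss_between, df_between = 28908.194891, 3
ss_within, df_within = 95178.598593, 1126

ms_between = ss_between / df_between               # mean square for category
ms_within = ss_within / df_within                  # mean square within groups
f_value = ms_between / ms_within                   # F-statistic
eta_sq = ss_between / (ss_between + ss_within)     # share of variance explained

print(f"MS between = {ms_between:.6f}")
print(f"MS within  = {ms_within:.6f}")
print(f"F          = {f_value:.6f}")
print(f"np2        = {eta_sq:.6f}")
```

Read this way, racial/ethnic group accounts for roughly 23% of the variance in school-level mean scores across these four groups.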
OLS Linear Regression#
The ANOVA tells us that there is a significant difference in test results based on racial/ethnic group. We can run a regression analysis to help us isolate the impact of different factors on our dependent variable, `mean_scale_score`.
For this analysis we will look at the school demographics to analyze the mean ELA test score for All Students at the school.
# first choose the "factors" from our data fields that we believe impact mean_scale_score
data = df.copy()
data = data[data.category == "All Students"]
factors = [
'total_enrollment',
'female_pct',
'asian_pct',
'black_pct',
'hispanic_pct',
'white_pct',
'swd_pct',
'ell_pct',
'poverty_pct',
'eni_pct',
'charter']
y = data['mean_scale_score']
X = data[factors]
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
model.summary()
Dep. Variable: | mean_scale_score | R-squared: | 0.699 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.694 |
Method: | Least Squares | F-statistic: | 131.4 |
Date: | Tue, 28 Feb 2023 | Prob (F-statistic): | 3.87e-154 |
Time: | 17:41:20 | Log-Likelihood: | -1929.9 |
No. Observations: | 634 | AIC: | 3884. |
Df Residuals: | 622 | BIC: | 3937. |
Df Model: | 11 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | 615.1489 | 10.519 | 58.479 | 0.000 | 594.492 | 635.806 |
total_enrollment | -0.0002 | 0.001 | -0.283 | 0.777 | -0.002 | 0.001 |
female_pct | 8.4008 | 2.380 | 3.530 | 0.000 | 3.727 | 13.074 |
asian_pct | 12.3692 | 11.826 | 1.046 | 0.296 | -10.855 | 35.593 |
black_pct | -1.9667 | 11.588 | -0.170 | 0.865 | -24.723 | 20.790 |
hispanic_pct | 4.0233 | 11.462 | 0.351 | 0.726 | -18.485 | 26.532 |
white_pct | 6.6385 | 11.235 | 0.591 | 0.555 | -15.425 | 28.702 |
swd_pct | -38.2322 | 3.837 | -9.964 | 0.000 | -45.767 | -30.697 |
ell_pct | -35.6202 | 2.406 | -14.807 | 0.000 | -40.344 | -30.896 |
poverty_pct | -8.5422 | 3.808 | -2.243 | 0.025 | -16.020 | -1.065 |
eni_pct | -2.6267 | 3.664 | -0.717 | 0.474 | -9.821 | 4.568 |
charter | 2.1596 | 0.598 | 3.614 | 0.000 | 0.986 | 3.333 |
Omnibus: | 34.045 | Durbin-Watson: | 1.670 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 69.095 |
Skew: | 0.326 | Prob(JB): | 9.92e-16 |
Kurtosis: | 4.480 | Cond. No. | 9.05e+04 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.05e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
# we can also pull specific data from the model
# here we create our own table with the factors, coefficients and p-values
# (the [1:] slices skip the constant term)
params = list(model.params.index[1:])
coefs = list(model.params.values[1:])
pvalues = list(model.pvalues.values[1:])
table = pd.DataFrame({"factor": params, "coef": coefs, "p-values": pvalues})
table.sort_values(by="coef")
factor | coef | p-values | |
---|---|---|---|
6 | swd_pct | -38.232201 | 8.505317e-22 |
7 | ell_pct | -35.620239 | 1.030998e-42 |
8 | poverty_pct | -8.542162 | 2.522178e-02 |
9 | eni_pct | -2.626731 | 4.736688e-01 |
3 | black_pct | -1.966710 | 8.652873e-01 |
0 | total_enrollment | -0.000191 | 7.772975e-01 |
10 | charter | 2.159607 | 3.258558e-04 |
4 | hispanic_pct | 4.023272 | 7.256952e-01 |
5 | white_pct | 6.638455 | 5.548231e-01 |
1 | female_pct | 8.400780 | 4.460913e-04 |
2 | asian_pct | 12.369227 | 2.960056e-01 |
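Each coefficient is the predicted change in `mean_scale_score` for a one-unit change in that factor, holding the other factors fixed. The size of the percentage coefficients suggests that the `*_pct` columns are stored as fractions between 0 and 1. This is an assumption, not something the source confirms. If it holds, a 10 percentage point change corresponds to 0.10 units. A small worked example using coefficients from the table above:

```python
# Coefficients copied from the table above
coefs = {"swd_pct": -38.232201, "ell_pct": -35.620239, "charter": 2.159607}

# Assumption: *_pct columns are fractions in [0, 1], so 10 points = 0.10 units
delta = 0.10
changes = {factor: coefs[factor] * delta for factor in ("swd_pct", "ell_pct")}

for factor, change in changes.items():
    print(f"{factor}: +10 percentage points -> {change:+.2f} scale-score points")

# Assumption: charter is a 0/1 indicator, so its coefficient is the predicted gap itself
print(f"charter vs. non-charter: {coefs['charter']:+.2f} scale-score points")
```

Only factors with small p-values (here `swd_pct`, `ell_pct`, `poverty_pct`, `female_pct`, and `charter`) have coefficients that are distinguishable from zero in this model.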