Line & Scatter Plots#

This notebook introduces the methods of creating bar charts from our data, including some of the key concepts of our plotting package, matplotlib. Bar charts are useful for making comparisons between close data because we are very good at evaluating the size of rectangles.

Topics covered:

  • bar charts

  • styling charts (titles, labels, size, color, etc)

  • grouped bar charts

  • tacked bar charts

  • new school_data package for custom code

To work on these examples we will use the demographics and ELA/Math test score data.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy
from scipy.stats import pearsonr
from IPython.display import Markdown as md

from nycschools import schools, ui, exams
df = schools.load_school_demographics()
tests = exams.load_math_ela_long()

Scatter plots and correlations#

In the next section we’re going to see what demographic factors of schools impact student test scores on the NYS ELA exams, grades 3-8.

factors = ['total_enrollment', 'asian_pct', 'black_pct', 'hispanic_pct', 'white_pct', 'swd_pct', 'ell_pct', 'poverty_pct', 'eni_pct']

# create a table of the correlation matrix with a cool->warm color map
corr = df[factors].corr()
corr.style.background_gradient(cmap='coolwarm')
  total_enrollment asian_pct black_pct hispanic_pct white_pct swd_pct ell_pct poverty_pct eni_pct
total_enrollment 1.000000 0.350720 -0.251269 -0.095415 0.176726 -0.187808 -0.005513 -0.149179 -0.189138
asian_pct 0.350720 1.000000 -0.450250 -0.346040 0.191238 -0.213357 0.151237 -0.258741 -0.307453
black_pct -0.251269 -0.450250 1.000000 -0.428332 -0.456705 0.125320 -0.356624 0.303223 0.250974
hispanic_pct -0.095415 -0.346040 -0.428332 1.000000 -0.385142 0.077647 0.442527 0.465564 0.531882
white_pct 0.176726 0.191238 -0.456705 -0.385142 1.000000 -0.080427 -0.186733 -0.790558 -0.760384
swd_pct -0.187808 -0.213357 0.125320 0.077647 -0.080427 1.000000 0.031075 0.191904 0.266450
ell_pct -0.005513 0.151237 -0.356624 0.442527 -0.186733 0.031075 1.000000 0.346132 0.396029
poverty_pct -0.149179 -0.258741 0.303223 0.465564 -0.790558 0.191904 0.346132 1.000000 0.912243
eni_pct -0.189138 -0.307453 0.250974 0.531882 -0.760384 0.266450 0.396029 0.912243 1.000000
# scatter plot of all of the factors
sns.pairplot(df[factors], kind="scatter")
pass
../../_images/03-line-and-scatter-charts_5_0.png
sns.regplot(x=df["poverty_pct"], y=df["ell_pct"], line_kws={"color": "red"} )
plt.show()
../../_images/03-line-and-scatter-charts_6_0.png
g = sns.regplot(x=df["poverty_pct"], y=df["ell_pct"], line_kws={"color": "red"} )
g.figure.set_size_inches(18, 11)
g.set(title="% ELL and % Poverty")


# show the correlation plot in jupyter
r, p = pearsonr(df['poverty_pct'], df['ell_pct'])

degs_f = len(df) - 2

report_r = f"""
**A Pearson's correlation coefficient** was computed to assess the linear relationship 
between _% poverty_ and _% English Language Learners_ (ELL).

There was a positive correlation between the two variables, r({degs_f}) = {r:.2f}, p = {p:.3f}.
"""

display(md(report_r))
plt.show()

A Pearson's correlation coefficient was computed to assess the linear relationship between % poverty and % English Language Learners (ELL).

There was a positive correlation between the two variables, r(9999) = 0.35, p = 0.000.

../../_images/03-line-and-scatter-charts_7_1.png
# plot highly correlated % poverty and economic need index -- both measures look at poverty in schools
g = sns.regplot(x=df["poverty_pct"], y=df["eni_pct"], line_kws={"color": "red"} )
g.figure.set_size_inches(18, 11)

g.set(title="% Poverty and % Economic Need")

plt.show()
../../_images/03-line-and-scatter-charts_8_0.png

Drop outliers#

For the next correlation, first we’ll drop the outliers by computing z-scores. Outliers can distort the results of our correlations.

outliers = df[["white_pct", "total_enrollment"]]
x = outliers["total_enrollment"]
y = outliers["white_pct"]
g = sns.regplot(x=x, y=y, line_kws={"color": "red"})
g.figure.set_size_inches(18, 11)
r, p = pearsonr(x, y)
# use the ui helper funciont fmt_pearson to report the r and p values from our correlation
g.set(title=f"Total Enrollment and % White -- with outliers (r({len(x)-2}) = {r:.2f}, p = {p:.3f})")
plt.show()
pass
../../_images/03-line-and-scatter-charts_10_0.png