Loading Data
============
This notebook loads the NYC School Demographic data.
Working with this dataset, we look at some basic
[Pandas](https://pandas.pydata.org/) operations.

We assume that you have a basic understanding of Python and
Jupyter notebooks. This is a good start if you are new to
Pandas and data science in Python.

In particular, we cover:

- loading data from the `nycschools` package into a `DataFrame`
- using `head()`, `tail()`, and `Series` (columns) to understand the data
- accessing a column `Series` by name using index notation
- using `unique()`, `min()`, `max()`, `sum()`, and `mean()` to understand series data



In [5]:
# import schools from the nycschools package
from nycschools import schools
# load the demographic data into a `DataFrame` called df
df = schools.load_school_demographics()


Displaying data tables
-----------------------
If we display `df`, the notebook shows us some of the data from
the start of the data set and some from the end.

If we call `df.head()` we get the start of the data. `df.tail()` shows us the end of the data.

Comment/uncomment the different options to see how they work.

In [6]:
df

# df.head()
# df.tail()

| | dbn | beds | district | geo_district | boro | school_name | short_name | ay | year | total_enrollment | ... | missing_race_ethnicity_data_pct | swd_n | swd_pct | ell_n | ell_pct | poverty_n | poverty_pct | eni_pct | clean_name | zip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2016 | 2016-17 | 178 | ... | 0.000000 | 51 | 0.287000 | 12 | 0.067 | 152 | 0.854 | 0.882 | roberto clemente | 10009 |
| 1 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2017 | 2017-18 | 190 | ... | 0.000000 | 49 | 0.258000 | 8 | 0.042 | 161 | 0.847 | 0.890 | roberto clemente | 10009 |
| 2 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2018 | 2018-19 | 174 | ... | 0.000000 | 39 | 0.224000 | 8 | 0.046 | 147 | 0.845 | 0.888 | roberto clemente | 10009 |
| 3 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2019 | 2019-20 | 190 | ... | 0.000000 | 46 | 0.242000 | 17 | 0.089 | 155 | 0.816 | 0.867 | roberto clemente | 10009 |
| 4 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2020 | 2020-21 | 193 | ... | 0.000000 | 43 | 0.223000 | 21 | 0.109 | 158 | 0.819 | 0.856 | roberto clemente | 10009 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9996 | 84X730 | 320800860846 | 84 | 8 | Bronx | Bronx Charter School for the Arts | PS 730 | 2016 | 2016-17 | 320 | ... | 0.000000 | 67 | 0.209375 | 51 | 0.159 | 235 | 0.734 | 0.840 | bronx charter school for the arts | 10474 |
| 9997 | 84X730 | 320800860846 | 84 | 8 | Bronx | Bronx Charter School for the Arts | PS 730 | 2017 | 2017-18 | 314 | ... | 0.000000 | 68 | 0.216561 | 57 | 0.182 | 258 | 0.822 | 0.891 | bronx charter school for the arts | 10474 |
| 9998 | 84X730 | 320800860846 | 84 | 8 | Bronx | Bronx Charter School for the Arts | PS 730 | 2018 | 2018-19 | 430 | ... | 0.000000 | 103 | 0.239535 | 71 | 0.165 | 363 | 0.844 | 0.888 | bronx charter school for the arts | 10474 |
| 9999 | 84X730 | 320800860846 | 84 | 8 | Bronx | Bronx Charter School for the Arts | MS 730 | 2019 | 2019-20 | 523 | ... | 0.000000 | 117 | 0.223709 | 69 | 0.132 | 453 | 0.866 | 0.892 | bronx charter school for the arts | 10474 |
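
To try `head()` and `tail()` without loading the full data set, here is a minimal sketch on a small hand-built `DataFrame`. The name `toy` and its values are made up for illustration and are not part of `nycschools`.

```python
import pandas as pd

# Toy stand-in for the real data: 10 rows of made-up values.
toy = pd.DataFrame({
    "school_id": range(10),
    "enrollment": [n * 10 for n in range(10)],
})

print(toy.shape)      # (number of rows, number of columns)

first = toy.head()    # the first 5 rows by default
last = toy.tail(3)    # pass a number to choose how many rows you get

print(first)
print(last)
```

Both functions return a new `DataFrame`, so you can store the result in a variable or chain further calls onto it.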


In [7]:
# The `columns` property shows us the names of the columns in our `df`.
df.columns

Index(['dbn', 'beds', 'district', 'geo_district', 'boro', 'school_name',
       'short_name', 'ay', 'year', 'total_enrollment',
       'grade_3k_pk_half_day_full', 'grade_k', 'grade_1', 'grade_2', 'grade_3',
       'grade_4', 'grade_5', 'grade_6', 'grade_7', 'grade_8', 'grade_9',
       'grade_10', 'grade_11', 'grade_12', 'female_n', 'female_pct', 'male_n',
       'male_pct', 'asian_n', 'asian_pct', 'black_n', 'black_pct',
       'hispanic_n', 'hispanic_pct', 'multi_racial_n', 'multi_racial_pct',
       'native_american_n', 'native_american_pct', 'white_n', 'white_pct',
       'missing_race_ethnicity_data_n', 'missing_race_ethnicity_data_pct',
       'swd_n', 'swd_pct', 'ell_n', 'ell_pct', 'poverty_n', 'poverty_pct',
       'eni_pct', 'clean_name', 'zip'],
      dtype='object')
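
Because `columns` behaves much like a list of names, it is handy for checking that a column exists before you ask for it. The sketch below uses a one-row toy `DataFrame` (values copied from the first row shown above; the name `toy` is only for illustration).

```python
import pandas as pd

# One row copied from the table above, with only three of the columns.
toy = pd.DataFrame({
    "dbn": ["01M015"],
    "poverty_n": [152],
    "poverty_pct": [0.854],
})

names = list(toy.columns)                 # plain Python list of column names
has_pct = "poverty_pct" in toy.columns    # True: this column exists
has_poverty = "poverty" in toy.columns    # False: asking for toy["poverty"] would raise a KeyError

# pick out every column whose name ends in "_pct"
pct_cols = [name for name in toy.columns if name.endswith("_pct")]

print(names)
print(has_pct, has_poverty)
print(pct_cols)
```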

We can access just one column using index notation:
`df["poverty_pct"]` gives us just that column. We can then display or sort the data using either
the Python built-in function `sorted()` or the pandas `Series` method `sort_values()`.
If we call `unique()` we get an array of the unique values in the `Series`. In the case
of `poverty_pct`, that lets us check whether the column contains only numbers or also string data,
which can appear in place of raw numbers when the poverty level is too high or too low.

In [8]:
# get just the poverty_pct column
poverty = df["poverty_pct"]

# pandas also supports "dot notation" for access to columns,
# but you can't use dot notation if the column name is not a valid identifier,
# is a Python keyword, or matches the name of a property or method of the DataFrame

poverty = df.poverty_pct # same as df["poverty_pct"] above

poverty = poverty.sort_values()
print("Note: percentages are displayed as real numbers between 0..1")
print(poverty.unique())


Note: percentages are displayed as real numbers between 0..1
[0.04  0.05  0.056 0.057 0.059 0.06  0.061 0.062 0.063 0.064 0.065 0.067
 0.069 0.07  0.072 0.075 0.076 0.077 0.079 0.082 0.083 0.084 0.086 0.087
 0.088 0.089 0.09  0.091 0.092 0.093 0.094 0.097 0.098 0.1   0.101 0.103
 0.104 0.106 0.108 0.11  0.111 0.112 0.113 0.114 0.115 0.116 0.117 0.118
 0.12  0.121 0.122 0.123 0.124 0.126 0.128 0.129 0.13  0.131 0.132 0.134
 0.135 0.136 0.137 0.138 0.139 0.14  0.141 0.142 0.144 0.147 0.148 0.15
 0.151 0.152 0.153 0.155 0.156 0.157 0.159 0.16  0.161 0.162 0.163 0.165
 0.167 0.168 0.169 0.17  0.171 0.172 0.176 0.177 0.18  0.181 0.182 0.183
 0.185 0.186 0.187 0.188 0.19  0.191 0.192 0.193 0.195 0.196 0.197 0.198
 0.2   0.201 0.202 0.205 0.206 0.208 0.209 0.211 0.212 0.213 0.214 0.215
 0.216 0.217 0.219 0.22  0.221 0.222 0.223 0.225 0.226 0.229 0.23  0.231
 0.232 0.233 0.235 0.237 0.239 0.24  0.241 0.242 0.244 0.245 0.246 0.247
 0.248 0.249 0.251 0.252 0.253 0.254 0.255 0.256 0.257 0.258 0.2
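
The differences between index notation and dot notation, and between `sorted()` and `sort_values()`, are easier to see on a few rows. This is a minimal sketch: the `poverty_pct` and `total_enrollment` values are copied from the first five rows shown earlier, while the column named `min` is hypothetical and only there to show where dot notation breaks down.

```python
import pandas as pd

toy = pd.DataFrame({
    "poverty_pct": [0.854, 0.847, 0.845, 0.816, 0.819],
    "total_enrollment": [178, 190, 174, 190, 193],
    "min": [1, 2, 3, 4, 5],   # hypothetical column that shares its name with a method
})

# index notation and dot notation return the same Series
by_index = toy["poverty_pct"]
by_dot = toy.poverty_pct
same = by_index.equals(by_dot)

# "min" is also a DataFrame method, so dot notation gives us the method,
# not the column; index notation still works
dot_gives_method = callable(toy.min)
min_column = toy["min"].tolist()

# sort_values() returns a Series that keeps the original row labels;
# sorted() returns a plain Python list
as_series = by_index.sort_values()
as_list = sorted(by_index)

# unique() drops repeated values and keeps the order of first appearance
distinct = toy["total_enrollment"].unique().tolist()

print(same, dot_gives_method, min_column)
print(as_series.index.tolist())
print(as_list)
print(distinct)
```

Index notation works for every column name, which is why it is the safer default.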

We can get a subset of the data by using index notation with
a list of column names:

`df[ ["dbn", "school_name", "total_enrollment", "poverty_n" ] ]` returns a `DataFrame`
with 4 columns.

In [9]:
df[ ["dbn", "school_name", "total_enrollment", "poverty_n" ] ]

| | dbn | school_name | total_enrollment | poverty_n |
|---|---|---|---|---|
| 0 | 01M015 | P.S. 015 Roberto Clemente | 178 | 152 |
| 1 | 01M015 | P.S. 015 Roberto Clemente | 190 | 161 |
| 2 | 01M015 | P.S. 015 Roberto Clemente | 174 | 147 |
| 3 | 01M015 | P.S. 015 Roberto Clemente | 190 | 155 |
| 4 | 01M015 | P.S. 015 Roberto Clemente | 193 | 158 |
| ... | ... | ... | ... | ... |
| 9996 | 84X730 | Bronx Charter School for the Arts | 320 | 235 |
| 9997 | 84X730 | Bronx Charter School for the Arts | 314 | 258 |
| 9998 | 84X730 | Bronx Charter School for the Arts | 430 | 363 |
| 9999 | 84X730 | Bronx Charter School for the Arts | 523 | 453 |
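
A single column name gives back a `Series`, while a list of names gives back a `DataFrame`, and the summary functions from the introduction (`min()`, `max()`, `sum()`, `mean()`) work on both. The sketch below uses the five rows for P.S. 015 shown above; the name `toy` is only for illustration.

```python
import pandas as pd

# The first five rows from the table above.
toy = pd.DataFrame({
    "dbn": ["01M015"] * 5,
    "total_enrollment": [178, 190, 174, 190, 193],
    "poverty_n": [152, 161, 147, 155, 158],
})

one_col = toy["total_enrollment"]                   # a Series
two_cols = toy[["total_enrollment", "poverty_n"]]   # a DataFrame with 2 columns

print(type(one_col).__name__, type(two_cols).__name__)

# summary functions on a Series return a single value
print(one_col.min(), one_col.max(), one_col.sum(), one_col.mean())

# on a DataFrame they return one value per column
totals = two_cols.sum()
print(totals)
```

For these five rows the enrollment ranges from 174 to 193, with a total of 925 and a mean of 185.0.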
