Loading Data#

This notebook loads the NYC School Demographic data. Working with this dataset, we look at some basic Pandas operations.

We assume that you have a basic understanding of Python and Jupyter notebooks. This is a good start if you are new to Pandas and data science in Python.

In particular:

loading data from the nycschools into a DataFrame
use head(), tail(), and Series (columns) to understand the data
access column Series by name using index notation
use unique(), min(), max(), sum(), and mean() to understand series data

# import schools from the nycschool package
from nycschools import schools
# load the demographic data into a `DataFrame` called df
df = schools.load_school_demographics()

Displaying data tables#

If we display df notebook shows us some of the data from the start of the data set and some from the end.

If we call df.head() we get the start of the data. df.tail() shows us the end of the data.

Comment/uncomment the different options to see how they work.

df

# df.head()
# df.tail()

	dbn	beds	district	geo_district	boro	school_name	short_name	ay	year	total_enrollment	...	missing_race_ethnicity_data_pct	swd_n	swd_pct	ell_n	ell_pct	poverty_n	poverty_pct	eni_pct	clean_name	zip
0	01M015	310100010015	1	1	Manhattan	P.S. 015 Roberto Clemente	PS 15	2016	2016-17	178	...	0.000000	51	0.287000	12	0.067	152	0.854	0.882	roberto clemente	10009
1	01M015	310100010015	1	1	Manhattan	P.S. 015 Roberto Clemente	PS 15	2017	2017-18	190	...	0.000000	49	0.258000	8	0.042	161	0.847	0.890	roberto clemente	10009
2	01M015	310100010015	1	1	Manhattan	P.S. 015 Roberto Clemente	PS 15	2018	2018-19	174	...	0.000000	39	0.224000	8	0.046	147	0.845	0.888	roberto clemente	10009
3	01M015	310100010015	1	1	Manhattan	P.S. 015 Roberto Clemente	PS 15	2019	2019-20	190	...	0.000000	46	0.242000	17	0.089	155	0.816	0.867	roberto clemente	10009
4	01M015	310100010015	1	1	Manhattan	P.S. 015 Roberto Clemente	PS 15	2020	2020-21	193	...	0.000000	43	0.223000	21	0.109	158	0.819	0.856	roberto clemente	10009
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9996	84X730	320800860846	84	8	Bronx	Bronx Charter School for the Arts	PS 730	2016	2016-17	320	...	0.000000	67	0.209375	51	0.159	235	0.734	0.840	bronx charter school for the arts	10474
9997	84X730	320800860846	84	8	Bronx	Bronx Charter School for the Arts	PS 730	2017	2017-18	314	...	0.000000	68	0.216561	57	0.182	258	0.822	0.891	bronx charter school for the arts	10474
9998	84X730	320800860846	84	8	Bronx	Bronx Charter School for the Arts	PS 730	2018	2018-19	430	...	0.000000	103	0.239535	71	0.165	363	0.844	0.888	bronx charter school for the arts	10474
9999	84X730	320800860846	84	8	Bronx	Bronx Charter School for the Arts	MS 730	2019	2019-20	523	...	0.000000	117	0.223709	69	0.132	453	0.866	0.892	bronx charter school for the arts	10474
10000	84X730	320800860846	84	8	Bronx	Bronx Charter School for the Arts	MS 730	2020	2020-21	626	...	0.001597	153	0.244409	78	0.125	541	0.864	0.882	bronx charter school for the arts	10474

10001 rows × 51 columns

# The the `columns` property shows us the names of the cols in our `df`.
df.columns

Index(['dbn', 'beds', 'district', 'geo_district', 'boro', 'school_name',
       'short_name', 'ay', 'year', 'total_enrollment',
       'grade_3k_pk_half_day_full', 'grade_k', 'grade_1', 'grade_2', 'grade_3',
       'grade_4', 'grade_5', 'grade_6', 'grade_7', 'grade_8', 'grade_9',
       'grade_10', 'grade_11', 'grade_12', 'female_n', 'female_pct', 'male_n',
       'male_pct', 'asian_n', 'asian_pct', 'black_n', 'black_pct',
       'hispanic_n', 'hispanic_pct', 'multi_racial_n', 'multi_racial_pct',
       'native_american_n', 'native_american_pct', 'white_n', 'white_pct',
       'missing_race_ethnicity_data_n', 'missing_race_ethnicity_data_pct',
       'swd_n', 'swd_pct', 'ell_n', 'ell_pct', 'poverty_n', 'poverty_pct',
       'eni_pct', 'clean_name', 'zip'],
      dtype='object')

We can access just one column using index notation – df["poverty"] gives us just that column. We can then display or sort the data using either the python built-in function sorted() or the pandas Series function sort_values(). If we call unique() we will get a list of the unique values in the Series. In the case of poverty that lets us see that the column contains string data, and not raw numbers if the poverty level is too high or too low.

# get just the poverty column
poverty = df["poverty_pct"]

# pandas also supports "dot notation" for access to columns
# but you can't use dot notation if the column name is not a valid identifier, python keyword, 
# or name of a property or member function of the dataframe

poverty = df.poverty_pct # same as line 2 above

poverty = poverty.sort_values()
print("Note: precentages are displayed as real numbers between 0..1")
print(poverty.unique()) 

Note: precentages are displayed as real numbers between 0..1
[0.04  0.05  0.056 0.057 0.059 0.06  0.061 0.062 0.063 0.064 0.065 0.067
069 0.07  0.072 0.075 0.076 0.077 0.079 0.082 0.083 0.084 0.086 0.087
088 0.089 0.09  0.091 0.092 0.093 0.094 0.097 0.098 0.1   0.101 0.103
104 0.106 0.108 0.11  0.111 0.112 0.113 0.114 0.115 0.116 0.117 0.118
12  0.121 0.122 0.123 0.124 0.126 0.128 0.129 0.13  0.131 0.132 0.134
135 0.136 0.137 0.138 0.139 0.14  0.141 0.142 0.144 0.147 0.148 0.15
151 0.152 0.153 0.155 0.156 0.157 0.159 0.16  0.161 0.162 0.163 0.165
167 0.168 0.169 0.17  0.171 0.172 0.176 0.177 0.18  0.181 0.182 0.183
185 0.186 0.187 0.188 0.19  0.191 0.192 0.193 0.195 0.196 0.197 0.198
2   0.201 0.202 0.205 0.206 0.208 0.209 0.211 0.212 0.213 0.214 0.215
216 0.217 0.219 0.22  0.221 0.222 0.223 0.225 0.226 0.229 0.23  0.231
232 0.233 0.235 0.237 0.239 0.24  0.241 0.242 0.244 0.245 0.246 0.247
248 0.249 0.251 0.252 0.253 0.254 0.255 0.256 0.257 0.258 0.259 0.26
261 0.262 0.263 0.264 0.267 0.268 0.269 0.27  0.272 0.273 0.274 0.275
276 0.277 0.278 0.279 0.281 0.282 0.283 0.284 0.285 0.286 0.288 0.289
29  0.291 0.292 0.293 0.294 0.295 0.296 0.297 0.298 0.299 0.3   0.302
303 0.304 0.305 0.306 0.307 0.308 0.309 0.31  0.311 0.312 0.313 0.314
315 0.316 0.317 0.318 0.319 0.32  0.321 0.322 0.323 0.325 0.326 0.327
328 0.329 0.33  0.331 0.332 0.333 0.334 0.335 0.336 0.337 0.338 0.339
341 0.342 0.343 0.344 0.345 0.346 0.347 0.348 0.349 0.35  0.351 0.352
353 0.354 0.355 0.356 0.357 0.358 0.359 0.36  0.361 0.362 0.363 0.364
365 0.367 0.368 0.369 0.37  0.371 0.372 0.373 0.374 0.375 0.376 0.378
379 0.38  0.381 0.382 0.383 0.384 0.385 0.386 0.387 0.388 0.389 0.39
391 0.392 0.394 0.395 0.396 0.397 0.398 0.399 0.4   0.401 0.402 0.403
404 0.405 0.406 0.407 0.408 0.409 0.41  0.411 0.412 0.413 0.414 0.415
416 0.417 0.418 0.419 0.42  0.421 0.422 0.423 0.424 0.425 0.426 0.427
428 0.429 0.431 0.432 0.433 0.434 0.435 0.436 0.437 0.438 0.439 0.44
441 0.442 0.443 0.444 0.445 0.446 0.447 0.448 0.449 0.45  0.451 0.452
453 0.454 0.455 0.456 0.457 0.458 0.459 0.46  0.461 0.462 0.463 0.464
465 0.466 0.467 0.468 0.469 0.47  0.472 0.473 0.474 0.475 0.476 0.477
478 0.479 0.48  0.481 0.482 0.484 0.485 0.486 0.487 0.488 0.489 0.49
491 0.492 0.493 0.494 0.495 0.496 0.497 0.498 0.499 0.5   0.501 0.502
503 0.504 0.505 0.507 0.508 0.509 0.51  0.511 0.512 0.513 0.514 0.515
516 0.517 0.518 0.519 0.52  0.521 0.522 0.523 0.524 0.526 0.527 0.528
529 0.53  0.531 0.532 0.533 0.534 0.535 0.536 0.537 0.538 0.539 0.54
541 0.542 0.543 0.544 0.545 0.546 0.547 0.548 0.549 0.55  0.551 0.552
553 0.554 0.555 0.556 0.557 0.558 0.559 0.56  0.561 0.562 0.563 0.564
565 0.566 0.567 0.568 0.569 0.57  0.571 0.572 0.573 0.574 0.575 0.576
577 0.578 0.579 0.58  0.581 0.582 0.583 0.584 0.585 0.586 0.587 0.588
589 0.59  0.591 0.592 0.593 0.594 0.595 0.596 0.597 0.598 0.599 0.6
601 0.602 0.603 0.604 0.605 0.606 0.607 0.608 0.609 0.61  0.611 0.612
613 0.614 0.615 0.616 0.617 0.618 0.619 0.62  0.621 0.622 0.623 0.624
625 0.626 0.627 0.628 0.629 0.63  0.631 0.632 0.633 0.634 0.635 0.636
637 0.638 0.639 0.64  0.641 0.642 0.643 0.644 0.645 0.646 0.647 0.648
649 0.65  0.651 0.652 0.653 0.654 0.655 0.656 0.657 0.658 0.659 0.66
661 0.662 0.663 0.664 0.665 0.666 0.667 0.668 0.669 0.67  0.671 0.672
673 0.674 0.675 0.676 0.677 0.678 0.679 0.68  0.681 0.682 0.683 0.684
685 0.686 0.687 0.688 0.689 0.69  0.691 0.692 0.693 0.694 0.695 0.696
697 0.698 0.699 0.7   0.701 0.702 0.703 0.704 0.705 0.706 0.707 0.708
709 0.71  0.711 0.712 0.713 0.714 0.715 0.716 0.717 0.718 0.719 0.72
721 0.722 0.723 0.724 0.725 0.726 0.727 0.728 0.729 0.73  0.731 0.732
733 0.734 0.735 0.736 0.737 0.738 0.739 0.74  0.741 0.742 0.743 0.744
745 0.746 0.747 0.748 0.749 0.75  0.751 0.752 0.753 0.754 0.755 0.756
757 0.758 0.759 0.76  0.761 0.762 0.763 0.764 0.765 0.766 0.767 0.768
769 0.77  0.771 0.772 0.773 0.774 0.775 0.776 0.777 0.778 0.779 0.78
781 0.782 0.783 0.784 0.785 0.786 0.787 0.788 0.789 0.79  0.791 0.792
793 0.794 0.795 0.796 0.797 0.798 0.799 0.8   0.801 0.802 0.803 0.804
805 0.806 0.807 0.808 0.809 0.81  0.811 0.812 0.813 0.814 0.815 0.816
817 0.818 0.819 0.82  0.821 0.822 0.823 0.824 0.825 0.826 0.827 0.828
829 0.83  0.831 0.832 0.833 0.834 0.835 0.836 0.837 0.838 0.839 0.84
841 0.842 0.843 0.844 0.845 0.846 0.847 0.848 0.849 0.85  0.851 0.852
853 0.854 0.855 0.856 0.857 0.858 0.859 0.86  0.861 0.862 0.863 0.864
865 0.866 0.867 0.868 0.869 0.87  0.871 0.872 0.873 0.874 0.875 0.876
877 0.878 0.879 0.88  0.881 0.882 0.883 0.884 0.885 0.886 0.887 0.888
889 0.89  0.891 0.892 0.893 0.894 0.895 0.896 0.897 0.898 0.899 0.9
901 0.902 0.903 0.904 0.905 0.906 0.907 0.908 0.909 0.91  0.911 0.912
913 0.914 0.915 0.916 0.917 0.918 0.919 0.92  0.921 0.922 0.923 0.924
925 0.926 0.927 0.928 0.929 0.93  0.931 0.932 0.933 0.934 0.935 0.936
937 0.938 0.939 0.94  0.941 0.942 0.943 0.944 0.945 0.946 0.947 0.948
949 0.95  0.96 ]

We can get a subset of the data by using index notation with a list of column names:

df[ ["dbn", "school_name", "total_enrollment", "poverty_n" ] ] returns a DataFrame with 4 columns.

df[ ["dbn", "school_name", "total_enrollment", "poverty_n" ] ]

	dbn	school_name	total_enrollment	poverty_n
0	01M015	P.S. 015 Roberto Clemente	178	152
1	01M015	P.S. 015 Roberto Clemente	190	161
2	01M015	P.S. 015 Roberto Clemente	174	147
3	01M015	P.S. 015 Roberto Clemente	190	155
4	01M015	P.S. 015 Roberto Clemente	193	158
...	...	...	...	...
9996	84X730	Bronx Charter School for the Arts	320	235
9997	84X730	Bronx Charter School for the Arts	314	258
9998	84X730	Bronx Charter School for the Arts	430	363
9999	84X730	Bronx Charter School for the Arts	523	453
10000	84X730	Bronx Charter School for the Arts	626	541

10001 rows × 4 columns

NYC Schools Open Data Portal

Loading Data

Contents

Loading Data#

Displaying data tables#