Loading Data#

This notebook loads the NYC School Demographic data. Working with this dataset, we look at some basic Pandas operations.

We assume that you have a basic understanding of Python and Jupyter notebooks. This is a good start if you are new to Pandas and data science in Python.

In particular:

  • loading data from the nycschools into a DataFrame

  • use head(), tail(), and Series (columns) to understand the data

  • access column Series by name using index notation

  • use unique(), min(), max(), sum(), and mean() to understand series data

# import schools from the nycschool package
from nycschools import schools
# load the demographic data into a `DataFrame` called df
df = schools.load_school_demographics()

Displaying data tables#

If we display df notebook shows us some of the data from the start of the data set and some from the end.

If we call df.head() we get the start of the data. df.tail() shows us the end of the data.

Comment/uncomment the different options to see how they work.

df

# df.head()
# df.tail()
dbn beds district geo_district boro school_name short_name ay year total_enrollment ... missing_race_ethnicity_data_pct swd_n swd_pct ell_n ell_pct poverty_n poverty_pct eni_pct clean_name zip
0 01M015 310100010015 1 1 Manhattan P.S. 015 Roberto Clemente PS 15 2016 2016-17 178 ... 0.000000 51 0.287000 12 0.067 152 0.854 0.882 roberto clemente 10009
1 01M015 310100010015 1 1 Manhattan P.S. 015 Roberto Clemente PS 15 2017 2017-18 190 ... 0.000000 49 0.258000 8 0.042 161 0.847 0.890 roberto clemente 10009
2 01M015 310100010015 1 1 Manhattan P.S. 015 Roberto Clemente PS 15 2018 2018-19 174 ... 0.000000 39 0.224000 8 0.046 147 0.845 0.888 roberto clemente 10009
3 01M015 310100010015 1 1 Manhattan P.S. 015 Roberto Clemente PS 15 2019 2019-20 190 ... 0.000000 46 0.242000 17 0.089 155 0.816 0.867 roberto clemente 10009
4 01M015 310100010015 1 1 Manhattan P.S. 015 Roberto Clemente PS 15 2020 2020-21 193 ... 0.000000 43 0.223000 21 0.109 158 0.819 0.856 roberto clemente 10009
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9996 84X730 320800860846 84 8 Bronx Bronx Charter School for the Arts PS 730 2016 2016-17 320 ... 0.000000 67 0.209375 51 0.159 235 0.734 0.840 bronx charter school for the arts 10474
9997 84X730 320800860846 84 8 Bronx Bronx Charter School for the Arts PS 730 2017 2017-18 314 ... 0.000000 68 0.216561 57 0.182 258 0.822 0.891 bronx charter school for the arts 10474
9998 84X730 320800860846 84 8 Bronx Bronx Charter School for the Arts PS 730 2018 2018-19 430 ... 0.000000 103 0.239535 71 0.165 363 0.844 0.888 bronx charter school for the arts 10474
9999 84X730 320800860846 84 8 Bronx Bronx Charter School for the Arts MS 730 2019 2019-20 523 ... 0.000000 117 0.223709 69 0.132 453 0.866 0.892 bronx charter school for the arts 10474
10000 84X730 320800860846 84 8 Bronx Bronx Charter School for the Arts MS 730 2020 2020-21 626 ... 0.001597 153 0.244409 78 0.125 541 0.864 0.882 bronx charter school for the arts 10474

10001 rows × 51 columns

# The the `columns` property shows us the names of the cols in our `df`.
df.columns
Index(['dbn', 'beds', 'district', 'geo_district', 'boro', 'school_name',
       'short_name', 'ay', 'year', 'total_enrollment',
       'grade_3k_pk_half_day_full', 'grade_k', 'grade_1', 'grade_2', 'grade_3',
       'grade_4', 'grade_5', 'grade_6', 'grade_7', 'grade_8', 'grade_9',
       'grade_10', 'grade_11', 'grade_12', 'female_n', 'female_pct', 'male_n',
       'male_pct', 'asian_n', 'asian_pct', 'black_n', 'black_pct',
       'hispanic_n', 'hispanic_pct', 'multi_racial_n', 'multi_racial_pct',
       'native_american_n', 'native_american_pct', 'white_n', 'white_pct',
       'missing_race_ethnicity_data_n', 'missing_race_ethnicity_data_pct',
       'swd_n', 'swd_pct', 'ell_n', 'ell_pct', 'poverty_n', 'poverty_pct',
       'eni_pct', 'clean_name', 'zip'],
      dtype='object')

We can access just one column using index notation – df["poverty"] gives us just that column. We can then display or sort the data using either the python built-in function sorted() or the pandas Series function sort_values(). If we call unique() we will get a list of the unique values in the Series. In the case of poverty that lets us see that the column contains string data, and not raw numbers if the poverty level is too high or too low.

# get just the poverty column
poverty = df["poverty_pct"]

# pandas also supports "dot notation" for access to columns
# but you can't use dot notation if the column name is not a valid identifier, python keyword, 
# or name of a property or member function of the dataframe

poverty = df.poverty_pct # same as line 2 above

poverty = poverty.sort_values()
print("Note: precentages are displayed as real numbers between 0..1")
print(poverty.unique()) 
Note: precentages are displayed as real numbers between 0..1
[0.04  0.05  0.056 0.057 0.059 0.06  0.061 0.062 0.063 0.064 0.065 0.067
 0.069 0.07  0.072 0.075 0.076 0.077 0.079 0.082 0.083 0.084 0.086 0.087
 0.088 0.089 0.09  0.091 0.092 0.093 0.094 0.097 0.098 0.1   0.101 0.103
 0.104 0.106 0.108 0.11  0.111 0.112 0.113 0.114 0.115 0.116 0.117 0.118
 0.12  0.121 0.122 0.123 0.124 0.126 0.128 0.129 0.13  0.131 0.132 0.134
 0.135 0.136 0.137 0.138 0.139 0.14  0.141 0.142 0.144 0.147 0.148 0.15
 0.151 0.152 0.153 0.155 0.156 0.157 0.159 0.16  0.161 0.162 0.163 0.165
 0.167 0.168 0.169 0.17  0.171 0.172 0.176 0.177 0.18  0.181 0.182 0.183
 0.185 0.186 0.187 0.188 0.19  0.191 0.192 0.193 0.195 0.196 0.197 0.198
 0.2   0.201 0.202 0.205 0.206 0.208 0.209 0.211 0.212 0.213 0.214 0.215
 0.216 0.217 0.219 0.22  0.221 0.222 0.223 0.225 0.226 0.229 0.23  0.231
 0.232 0.233 0.235 0.237 0.239 0.24  0.241 0.242 0.244 0.245 0.246 0.247
 0.248 0.249 0.251 0.252 0.253 0.254 0.255 0.256 0.257 0.258 0.259 0.26
 0.261 0.262 0.263 0.264 0.267 0.268 0.269 0.27  0.272 0.273 0.274 0.275
 0.276 0.277 0.278 0.279 0.281 0.282 0.283 0.284 0.285 0.286 0.288 0.289
 0.29  0.291 0.292 0.293 0.294 0.295 0.296 0.297 0.298 0.299 0.3   0.302
 0.303 0.304 0.305 0.306 0.307 0.308 0.309 0.31  0.311 0.312 0.313 0.314
 0.315 0.316 0.317 0.318 0.319 0.32  0.321 0.322 0.323 0.325 0.326 0.327
 0.328 0.329 0.33  0.331 0.332 0.333 0.334 0.335 0.336 0.337 0.338 0.339
 0.341 0.342 0.343 0.344 0.345 0.346 0.347 0.348 0.349 0.35  0.351 0.352
 0.353 0.354 0.355 0.356 0.357 0.358 0.359 0.36  0.361 0.362 0.363 0.364
 0.365 0.367 0.368 0.369 0.37  0.371 0.372 0.373 0.374 0.375 0.376 0.378
 0.379 0.38  0.381 0.382 0.383 0.384 0.385 0.386 0.387 0.388 0.389 0.39
 0.391 0.392 0.394 0.395 0.396 0.397 0.398 0.399 0.4   0.401 0.402 0.403
 0.404 0.405 0.406 0.407 0.408 0.409 0.41  0.411 0.412 0.413 0.414 0.415
 0.416 0.417 0.418 0.419 0.42  0.421 0.422 0.423 0.424 0.425 0.426 0.427
 0.428 0.429 0.431 0.432 0.433 0.434 0.435 0.436 0.437 0.438 0.439 0.44
 0.441 0.442 0.443 0.444 0.445 0.446 0.447 0.448 0.449 0.45  0.451 0.452
 0.453 0.454 0.455 0.456 0.457 0.458 0.459 0.46  0.461 0.462 0.463 0.464
 0.465 0.466 0.467 0.468 0.469 0.47  0.472 0.473 0.474 0.475 0.476 0.477
 0.478 0.479 0.48  0.481 0.482 0.484 0.485 0.486 0.487 0.488 0.489 0.49
 0.491 0.492 0.493 0.494 0.495 0.496 0.497 0.498 0.499 0.5   0.501 0.502
 0.503 0.504 0.505 0.507 0.508 0.509 0.51  0.511 0.512 0.513 0.514 0.515
 0.516 0.517 0.518 0.519 0.52  0.521 0.522 0.523 0.524 0.526 0.527 0.528
 0.529 0.53  0.531 0.532 0.533 0.534 0.535 0.536 0.537 0.538 0.539 0.54
 0.541 0.542 0.543 0.544 0.545 0.546 0.547 0.548 0.549 0.55  0.551 0.552
 0.553 0.554 0.555 0.556 0.557 0.558 0.559 0.56  0.561 0.562 0.563 0.564
 0.565 0.566 0.567 0.568 0.569 0.57  0.571 0.572 0.573 0.574 0.575 0.576
 0.577 0.578 0.579 0.58  0.581 0.582 0.583 0.584 0.585 0.586 0.587 0.588
 0.589 0.59  0.591 0.592 0.593 0.594 0.595 0.596 0.597 0.598 0.599 0.6
 0.601 0.602 0.603 0.604 0.605 0.606 0.607 0.608 0.609 0.61  0.611 0.612
 0.613 0.614 0.615 0.616 0.617 0.618 0.619 0.62  0.621 0.622 0.623 0.624
 0.625 0.626 0.627 0.628 0.629 0.63  0.631 0.632 0.633 0.634 0.635 0.636
 0.637 0.638 0.639 0.64  0.641 0.642 0.643 0.644 0.645 0.646 0.647 0.648
 0.649 0.65  0.651 0.652 0.653 0.654 0.655 0.656 0.657 0.658 0.659 0.66
 0.661 0.662 0.663 0.664 0.665 0.666 0.667 0.668 0.669 0.67  0.671 0.672
 0.673 0.674 0.675 0.676 0.677 0.678 0.679 0.68  0.681 0.682 0.683 0.684
 0.685 0.686 0.687 0.688 0.689 0.69  0.691 0.692 0.693 0.694 0.695 0.696
 0.697 0.698 0.699 0.7   0.701 0.702 0.703 0.704 0.705 0.706 0.707 0.708
 0.709 0.71  0.711 0.712 0.713 0.714 0.715 0.716 0.717 0.718 0.719 0.72
 0.721 0.722 0.723 0.724 0.725 0.726 0.727 0.728 0.729 0.73  0.731 0.732
 0.733 0.734 0.735 0.736 0.737 0.738 0.739 0.74  0.741 0.742 0.743 0.744
 0.745 0.746 0.747 0.748 0.749 0.75  0.751 0.752 0.753 0.754 0.755 0.756
 0.757 0.758 0.759 0.76  0.761 0.762 0.763 0.764 0.765 0.766 0.767 0.768
 0.769 0.77  0.771 0.772 0.773 0.774 0.775 0.776 0.777 0.778 0.779 0.78
 0.781 0.782 0.783 0.784 0.785 0.786 0.787 0.788 0.789 0.79  0.791 0.792
 0.793 0.794 0.795 0.796 0.797 0.798 0.799 0.8   0.801 0.802 0.803 0.804
 0.805 0.806 0.807 0.808 0.809 0.81  0.811 0.812 0.813 0.814 0.815 0.816
 0.817 0.818 0.819 0.82  0.821 0.822 0.823 0.824 0.825 0.826 0.827 0.828
 0.829 0.83  0.831 0.832 0.833 0.834 0.835 0.836 0.837 0.838 0.839 0.84
 0.841 0.842 0.843 0.844 0.845 0.846 0.847 0.848 0.849 0.85  0.851 0.852
 0.853 0.854 0.855 0.856 0.857 0.858 0.859 0.86  0.861 0.862 0.863 0.864
 0.865 0.866 0.867 0.868 0.869 0.87  0.871 0.872 0.873 0.874 0.875 0.876
 0.877 0.878 0.879 0.88  0.881 0.882 0.883 0.884 0.885 0.886 0.887 0.888
 0.889 0.89  0.891 0.892 0.893 0.894 0.895 0.896 0.897 0.898 0.899 0.9
 0.901 0.902 0.903 0.904 0.905 0.906 0.907 0.908 0.909 0.91  0.911 0.912
 0.913 0.914 0.915 0.916 0.917 0.918 0.919 0.92  0.921 0.922 0.923 0.924
 0.925 0.926 0.927 0.928 0.929 0.93  0.931 0.932 0.933 0.934 0.935 0.936
 0.937 0.938 0.939 0.94  0.941 0.942 0.943 0.944 0.945 0.946 0.947 0.948
 0.949 0.95  0.96 ]

We can get a subset of the data by using index notation with a list of column names:

df[ ["dbn", "school_name", "total_enrollment", "poverty_n" ] ] returns a DataFrame with 4 columns.

df[ ["dbn", "school_name", "total_enrollment", "poverty_n" ] ]
dbn school_name total_enrollment poverty_n
0 01M015 P.S. 015 Roberto Clemente 178 152
1 01M015 P.S. 015 Roberto Clemente 190 161
2 01M015 P.S. 015 Roberto Clemente 174 147
3 01M015 P.S. 015 Roberto Clemente 190 155
4 01M015 P.S. 015 Roberto Clemente 193 158
... ... ... ... ...
9996 84X730 Bronx Charter School for the Arts 320 235
9997 84X730 Bronx Charter School for the Arts 314 258
9998 84X730 Bronx Charter School for the Arts 430 363
9999 84X730 Bronx Charter School for the Arts 523 453
10000 84X730 Bronx Charter School for the Arts 626 541

10001 rows × 4 columns