Loading Data
Contents
Loading Data#
This notebook loads the NYC School Demographic data. Working with this dataset, we look at some basic Pandas operations.
We assume that you have a basic understanding of Python and Jupyter notebooks. This is a good start if you are new to Pandas and data science in Python.
In particular:
loading data from the
nycschools
into aDataFrame
use
head()
,tail()
, andSeries
(columns) to understand the dataaccess column
Series
by name using index notationuse
unique()
,min()
,max()
,sum()
, andmean()
to understand series data
# import schools from the nycschool package
from nycschools import schools
# load the demographic data into a `DataFrame` called df
df = schools.load_school_demographics()
Displaying data tables#
If we display df
notebook shows us some of the data from
the start of the data set and some from the end.
If we call df.head()
we get the start of the data. df.tail()
shows us the end of the data.
Comment/uncomment the different options to see how they work.
df
# df.head()
# df.tail()
dbn | beds | district | geo_district | boro | school_name | short_name | ay | year | total_enrollment | ... | missing_race_ethnicity_data_pct | swd_n | swd_pct | ell_n | ell_pct | poverty_n | poverty_pct | eni_pct | clean_name | zip | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2016 | 2016-17 | 178 | ... | 0.000000 | 51 | 0.287000 | 12 | 0.067 | 152 | 0.854 | 0.882 | roberto clemente | 10009 |
1 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2017 | 2017-18 | 190 | ... | 0.000000 | 49 | 0.258000 | 8 | 0.042 | 161 | 0.847 | 0.890 | roberto clemente | 10009 |
2 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2018 | 2018-19 | 174 | ... | 0.000000 | 39 | 0.224000 | 8 | 0.046 | 147 | 0.845 | 0.888 | roberto clemente | 10009 |
3 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2019 | 2019-20 | 190 | ... | 0.000000 | 46 | 0.242000 | 17 | 0.089 | 155 | 0.816 | 0.867 | roberto clemente | 10009 |
4 | 01M015 | 310100010015 | 1 | 1 | Manhattan | P.S. 015 Roberto Clemente | PS 15 | 2020 | 2020-21 | 193 | ... | 0.000000 | 43 | 0.223000 | 21 | 0.109 | 158 | 0.819 | 0.856 | roberto clemente | 10009 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9996 | 84X730 | 320800860846 | 84 | 8 | Bronx | Bronx Charter School for the Arts | PS 730 | 2016 | 2016-17 | 320 | ... | 0.000000 | 67 | 0.209375 | 51 | 0.159 | 235 | 0.734 | 0.840 | bronx charter school for the arts | 10474 |
9997 | 84X730 | 320800860846 | 84 | 8 | Bronx | Bronx Charter School for the Arts | PS 730 | 2017 | 2017-18 | 314 | ... | 0.000000 | 68 | 0.216561 | 57 | 0.182 | 258 | 0.822 | 0.891 | bronx charter school for the arts | 10474 |
9998 | 84X730 | 320800860846 | 84 | 8 | Bronx | Bronx Charter School for the Arts | PS 730 | 2018 | 2018-19 | 430 | ... | 0.000000 | 103 | 0.239535 | 71 | 0.165 | 363 | 0.844 | 0.888 | bronx charter school for the arts | 10474 |
9999 | 84X730 | 320800860846 | 84 | 8 | Bronx | Bronx Charter School for the Arts | MS 730 | 2019 | 2019-20 | 523 | ... | 0.000000 | 117 | 0.223709 | 69 | 0.132 | 453 | 0.866 | 0.892 | bronx charter school for the arts | 10474 |
10000 | 84X730 | 320800860846 | 84 | 8 | Bronx | Bronx Charter School for the Arts | MS 730 | 2020 | 2020-21 | 626 | ... | 0.001597 | 153 | 0.244409 | 78 | 0.125 | 541 | 0.864 | 0.882 | bronx charter school for the arts | 10474 |
10001 rows × 51 columns
# The the `columns` property shows us the names of the cols in our `df`.
df.columns
Index(['dbn', 'beds', 'district', 'geo_district', 'boro', 'school_name',
'short_name', 'ay', 'year', 'total_enrollment',
'grade_3k_pk_half_day_full', 'grade_k', 'grade_1', 'grade_2', 'grade_3',
'grade_4', 'grade_5', 'grade_6', 'grade_7', 'grade_8', 'grade_9',
'grade_10', 'grade_11', 'grade_12', 'female_n', 'female_pct', 'male_n',
'male_pct', 'asian_n', 'asian_pct', 'black_n', 'black_pct',
'hispanic_n', 'hispanic_pct', 'multi_racial_n', 'multi_racial_pct',
'native_american_n', 'native_american_pct', 'white_n', 'white_pct',
'missing_race_ethnicity_data_n', 'missing_race_ethnicity_data_pct',
'swd_n', 'swd_pct', 'ell_n', 'ell_pct', 'poverty_n', 'poverty_pct',
'eni_pct', 'clean_name', 'zip'],
dtype='object')
We can access just one column using index notation –
df["poverty"]
gives us just that column. We can then display or sort the data using either
the python built-in function sorted()
or the pandas Series
function sort_values()
.
If we call unique()
we will get a list of the unique values in the Series
. In the case
of poverty
that lets us see that the column contains string data, and not raw numbers
if the poverty level is too high or too low.
# get just the poverty column
poverty = df["poverty_pct"]
# pandas also supports "dot notation" for access to columns
# but you can't use dot notation if the column name is not a valid identifier, python keyword,
# or name of a property or member function of the dataframe
poverty = df.poverty_pct # same as line 2 above
poverty = poverty.sort_values()
print("Note: precentages are displayed as real numbers between 0..1")
print(poverty.unique())
Note: precentages are displayed as real numbers between 0..1
[0.04 0.05 0.056 0.057 0.059 0.06 0.061 0.062 0.063 0.064 0.065 0.067
0.069 0.07 0.072 0.075 0.076 0.077 0.079 0.082 0.083 0.084 0.086 0.087
0.088 0.089 0.09 0.091 0.092 0.093 0.094 0.097 0.098 0.1 0.101 0.103
0.104 0.106 0.108 0.11 0.111 0.112 0.113 0.114 0.115 0.116 0.117 0.118
0.12 0.121 0.122 0.123 0.124 0.126 0.128 0.129 0.13 0.131 0.132 0.134
0.135 0.136 0.137 0.138 0.139 0.14 0.141 0.142 0.144 0.147 0.148 0.15
0.151 0.152 0.153 0.155 0.156 0.157 0.159 0.16 0.161 0.162 0.163 0.165
0.167 0.168 0.169 0.17 0.171 0.172 0.176 0.177 0.18 0.181 0.182 0.183
0.185 0.186 0.187 0.188 0.19 0.191 0.192 0.193 0.195 0.196 0.197 0.198
0.2 0.201 0.202 0.205 0.206 0.208 0.209 0.211 0.212 0.213 0.214 0.215
0.216 0.217 0.219 0.22 0.221 0.222 0.223 0.225 0.226 0.229 0.23 0.231
0.232 0.233 0.235 0.237 0.239 0.24 0.241 0.242 0.244 0.245 0.246 0.247
0.248 0.249 0.251 0.252 0.253 0.254 0.255 0.256 0.257 0.258 0.259 0.26
0.261 0.262 0.263 0.264 0.267 0.268 0.269 0.27 0.272 0.273 0.274 0.275
0.276 0.277 0.278 0.279 0.281 0.282 0.283 0.284 0.285 0.286 0.288 0.289
0.29 0.291 0.292 0.293 0.294 0.295 0.296 0.297 0.298 0.299 0.3 0.302
0.303 0.304 0.305 0.306 0.307 0.308 0.309 0.31 0.311 0.312 0.313 0.314
0.315 0.316 0.317 0.318 0.319 0.32 0.321 0.322 0.323 0.325 0.326 0.327
0.328 0.329 0.33 0.331 0.332 0.333 0.334 0.335 0.336 0.337 0.338 0.339
0.341 0.342 0.343 0.344 0.345 0.346 0.347 0.348 0.349 0.35 0.351 0.352
0.353 0.354 0.355 0.356 0.357 0.358 0.359 0.36 0.361 0.362 0.363 0.364
0.365 0.367 0.368 0.369 0.37 0.371 0.372 0.373 0.374 0.375 0.376 0.378
0.379 0.38 0.381 0.382 0.383 0.384 0.385 0.386 0.387 0.388 0.389 0.39
0.391 0.392 0.394 0.395 0.396 0.397 0.398 0.399 0.4 0.401 0.402 0.403
0.404 0.405 0.406 0.407 0.408 0.409 0.41 0.411 0.412 0.413 0.414 0.415
0.416 0.417 0.418 0.419 0.42 0.421 0.422 0.423 0.424 0.425 0.426 0.427
0.428 0.429 0.431 0.432 0.433 0.434 0.435 0.436 0.437 0.438 0.439 0.44
0.441 0.442 0.443 0.444 0.445 0.446 0.447 0.448 0.449 0.45 0.451 0.452
0.453 0.454 0.455 0.456 0.457 0.458 0.459 0.46 0.461 0.462 0.463 0.464
0.465 0.466 0.467 0.468 0.469 0.47 0.472 0.473 0.474 0.475 0.476 0.477
0.478 0.479 0.48 0.481 0.482 0.484 0.485 0.486 0.487 0.488 0.489 0.49
0.491 0.492 0.493 0.494 0.495 0.496 0.497 0.498 0.499 0.5 0.501 0.502
0.503 0.504 0.505 0.507 0.508 0.509 0.51 0.511 0.512 0.513 0.514 0.515
0.516 0.517 0.518 0.519 0.52 0.521 0.522 0.523 0.524 0.526 0.527 0.528
0.529 0.53 0.531 0.532 0.533 0.534 0.535 0.536 0.537 0.538 0.539 0.54
0.541 0.542 0.543 0.544 0.545 0.546 0.547 0.548 0.549 0.55 0.551 0.552
0.553 0.554 0.555 0.556 0.557 0.558 0.559 0.56 0.561 0.562 0.563 0.564
0.565 0.566 0.567 0.568 0.569 0.57 0.571 0.572 0.573 0.574 0.575 0.576
0.577 0.578 0.579 0.58 0.581 0.582 0.583 0.584 0.585 0.586 0.587 0.588
0.589 0.59 0.591 0.592 0.593 0.594 0.595 0.596 0.597 0.598 0.599 0.6
0.601 0.602 0.603 0.604 0.605 0.606 0.607 0.608 0.609 0.61 0.611 0.612
0.613 0.614 0.615 0.616 0.617 0.618 0.619 0.62 0.621 0.622 0.623 0.624
0.625 0.626 0.627 0.628 0.629 0.63 0.631 0.632 0.633 0.634 0.635 0.636
0.637 0.638 0.639 0.64 0.641 0.642 0.643 0.644 0.645 0.646 0.647 0.648
0.649 0.65 0.651 0.652 0.653 0.654 0.655 0.656 0.657 0.658 0.659 0.66
0.661 0.662 0.663 0.664 0.665 0.666 0.667 0.668 0.669 0.67 0.671 0.672
0.673 0.674 0.675 0.676 0.677 0.678 0.679 0.68 0.681 0.682 0.683 0.684
0.685 0.686 0.687 0.688 0.689 0.69 0.691 0.692 0.693 0.694 0.695 0.696
0.697 0.698 0.699 0.7 0.701 0.702 0.703 0.704 0.705 0.706 0.707 0.708
0.709 0.71 0.711 0.712 0.713 0.714 0.715 0.716 0.717 0.718 0.719 0.72
0.721 0.722 0.723 0.724 0.725 0.726 0.727 0.728 0.729 0.73 0.731 0.732
0.733 0.734 0.735 0.736 0.737 0.738 0.739 0.74 0.741 0.742 0.743 0.744
0.745 0.746 0.747 0.748 0.749 0.75 0.751 0.752 0.753 0.754 0.755 0.756
0.757 0.758 0.759 0.76 0.761 0.762 0.763 0.764 0.765 0.766 0.767 0.768
0.769 0.77 0.771 0.772 0.773 0.774 0.775 0.776 0.777 0.778 0.779 0.78
0.781 0.782 0.783 0.784 0.785 0.786 0.787 0.788 0.789 0.79 0.791 0.792
0.793 0.794 0.795 0.796 0.797 0.798 0.799 0.8 0.801 0.802 0.803 0.804
0.805 0.806 0.807 0.808 0.809 0.81 0.811 0.812 0.813 0.814 0.815 0.816
0.817 0.818 0.819 0.82 0.821 0.822 0.823 0.824 0.825 0.826 0.827 0.828
0.829 0.83 0.831 0.832 0.833 0.834 0.835 0.836 0.837 0.838 0.839 0.84
0.841 0.842 0.843 0.844 0.845 0.846 0.847 0.848 0.849 0.85 0.851 0.852
0.853 0.854 0.855 0.856 0.857 0.858 0.859 0.86 0.861 0.862 0.863 0.864
0.865 0.866 0.867 0.868 0.869 0.87 0.871 0.872 0.873 0.874 0.875 0.876
0.877 0.878 0.879 0.88 0.881 0.882 0.883 0.884 0.885 0.886 0.887 0.888
0.889 0.89 0.891 0.892 0.893 0.894 0.895 0.896 0.897 0.898 0.899 0.9
0.901 0.902 0.903 0.904 0.905 0.906 0.907 0.908 0.909 0.91 0.911 0.912
0.913 0.914 0.915 0.916 0.917 0.918 0.919 0.92 0.921 0.922 0.923 0.924
0.925 0.926 0.927 0.928 0.929 0.93 0.931 0.932 0.933 0.934 0.935 0.936
0.937 0.938 0.939 0.94 0.941 0.942 0.943 0.944 0.945 0.946 0.947 0.948
0.949 0.95 0.96 ]
We can get a subset of the data by using index notation with a list of column names:
df[ ["dbn", "school_name", "total_enrollment", "poverty_n" ] ]
returns a DataFrame
with 4 columns.
df[ ["dbn", "school_name", "total_enrollment", "poverty_n" ] ]
dbn | school_name | total_enrollment | poverty_n | |
---|---|---|---|---|
0 | 01M015 | P.S. 015 Roberto Clemente | 178 | 152 |
1 | 01M015 | P.S. 015 Roberto Clemente | 190 | 161 |
2 | 01M015 | P.S. 015 Roberto Clemente | 174 | 147 |
3 | 01M015 | P.S. 015 Roberto Clemente | 190 | 155 |
4 | 01M015 | P.S. 015 Roberto Clemente | 193 | 158 |
... | ... | ... | ... | ... |
9996 | 84X730 | Bronx Charter School for the Arts | 320 | 235 |
9997 | 84X730 | Bronx Charter School for the Arts | 314 | 258 |
9998 | 84X730 | Bronx Charter School for the Arts | 430 | 363 |
9999 | 84X730 | Bronx Charter School for the Arts | 523 | 453 |
10000 | 84X730 | Bronx Charter School for the Arts | 626 | 541 |
10001 rows × 4 columns