Calculating New Columns
Calculating New Columns#
pandas
makes it easy to calculate new fields based on
existing data. This Notebook looks at the easy cases
and then takes on some more advanced cases using our
schooldemographcs data set.
Topics in this notebook:
calculating columns with vectorization
calculating columns with
apply()
on a serieslambda functions
use
copy()
to make a (shallow) copy of a data frame
from nycschools import schools
df = schools.load_school_demographics()
# with "vectorization"
# ---------------------
# calculate a new fields for the pct of school that is black or hispanic
df["black_hispanic_n"] = df.black_n + df.hispanic_n
# now calculate that as a pct of the total enrollment
df["black_hispanic_pct"] = df.black_hispanic_n / df.total_enrollment
df[ ["dbn","school_name","total_enrollment", "black_hispanic_n", "black_hispanic_pct"] ]
dbn | school_name | total_enrollment | black_hispanic_n | black_hispanic_pct | |
---|---|---|---|---|---|
0 | 01M015 | P.S. 015 Roberto Clemente | 178 | 156 | 0.876404 |
1 | 01M015 | P.S. 015 Roberto Clemente | 190 | 162 | 0.852632 |
2 | 01M015 | P.S. 015 Roberto Clemente | 174 | 143 | 0.821839 |
3 | 01M015 | P.S. 015 Roberto Clemente | 190 | 152 | 0.800000 |
4 | 01M015 | P.S. 015 Roberto Clemente | 193 | 155 | 0.803109 |
... | ... | ... | ... | ... | ... |
9996 | 84X730 | Bronx Charter School for the Arts | 320 | 312 | 0.975000 |
9997 | 84X730 | Bronx Charter School for the Arts | 314 | 308 | 0.980892 |
9998 | 84X730 | Bronx Charter School for the Arts | 430 | 419 | 0.974419 |
9999 | 84X730 | Bronx Charter School for the Arts | 523 | 509 | 0.973231 |
10000 | 84X730 | Bronx Charter School for the Arts | 626 | 609 | 0.972843 |
10001 rows × 5 columns
# we can also use boolean expressions -- let's mark all of the schools not in districts 1-32 as "special_district"
df["special_district"] = df.district > 32
df[["dbn", "district", "special_district"]]
dbn | district | special_district | |
---|---|---|---|
0 | 01M015 | 1 | False |
1 | 01M015 | 1 | False |
2 | 01M015 | 1 | False |
3 | 01M015 | 1 | False |
4 | 01M015 | 1 | False |
... | ... | ... | ... |
9996 | 84X730 | 84 | True |
9997 | 84X730 | 84 | True |
9998 | 84X730 | 84 | True |
9999 | 84X730 | 84 | True |
10000 | 84X730 | 84 | True |
10001 rows × 3 columns
# vectorization is the best way to create cols based on calculations but can't handle more advanced logic
# here we use apply() to format total enrollment to make it easier to read
# we'll call the new field total_enrollment_pp -- pp: pretty print
# create a function that we will "apply" to that columns
def fmt_enroll(n):
return f"{n:,}"
df["total_enrollment_pp"] = df.total_enrollment.apply(fmt_enroll)
big_schools = df.sort_values(by="total_enrollment", ascending=False)[0:20]
big_schools[["dbn","total_enrollment", "total_enrollment_pp"]].head()
dbn | total_enrollment | total_enrollment_pp | |
---|---|---|---|
3722 | 13K430 | 6040 | 6,040 |
3721 | 13K430 | 5937 | 5,937 |
3723 | 13K430 | 5921 | 5,921 |
3720 | 13K430 | 5838 | 5,838 |
3719 | 13K430 | 5682 | 5,682 |
# since our function is so simple, it's a good candidate for a lambda function
# lambdas in python are anonymous functions that have a reduced syntax
# see examples here:
# https://www.freecodecamp.org/news/python-lambda-function-explained/
# make a copy of our data
data = df.copy()
# use lambda to format the percentages
# we use the f-string syntax to round the number to a 2 decimal float
data["black_pct_pp"] = data.black_pct.apply(lambda x: f"{x*100:.02f}%")
data["black_pct_pp"]
0 28.70%
1 27.40%
2 27.60%
3 29.50%
4 27.50%
...
9996 23.75%
9997 20.70%
9998 22.79%
9999 25.05%
10000 26.84%
Name: black_pct_pp, Length: 10001, dtype: object
# we can put the whole thing in a loop, too
# here we replace the original value with the formatted value
data = df.copy()
for c in data.columns:
if c.endswith("_pct"):
data[c] = data[c].apply(lambda x: f"{x*100:.02f}%")
data[["dbn", "asian_pct", "black_pct", "hispanic_pct", "white_pct"]].head()
dbn | asian_pct | black_pct | hispanic_pct | white_pct | |
---|---|---|---|---|---|
0 | 01M015 | 7.90% | 28.70% | 59.00% | 2.20% |
1 | 01M015 | 10.50% | 27.40% | 57.90% | 3.20% |
2 | 01M015 | 13.80% | 27.60% | 54.60% | 3.40% |
3 | 01M015 | 14.20% | 29.50% | 50.50% | 4.70% |
4 | 01M015 | 13.50% | 27.50% | 52.80% | 5.70% |