Calculating New Columns#

pandas makes it easy to calculate new fields based on existing data. This Notebook looks at the easy cases and then takes on some more advanced cases using our schooldemographcs data set.

Topics in this notebook:

  • calculating columns with vectorization

  • calculating columns with apply() on a series

  • lambda functions

  • use copy() to make a (shallow) copy of a data frame

from nycschools import schools

df = schools.load_school_demographics()
# with "vectorization"
# ---------------------

# calculate a new fields for the pct of school that is black or hispanic
df["black_hispanic_n"] = df.black_n + df.hispanic_n

# now calculate that as a pct of the total enrollment
df["black_hispanic_pct"] = df.black_hispanic_n / df.total_enrollment

df[ ["dbn","school_name","total_enrollment", "black_hispanic_n", "black_hispanic_pct"] ]
dbn school_name total_enrollment black_hispanic_n black_hispanic_pct
0 01M015 P.S. 015 Roberto Clemente 178 156 0.876404
1 01M015 P.S. 015 Roberto Clemente 190 162 0.852632
2 01M015 P.S. 015 Roberto Clemente 174 143 0.821839
3 01M015 P.S. 015 Roberto Clemente 190 152 0.800000
4 01M015 P.S. 015 Roberto Clemente 193 155 0.803109
... ... ... ... ... ...
9996 84X730 Bronx Charter School for the Arts 320 312 0.975000
9997 84X730 Bronx Charter School for the Arts 314 308 0.980892
9998 84X730 Bronx Charter School for the Arts 430 419 0.974419
9999 84X730 Bronx Charter School for the Arts 523 509 0.973231
10000 84X730 Bronx Charter School for the Arts 626 609 0.972843

10001 rows × 5 columns

# we can also use boolean expressions -- let's mark all of the schools not in districts 1-32 as "special_district"
df["special_district"] = df.district > 32
df[["dbn", "district", "special_district"]]
dbn district special_district
0 01M015 1 False
1 01M015 1 False
2 01M015 1 False
3 01M015 1 False
4 01M015 1 False
... ... ... ...
9996 84X730 84 True
9997 84X730 84 True
9998 84X730 84 True
9999 84X730 84 True
10000 84X730 84 True

10001 rows × 3 columns

# vectorization is the best way to create cols based on calculations but can't handle more advanced logic
# here we use apply() to format total enrollment to make it easier to read
# we'll call the new field total_enrollment_pp -- pp: pretty print

# create a function that we will "apply" to that columns
def fmt_enroll(n):
    return f"{n:,}"

df["total_enrollment_pp"] = df.total_enrollment.apply(fmt_enroll)
big_schools = df.sort_values(by="total_enrollment", ascending=False)[0:20]
big_schools[["dbn","total_enrollment", "total_enrollment_pp"]].head()
    
dbn total_enrollment total_enrollment_pp
3722 13K430 6040 6,040
3721 13K430 5937 5,937
3723 13K430 5921 5,921
3720 13K430 5838 5,838
3719 13K430 5682 5,682
# since our function is so simple, it's a good candidate for a lambda function
# lambdas in python are anonymous functions that have a reduced syntax
# see examples here:
# https://www.freecodecamp.org/news/python-lambda-function-explained/

# make a copy of our data
data = df.copy()

# use lambda to format the percentages
# we use the f-string syntax to round the number to a 2 decimal float
data["black_pct_pp"] = data.black_pct.apply(lambda x: f"{x*100:.02f}%")
data["black_pct_pp"]
0        28.70%
1        27.40%
2        27.60%
3        29.50%
4        27.50%
          ...  
9996     23.75%
9997     20.70%
9998     22.79%
9999     25.05%
10000    26.84%
Name: black_pct_pp, Length: 10001, dtype: object
# we can put the whole thing in a loop, too
# here we replace the original value with the formatted value
data = df.copy()
for c in data.columns:
    if c.endswith("_pct"):
        data[c] = data[c].apply(lambda x: f"{x*100:.02f}%")


data[["dbn", "asian_pct", "black_pct", "hispanic_pct", "white_pct"]].head()
dbn asian_pct black_pct hispanic_pct white_pct
0 01M015 7.90% 28.70% 59.00% 2.20%
1 01M015 10.50% 27.40% 57.90% 3.20%
2 01M015 13.80% 27.60% 54.60% 3.40%
3 01M015 14.20% 29.50% 50.50% 4.70%
4 01M015 13.50% 27.50% 52.80% 5.70%