Tidying Dataframe Column Names

A simple script to tidying up messing column names in pandas dataframe
pandas
TIL
Author

Arif Qodari

Published

August 1, 2022

Let’s use this sample data.

Code
df = pd.read_csv("https://gist.githubusercontent.com/arifqodari/965b854469b279df026e305864a944c0/raw/43984c8596373d64e602717ed6c3313393c85b2e/flight_dummy.csv")
df
flightNumber Airline Name departure date_time
0 QZ-274 AirAsia Indonesia 2023-11-22 20:30
1 QG-526 Citilink 2023-11-22 16:20
2 QG-522 Citilink 2023-11-22 06:00

Next, let’s define our method tidy_columns containing series of functions to tidy up the column names. Then apply the function to the dataframe above.

def camel_to_snake(str_):
    str_ = re.sub("(.)([A-Z][a-z]+)", r"\1 \2", str_)
    return re.sub("([a-z0-9])([A-Z])", r"\1 \2", str_).lower()

def whitespace_to_underscore(str_):
    return re.sub("\s+", "_", str_)

def tidy_columns(df):
    df.columns = (
        df.columns
        .str.strip()  # strip text
        .map(camel_to_snake)  # change to snake case
        .map(whitespace_to_underscore)  # change any whitespace to underscore
        .str.lower()  # lowercase
        .str.replace("(", "")
        .str.replace(")", "")
    )
    return df


tidy_df = tidy_columns(df)
tidy_df
flight_number airline_name departure_date_time
0 QZ-274 AirAsia Indonesia 2023-11-22 20:30
1 QG-526 Citilink 2023-11-22 16:20
2 QG-522 Citilink 2023-11-22 06:00

Neat! The simple method above tidy_columns do the following:

If you think that it needs additional treatment, you can simply create another function that accept a string and return a string and put another .map(your_function) in the tidy function above.