How to deal with categorical features? And what is one-hot-encoding?

  • Categorical features usually indicate the type of observation. For instance, "Gender" can have two categories such as 'male' and 'female'.

  • There are several methods to handle categorical features, such as one-hot-encoding, target encoding, dummy encoding, hash encoding, etc.

  • One-hot-encoding maps a k-category feature to a set of k binary variables. When the feature falls in the ith category, a “1” value is placed in the ith variable and “0” values for the other variables.

  • In regard to the previous gender example, one-hot-encoding maps 'Gender' to two binary variables. Their values can be taken as follows:

Gender Male Female
variable-1 0 1
variable-2 1 0
  • Python code for one-hot-encoding:

from sklearn.preprocessing import OneHotEncoder

gender_encoder = OneHotEncoder()

gender_one_hot = gender_encoder.fit_transform(gender)

  • An issue of one-hot-encoding is that this representation makes redundancy, which causes multicollinearity in linear regression. With dummy encoding, for a k-category feature, we need only k-1 binary variables.

  • Back to the gender example, we need only one dummy variable \(D_g\), which takes values as follows:

Gender Male Female
\( D_g \) 0 1
  • Python code for dummy-encoding:

import pandas as pd

pd.get_dummies(gender, dummy_na=True, , drop_first=True)