How to deal with categorical features? And what is one-hot-encoding?
-
Categorical features take values from a discrete, usually unordered set of categories rather than from a numeric range. For instance, "Gender" can have two categories such as 'male' and 'female'.
-
There are several methods to handle categorical features, such as one-hot-encoding, target encoding, dummy encoding, hash encoding, etc.
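- Target encoding, listed above, replaces each category with a statistic of the target variable, typically the per-category mean of the target. A minimal sketch, assuming a pandas DataFrame `df` with a categorical column 'Gender' and a numeric column 'target' (both names are hypothetical):
import pandas as pd
# Hypothetical toy data
df = pd.DataFrame({'Gender': ['male', 'female', 'female', 'male'],
                   'target': [1, 0, 1, 1]})
# Replace each category with the mean target value observed for that category
category_means = df.groupby('Gender')['target'].mean()
df['Gender_encoded'] = df['Gender'].map(category_means)
# Note: compute these statistics on training data only (e.g. within CV folds) to avoid target leakage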
-
One-hot-encoding maps a k-category feature to a set of k binary variables. When the feature falls in the i-th category, the i-th variable is set to 1 and all the other variables are set to 0.
-
Returning to the gender example, one-hot-encoding maps 'Gender' to two binary variables, whose values are taken as follows:
Gender | Male | Female |
---|---|---|
variable-1 (female indicator) | 0 | 1 |
variable-2 (male indicator) | 1 | 0 |
- Python code for one-hot-encoding:
from sklearn.preprocessing import OneHotEncoder
gender_encoder = OneHotEncoder()                       # returns a sparse matrix by default
gender_one_hot = gender_encoder.fit_transform(gender)  # `gender` must be 2D, e.g. a single-column DataFrame
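- A self-contained usage sketch for the code above (the data and column name are hypothetical):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
gender = pd.DataFrame({'Gender': ['male', 'female', 'female', 'male']})  # hypothetical data
gender_encoder = OneHotEncoder(sparse_output=False)  # sparse_output needs scikit-learn >= 1.2; older versions use sparse=False
gender_one_hot = gender_encoder.fit_transform(gender)
print(gender_encoder.categories_)  # [array(['female', 'male'], dtype=object)]
print(gender_one_hot)              # 'male' -> [0., 1.], 'female' -> [1., 0.]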
-
One issue with one-hot-encoding is that the representation is redundant: the k binary variables always sum to 1, which duplicates the intercept column and causes multicollinearity in linear regression (the "dummy variable trap"). With dummy encoding, a k-category feature needs only k-1 binary variables.
-
Back to the gender example, we need only one dummy variable \(D_g\), which takes values as follows:
Gender | Male | Female |
---|---|---|
\( D_g \) | 0 | 1 |
- Python code for dummy-encoding:
import pandas as pd
# drop_first=True keeps k-1 columns; dummy_na=True adds an extra indicator for missing values
gender_dummies = pd.get_dummies(gender, dummy_na=True, drop_first=True)
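- The same k-1 representation can also be obtained in scikit-learn by dropping one category; a sketch reusing the hypothetical `gender` DataFrame from above:
from sklearn.preprocessing import OneHotEncoder
dummy_encoder = OneHotEncoder(drop='first', sparse_output=False)  # drop='first' keeps k-1 columns per feature
gender_dummies = dummy_encoder.fit_transform(gender)              # one column: 1 for 'male', 0 for 'female'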