Encoding Techniques: The Role of Encoding in Preprocessing Categorical Data for Machine Learning

Thawuship
3 min read · Mar 18, 2023


Most machine learning algorithms require numerical input, so categorical data, which represents groups or labels, cannot be used in its original form. Encoding techniques convert categorical data into a numerical form that models can process. A suitable encoding lets a model use the categories, and any order among them, when making predictions.

Nominal encoding and ordinal encoding are two common methods used for encoding categorical data. Nominal encoding assigns a unique numerical value to each category, whereas ordinal encoding assigns numbers based on their order or rank.

  1. Nominal Encoding -> Ex: Gender (Female, Male), Country(USA, UK, Australia)
  2. Ordinal Encoding -> Ex: Education(BSc, BTech, MSc, MTech, PhD)

1. Nominal Encoding

Nominal encoding is a common technique for converting categorical data into numerical form. It assigns a unique numerical value to each category. The values only tell categories apart; they do not express any order or other relationship between them.

  1. One-Hot Encoding

This technique is used to represent categorical variables as binary vectors. Each category is represented by a vector of binary values with a length equal to the number of categories, where all elements are zero except for the index corresponding to the category, which is one.

Code:

from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd

# create some example categorical data
categories = np.array(['red', 'green', 'blue', 'green', 'red'])

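# create the one hot encoder with dense output
# (scikit-learn >= 1.2 uses sparse_output; older versions used sparse=False)
encoder = OneHotEncoder(sparse_output=False)

# fit the encoder to the data and transform it
one_hot_encoded = encoder.fit_transform(categories.reshape(-1, 1))

# Print the result
# get_feature_names_out() replaces the removed get_feature_names()
print("One-Hot Encoded Data with Feature Names:")
print(pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out()))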

Output:

One-Hot Encoded Data with Feature Names:
   x0_blue  x0_green  x0_red
0      0.0       0.0     1.0
1      0.0       1.0     0.0
2      1.0       0.0     0.0
3      0.0       1.0     0.0
4      0.0       0.0     1.0

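Dummy variable trap: the one-hot columns always sum to one, so any one column can be inferred from the others. For models that are sensitive to this redundancy, such as linear regression, delete one column.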

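  2. One-hot encoding with many categories

When a dataset has categorical variables with a large number of distinct categories, one-hot encoding becomes inefficient because it adds one column per category. A common workaround is to one-hot encode only the most frequently occurring categories and leave the rare ones as all zeros. The code below shows the related case of encoding several categorical columns at once.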

Code:

from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np

# Define the data array
data = [['red', 'large'], ['green', 'small'], ['blue', 'medium'], ['green', 'medium'], ['red', 'small']]

# Convert the data to a numpy array
data_array = np.array(data)

# Create a one-hot encoder object and fit it to the data
encoder = OneHotEncoder()
encoder.fit(data_array)

# Transform the data using the encoder
one_hot_encoded = encoder.transform(data_array)

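# Print the result (the encoder returns a sparse matrix, so convert it first)
# get_feature_names_out() replaces the removed get_feature_names()
print("One-Hot Encoded Data with Feature Names:")
print(pd.DataFrame(one_hot_encoded.toarray(), columns=encoder.get_feature_names_out()))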

Output:

One-Hot Encoded Data with Feature Names:
   x0_blue  x0_green  x0_red  x1_large  x1_medium  x1_small
0      0.0       0.0     1.0       1.0        0.0       0.0
1      0.0       1.0     0.0       0.0        0.0       1.0
2      1.0       0.0     0.0       0.0        1.0       0.0
3      0.0       1.0     0.0       0.0        1.0       0.0
4      0.0       0.0     1.0       0.0        0.0       1.0
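The example above encodes every category in both columns. Restricting the encoding to the most frequent categories could look roughly like the sketch below. It uses plain pandas; the cities column and the top_k cutoff are made-up values for illustration, not part of the original example.

```python
import pandas as pd

# hypothetical column with a few frequent and several rare categories
cities = pd.Series(
    ["London", "Paris", "London", "Tokyo", "Paris", "London", "Oslo", "Lima"],
    name="city",
)

top_k = 2  # assumed cutoff; tune it for the dataset at hand

# value_counts() sorts by frequency, so the first top_k labels are the most common
frequent = cities.value_counts().head(top_k).index.tolist()

# one binary column per frequent category; rare categories become all zeros
encoded = pd.DataFrame(
    {f"city_{label}": (cities == label).astype(int) for label in frequent}
)
print(encoded)
```

The trade-off is that all rare categories collapse into the same all-zero row, so the model can no longer tell them apart.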

2. Ordinal Encoding

Ordinal encoding is a technique used to convert categorical data into numerical form by assigning numbers based on the order or rank of the categories. It captures the meaningful relationship between categories, as the assigned numerical values have a natural order that reflects the relationship between the categories. This makes ordinal encoding useful for machine learning models that rely on the order or rank of the categories.

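  1. Label Encoding

Label encoding maps each category to an integer. The code below uses scikit-learn's OrdinalEncoder, which by default numbers the categories in sorted (alphabetical) order. For this data the sorted order happens to match the order in which the degrees are listed.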

Code:

from sklearn.preprocessing import OrdinalEncoder

# create sample data
data = [['BSc'],['BTech'],['MSc'],['MTech'],['PhD']]

# create an instance of the encoder
encoder = OrdinalEncoder()

# fit and transform the data
encoded_data = encoder.fit_transform(data)

# print the encoded data
print(encoded_data)

Output:

[[0.]
 [1.]
 [2.]
 [3.]
 [4.]]
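The degree names above come out in the right order only because they sort alphabetically that way. When the rank differs from alphabetical order, the order can be passed through the categories parameter of OrdinalEncoder. The sketch below assumes a size feature ranked small < medium < large, borrowing the size labels from the earlier one-hot example.

```python
from sklearn.preprocessing import OrdinalEncoder

# sample data whose natural rank (small < medium < large) is not alphabetical
data = [['medium'], ['small'], ['large'], ['small']]

# default behaviour: categories are sorted alphabetically (large, medium, small)
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(data).ravel().tolist()

# explicit order: one list of categories per input column, lowest rank first
ordered_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered_codes = ordered_encoder.fit_transform(data).ravel().tolist()

print(default_codes)  # [1.0, 2.0, 0.0, 2.0]
print(ordered_codes)  # [1.0, 0.0, 2.0, 0.0]
```

With the default, 'large' receives the smallest code, which reverses the intended ranking; the explicit list restores it.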

Conclusion:

Ordinal encoding is not suitable for nominal variables with no natural order, because the assigned integers suggest a ranking that does not exist and can mislead a model. It remains a popular preprocessing technique for categorical data when the order or rank of the categories is relevant to the problem.

Overall, encoding techniques are essential for handling categorical data in machine learning, and choosing the appropriate encoding method depends on the specific requirements of the problem at hand.

Feel free to drop your comments.
