Most machine learning algorithms work on numbers, so categorical features and labels usually have to be converted into a numerical format before a model can process them. Two common techniques for this transformation are one-hot encoding and ordinal encoding. Understanding when and how to use each method is crucial for preparing data. In this post, we will get an overview of both encoding methods and explain how they work.
Let’s begin!
Overview
One-Hot Encoding
One-hot encoding is a common technique for converting categorical data into a numerical format that machine learning algorithms can use.
How Does It Work?
Each category is represented by a binary vector.
Suppose you have a feature with N unique categories. One-hot encoding creates N binary features, one for each category.
Each row in the dataset will have exactly one of these features set to 1 (indicating the presence of that category) and the rest set to 0.
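The rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `one_hot_encode` and the sample values are made up for this example, and the categories are sorted alphabetically only to make the column order predictable.

```python
def one_hot_encode(values):
    """Return (categories, rows), where each row is a binary vector."""
    # The N unique categories become the N binary features.
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)   # start with all zeros
        row[position[value]] = 1      # mark the one category that is present
        rows.append(row)
    return categories, rows


# Hypothetical sample data with N = 3 unique categories.
animals = ["cat", "dog", "bird", "dog"]
categories, rows = one_hot_encode(animals)

print(categories)
for animal, row in zip(animals, rows):
    print(animal, row)
```

Every row contains a single 1, and the number of columns equals the number of unique categories.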
Example 1
Consider a dataset with a qualitative feature “Color” that has three possible values: “Red,” “Blue,” and “Green.”
Car ID | Color |
1 | Red |
2 | Blue |
3 | Green |
4 | Blue |
5 | Red |
Here is the one-hot encoded data:
Car ID | Color_Red | Color_Blue | Color_Green |
1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 |
3 | 0 | 0 | 1 |
4 | 0 | 1 | 0 |
5 | 1 | 0 | 0 |
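In practice this table is usually produced by a library rather than by hand. The sketch below assumes pandas is installed and uses `pandas.get_dummies`; the variable names are chosen for this example. Note that pandas orders the generated columns alphabetically by category, so they appear as Blue, Green, Red rather than in the order shown in the table.

```python
import pandas as pd

# The "Color" dataset from Example 1.
cars = pd.DataFrame({
    "Car ID": [1, 2, 3, 4, 5],
    "Color": ["Red", "Blue", "Green", "Blue", "Red"],
})

# Replace "Color" with one binary column per category.
# dtype=int gives 0/1 values instead of True/False.
encoded = pd.get_dummies(cars, columns=["Color"], dtype=int)

print(encoded)
```

The values match the table above: for instance, the `Color_Red` column is 1 for cars 1 and 5 and 0 for the others.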
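Example 2
Now consider a feature “Education Level” that has four possible values: “High School,” “Bachelor’s,” “Master’s,” and “PhD.”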
Person ID | Education Level |
1 | High School |
2 | Bachelor’s |
3 | Master’s |
4 | PhD |
Here is the one-hot encoded data:
Person ID | High School | Bachelor’s | Master’s | PhD |
1 | 1 | 0 | 0 | 0 |
2 | 0 | 1 | 0 | 0 |
3 | 0 | 0 | 1 | 0 |
4 | 0 | 0 | 0 | 1 |
Side Note: In classification problems, one-hot encoding is often applied to the labels as well as the features. Each class label is represented as a binary vector, which makes it easier for models to process and learn from the data.
Ordinal Encoding
Ordinal encoding is a method used to convert ordinal (ordered categorical) features into numerical values while preserving the inherent order of the categories. Unlike one-hot encoding, which creates a binary column for each category, ordinal encoding assigns each category a single integer based on its position in that order.
How Does It Work?
In ordinal encoding, each category is mapped to an integer that reflects its rank or position in the sequence. This encoding is useful when the categorical data has a meaningful order, but the intervals between the values are not necessarily equal.
Let’s revisit Example 2 from the one-hot encoding section and see how ordinal encoding handles the same data.
Person ID | Education Level |
1 | High School |
2 | Bachelor’s |
3 | Master’s |
4 | PhD |
Ordinal Encoding Steps:
- Identify the Order:
  - High School < Bachelor’s < Master’s < PhD
- Assign Integer Values:
  - High School = 1
  - Bachelor’s = 2
  - Master’s = 3
  - PhD = 4
Here is the ordinal encoded data:
Person ID | Education Level (Encoded) |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 4 |
In this encoded data, the numerical values reflect the inherent order of the education levels.
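The two steps above (identify the order, then assign integers) can be sketched in plain Python. This is a minimal illustration under the assumption that the order is supplied by hand; the variable names are chosen for this example.

```python
# Step 1: state the order explicitly, from lowest to highest.
education_order = ["High School", "Bachelor's", "Master's", "PhD"]

# Step 2: assign integers 1..N following that order.
rank = {level: position for position, level in enumerate(education_order, start=1)}

# The dataset from the example, keyed by Person ID.
people = {1: "High School", 2: "Bachelor's", 3: "Master's", 4: "PhD"}

# Replace each education level with its rank.
encoded = {person_id: rank[level] for person_id, level in people.items()}

for person_id, value in encoded.items():
    print(person_id, value)
```

Because the order is written out explicitly, the integers follow the real ranking of the categories instead of an accidental one such as alphabetical order.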
When to Use Ordinal Encoding
Ordinal encoding is suitable for ordinal features where the order matters, but the exact differences between the categories are not known or are not equal. Examples include:
- Education levels (High School, Bachelor’s, Master’s, PhD)
- Customer satisfaction ratings (Poor, Fair, Good, Excellent)
- T-shirt sizes (Small, Medium, Large, Extra Large)
When to Use Which?
- One-Hot Encoding: Use it when the categorical variable does not have an inherent order. For example, if the variable is “Color” (Red, Green, Blue), there is no natural order.
- Ordinal Encoding: Use it when the categorical variable has a clear, meaningful order. For example, if the variable is “Size” (Small, Medium, Large), there is a natural progression from Small to Large.
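The decision rule above can be expressed as a small helper. This is only a sketch: the function `encode` and its `order` parameter are hypothetical names introduced for this example. If an order is supplied, the feature is treated as ordinal; otherwise it is one-hot encoded.

```python
def encode(values, order=None):
    """Ordinal-encode when an order is given, otherwise one-hot encode."""
    if order is not None:
        # The variable has a meaningful order: map each category to its rank.
        rank = {category: i + 1 for i, category in enumerate(order)}
        return [rank[value] for value in values]

    # No inherent order: one binary column per category (sorted alphabetically).
    categories = sorted(set(values))
    return [[1 if value == category else 0 for category in categories]
            for value in values]


# "Size" has a natural progression, so an order is passed in.
sizes = encode(["Small", "Large", "Medium"], order=["Small", "Medium", "Large"])

# "Color" has no natural order, so it is one-hot encoded.
colors = encode(["Red", "Green", "Blue"])

print(sizes)
print(colors)
```

With ordinal codes, comparisons such as "Small is less than Large" carry over to the numbers. With one-hot vectors, no category is ranked above another.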
Summing Up
Choosing the appropriate encoding method depends on the nature of the categorical data and the specific requirements of the machine learning algorithm being used.