Standardization
May 20, 2023
Standardization is a preprocessing technique used in machine learning to put features on a common scale. It rescales each feature so that it has a mean of 0 and a standard deviation of 1. The technique is also called z-score normalization; the word "normalization" on its own often refers to min-max scaling to a fixed range instead. Standardization is a common preprocessing step for algorithms that are sensitive to feature scale, such as distance-based and gradient-based methods.
Why Is Standardization Important?
The primary purpose of standardization is to bring all the features to a similar scale, so that no feature dominates the model simply because its numeric values are larger. Standardization is particularly important when the features have different units of measurement.
Consider a dataset that contains the height and weight of individuals, with height measured in inches and weight in pounds. If we use such data without standardization, the algorithm may give more importance to weight than to height, because the weight values are numerically much larger than the height values.
Therefore, standardization gives each feature a comparable opportunity to influence the learning model, which often leads to better predictions for scale-sensitive algorithms.
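To make the scale effect concrete, here is a minimal sketch that measures how much each feature contributes to the squared Euclidean distance between two people, before and after dividing by a per-feature standard deviation. The two people and both standard deviations are hypothetical values chosen only for illustration. Centering is skipped because subtracting the mean cancels out when two values are subtracted from each other.

```python
import math

def squared_shares(diffs):
    """Return each feature's fraction of the squared Euclidean distance."""
    total = sum(d * d for d in diffs)
    return [d * d / total for d in diffs]

# Hypothetical people: (height in inches, weight in pounds).
person_a = (70.0, 160.0)
person_b = (65.0, 180.0)

# Hypothetical per-feature standard deviations, chosen only for illustration.
height_sd, weight_sd = 3.0, 25.0

# Raw differences are in mixed units (inches and pounds).
raw_diffs = [person_a[0] - person_b[0], person_a[1] - person_b[1]]
# Dividing by the standard deviation expresses each difference in unitless terms.
scaled_diffs = [raw_diffs[0] / height_sd, raw_diffs[1] / weight_sd]

raw_shares = squared_shares(raw_diffs)
scaled_shares = squared_shares(scaled_diffs)

print(f"raw distance: {math.hypot(*raw_diffs):.2f}")
print(f"raw shares (height, weight): {raw_shares[0]:.2f}, {raw_shares[1]:.2f}")
print(f"scaled shares (height, weight): {scaled_shares[0]:.2f}, {scaled_shares[1]:.2f}")
```

With these assumed numbers, weight accounts for most of the raw distance even though the height difference is the larger one relative to its spread. After scaling, the balance reverses.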
How Does Standardization Work?
Standardization involves two steps:
- Centering: The mean of each feature is calculated and subtracted from every value of that feature. This centers the feature around zero.
- Scaling: The standard deviation of each feature is calculated, and each centered value is divided by it. This gives every feature a standard deviation of 1, so the features have comparable spread, although their minimum and maximum values can still differ.
The formula for standardization is given below:
$$Z = \frac{X - \mu}{\sigma}$$
Where:
- Z is the standardized value
- X is the original value
- μ is the mean of the feature
- σ is the standard deviation of the feature
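The formula can be sketched in a few lines of Python using only the standard library. The function name `standardize` is a hypothetical helper, and the sketch assumes the population standard deviation (`statistics.pstdev`); a constant feature has σ = 0 and cannot be standardized, so that case raises an error.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores for one feature (population standard deviation assumed)."""
    mu = fmean(values)        # mean of the feature
    sigma = pstdev(values)    # population standard deviation of the feature
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    centered = [x - mu for x in values]    # step 1: centering
    return [c / sigma for c in centered]   # step 2: scaling

# Example feature with mean 5 and population standard deviation 2.
z = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

When standardization is used ahead of a model, the mean and standard deviation are usually computed on the training data only and then reused for new data.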
Example of Standardization
Let’s consider an example to demonstrate the process of standardization. We have a dataset containing the height and weight of individuals in inches and pounds, respectively.
| Height (in) | Weight (lb) |
|:-----------:|:-----------:|
| 70 | 160 |
| 65 | 120 |
| 68 | 140 |
| 74 | 180 |
First, we calculate the mean and standard deviation of each feature. Here we use the population standard deviation, which divides by the number of values (4). Using the sample standard deviation, which divides by 3, would give slightly smaller z-scores.
Mean Height = (70 + 65 + 68 + 74) / 4 = 69.25
Mean Weight = (160 + 120 + 140 + 180) / 4 = 150
Standard Deviation Height = sqrt((0.75² + (-4.25)² + (-1.25)² + 4.75²) / 4) = sqrt(10.6875) ≈ 3.27
Standard Deviation Weight = sqrt((10² + (-30)² + (-10)² + 30²) / 4) = sqrt(500) ≈ 22.36
Now, we can use the formula to standardize the data:
Z Height = (Height - Mean Height) / Standard Deviation Height
Z Weight = (Weight - Mean Weight) / Standard Deviation Weight
The standardized data, rounded to two decimal places, is as follows:
| Z Height | Z Weight |
|:--------:|:--------:|
| 0.23 | 0.45 |
| -1.30 | -1.34 |
| -0.38 | -0.45 |
| 1.45 | 1.34 |
Now the data is standardized, and we can use this data for further analysis.
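The worked example can be recomputed with a short script. This is a sketch that assumes the population standard deviation and works column by column over the four height and weight rows from the table. The variable names are illustrative only.

```python
from statistics import fmean, pstdev

# (height in inches, weight in pounds) for the four individuals in the example.
rows = [(70, 160), (65, 120), (68, 140), (74, 180)]

# Transpose rows into columns, then compute (mean, population std) per feature.
columns = list(zip(*rows))
stats = [(fmean(col), pstdev(col)) for col in columns]

# Apply Z = (X - mean) / std to every value, feature by feature.
z_rows = [
    tuple((x - mu) / sigma for x, (mu, sigma) in zip(row, stats))
    for row in rows
]

for row in z_rows:
    print("  ".join(f"{z:+.2f}" for z in row))

# Sanity check: each standardized column has mean 0 and standard deviation 1.
for col in zip(*z_rows):
    assert abs(fmean(col)) < 1e-9
    assert abs(pstdev(col) - 1.0) < 1e-9
```

The final loop is a quick way to confirm that a standardization step did what it should: every standardized column has a mean of 0 and a standard deviation of 1, up to floating-point error.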