Explained Variance

April 28, 2023

In statistics and machine learning, explained variance is a measure that determines how well a statistical model, such as a regression model, accounts for the variability in a dataset. It represents the proportion of the total variance in the data that is explained by the model. The concept of explained variance is crucial in evaluating the performance of machine learning models and in selecting the best model for a given dataset.

Definition

Explained variance quantifies the proportion of the total variance in a target variable that a statistical model accounts for. Because it is defined relative to the total variance, the two quantities are introduced together below.

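The Total Variance formula calculates the sum of the squared differences between each observed value (y_i) and the mean of the observed values (ȳ). It measures the dispersion or spread of the observed values. Strictly speaking this is the total sum of squares; dividing by n (or n - 1) gives the variance proper, but that constant cancels in the ratio used later. Mathematically, it can be expressed as:

$$\text{Total Variance} = \sum_{i=1}^{n}(y_i-\bar{y})^2$$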

The Explained Variance formula calculates the sum of the squared differences between the predicted values (ŷ_i) obtained from a statistical model and the mean of the actual dataset (ȳ). It measures how much of the total variance is accounted for by the model. It can be expressed as:

$$\text{Explained Variance} = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2$$

For a least-squares model fitted with an intercept, the difference between the total variance and the explained variance is the unexplained variance, which is the portion of the total variance that the model does not account for. It is the sum of the squared differences between the actual values (y_i) and the predicted values (ŷ_i) obtained from the model:

$$\text{Unexplained Variance} = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$

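The explained variance can also be expressed as a fraction of the total variance, known as the coefficient of determination (R²). It can be computed as:

$$R^2 = \frac{\text{Explained Variance}}{\text{Total Variance}} = 1 - \frac{\text{Unexplained Variance}}{\text{Total Variance}}$$

For a least-squares model with an intercept, evaluated on the data it was fitted to, the two forms agree and the coefficient of determination ranges from 0 to 1, with a value of 1 indicating that the model explains all the variability in the data. In other settings, such as evaluation on new data, the second form is the one normally used, and it can be negative when the model predicts worse than simply using the mean.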
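The three sums of squares and their ratio can be checked numerically. The following is a minimal sketch on made-up toy data (the values of `x` and `y` are assumptions chosen only for illustration), using `np.polyfit` to fit a straight line by least squares:

```python
import numpy as np

# Hypothetical toy data, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a straight line by ordinary least squares (with an intercept)
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

total = np.sum((y - y.mean()) ** 2)          # total sum of squares
explained = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
unexplained = np.sum((y - y_hat) ** 2)       # unexplained (residual) sum of squares

r2 = explained / total
print(f'Total:       {total:.4f}')
print(f'Explained:   {explained:.4f}')
print(f'Unexplained: {unexplained:.4f}')
print(f'R^2:         {r2:.4f}')
```

For this kind of fit the explained and unexplained parts add up to the total, so `explained / total` and `1 - unexplained / total` give the same number.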

Importance

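Explained variance is a crucial measure in evaluating the performance of statistical models, such as regression models, and in selecting the best model for a given dataset. A high explained variance on the training data indicates that the model fits that data well. It does not by itself guarantee accurate predictions on new data, because a sufficiently flexible model can fit noise; explained variance measured on held-out data is the better guide to predictive performance.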

For example, suppose we have a dataset of house prices with features such as the number of bedrooms, the size of the house, and the location. We want to build a regression model that predicts the price of a house given its features. We can train several regression models of different complexity on the dataset, such as linear regression, polynomial regression, and decision tree regression. We can then evaluate each model by computing its explained variance, ideally on data that was not used for training, and select the model with the highest value.
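A comparison of this kind might look like the sketch below. It is not a definitive recipe: the data is synthetic (the sizes, prices, and noise level are assumptions standing in for a real house-price dataset), and 5-fold cross-validation via scikit-learn's `cross_val_score` is just one way to estimate explained variance on unseen data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a house-price dataset (hypothetical numbers)
rng = np.random.default_rng(0)
size = rng.uniform(1000, 2500, size=60)
price = 100_000 + 110 * size + rng.normal(0, 30_000, size=60)
X = size.reshape(-1, 1)

candidates = {
    'linear': LinearRegression(),
    'tree': DecisionTreeRegressor(random_state=0),
}

results = {}
for name, model in candidates.items():
    # R^2 on the same data the model was fitted to
    train_r2 = model.fit(X, price).score(X, price)
    # Mean R^2 over 5 held-out folds
    cv_r2 = cross_val_score(model, X, price, cv=5, scoring='r2').mean()
    results[name] = (train_r2, cv_r2)
    print(f'{name:<6} train R^2 = {train_r2:.4f}   cross-validated R^2 = {cv_r2:.4f}')
```

A fully grown decision tree reproduces its training data almost exactly, so its training score is close to 1 regardless of how well it generalizes. The cross-validated column is the one to compare when choosing between models.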

Explained variance is also useful in feature selection, which is the process of selecting the most relevant features in a dataset for a given task. Features that add little explained variance beyond what the other features already provide are candidates for removal, which reduces the complexity of the model and can improve its performance on new data.

Examples

Example 1: Linear regression

Consider the following dataset of house prices with one feature, the size of the house:

| Size (sqft) | Price ($) |
|-------------|-----------|
| 1400        | 245,000   |
| 1600        | 312,000   |
| 1700        | 279,000   |
| 1875        | 308,000   |
| 1100        | 199,000   |
| 1550        | 219,000   |
| 2350        | 405,000   |
| 2450        | 324,000   |
| 1425        | 319,000   |
| 1700        | 255,000   |

We want to build a linear regression model that predicts the price of a house given its size. We can train the model on the dataset and compute the explained variance:

import numpy as np
from sklearn.linear_model import LinearRegression

# Load the data
X = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]).reshape(-1, 1)
y = np.array([245000, 312000, 279000, 308000, 199000, 219000, 405000, 324000, 319000, 255000])

# Train the linear regression model
model = LinearRegression().fit(X, y)

# Compute the explained variance (score() returns R^2, the coefficient of determination)
explained_variance = model.score(X, y)
print(f'Explained variance: {explained_variance:.4f}')

Output:

Explained variance: 0.5808

The explained variance of the linear regression model is 0.5808, which indicates that the model explains 58.08% of the variability in house prices in this dataset.
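The value returned by `model.score` can be cross-checked against the definition, and against scikit-learn's separate `explained_variance_score` metric. The sketch below reuses the data from this example. The constant shift of 20,000 applied to the predictions is a hypothetical bias introduced only to show where the two metrics differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score, r2_score

# Same house-price data as in Example 1
X = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]).reshape(-1, 1)
y = np.array([245000, 312000, 279000, 308000, 199000, 219000, 405000, 324000, 319000, 255000])

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Ratio of explained to total sum of squares, computed by hand
manual_r2 = np.sum((y_pred - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
print(f'Manual ratio: {manual_r2:.4f}')
print(f'model.score:  {model.score(X, y):.4f}')

# Shift every prediction by a constant to mimic a biased model (hypothetical)
y_biased = y_pred + 20000
print(f'R^2 of biased predictions:                {r2_score(y, y_biased):.4f}')
print(f'Explained variance of biased predictions: {explained_variance_score(y, y_biased):.4f}')
```

For a least-squares fit with an intercept the residuals average to zero, so the two metrics coincide. `explained_variance_score` ignores a constant offset in the predictions, whereas `r2_score` penalizes it, which is why the two numbers diverge for the shifted predictions.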

Example 2: Polynomial regression

Consider the same dataset of house prices as in Example 1. We want to build a polynomial regression model that predicts the price of a house given its size. We can train the model on the dataset and compute the explained variance:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Load the data
X = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]).reshape(-1, 1)
y = np.array([245000, 312000, 279000, 308000, 199000, 219000, 405000, 324000, 319000, 255000])

# Transform the features to polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Train the polynomial regression model
model = LinearRegression().fit(X_poly, y)

# Compute the explained variance
explained_variance = model.score(X_poly, y)
print(f'Explained variance: {explained_variance:.4f}')

Output:

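Explained variance: 0.5864

The explained variance of the polynomial regression model is 0.5864, which indicates that the model explains 58.64% of the variability in the data. This is only marginally higher than a straight-line fit to the same data. Adding a squared term to a least-squares model can never lower the explained variance on the training data, so a gain this small is weak evidence that the relationship between size and price is nonlinear.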
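One common way to account for the extra term is the adjusted R², which penalizes R² for the number of predictors. The sketch below is an illustration rather than part of the original example: the helper `adjusted_r2` is a name introduced here, and it implements the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1) for n samples and p predictors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Same house-price data as in Examples 1 and 2
X = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]).reshape(-1, 1)
y = np.array([245000, 312000, 279000, 308000, 199000, 219000, 405000, 324000, 319000, 255000])


def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2 (hypothetical helper): penalizes R^2 for each predictor used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


scores = {}
for degree in (1, 2):
    # include_bias=False: LinearRegression already fits the intercept
    X_poly = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(X)
    r2 = LinearRegression().fit(X_poly, y).score(X_poly, y)
    scores[degree] = (r2, adjusted_r2(r2, len(y), X_poly.shape[1]))
    print(f'degree {degree}: R^2 = {scores[degree][0]:.4f}, adjusted R^2 = {scores[degree][1]:.4f}')
```

With only ten observations the penalty is substantial, so the adjusted value can fall for the quadratic model even though its plain R² is at least as high as that of the linear model.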

Example 3: Feature selection

Consider the following dataset of cars with features such as the weight, the horsepower, and the acceleration:

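| Weight (lbs) | Horsepower | Acceleration (sec) | MPG |
|--------------|------------|--------------------|-----|
| 3504         | 130        | 12                 | 18  |
| 3693         | 165        | 11.5               | 15  |
| 3436         | 150        | 11                 | 18  |
| 3433         | 150        | 12                 | 16  |
| 3449         | 140        | 10.5               | 17  |
| 4341         | 198        | 10                 | 15  |
| 4354         | 220        | 9                  | 14  |
| 4312         | 215        | 8.5                | 14  |
| 4425         | 225        | 10                 | 14  |
| 3850         | 190        | 8                  | 15  |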
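We want to build a regression model that predicts the fuel efficiency in miles per gallon (MPG) given the weight, horsepower, and acceleration of a car. We can train several regression models on the dataset, each with a different combination of features, and compute the explained variance for each model. Comparing the values shows how much each feature contributes. On training data, a least-squares model with more features always scores at least as high as one using a subset of them, so the choice of model should weigh explained variance against complexity rather than simply take the highest value.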

import numpy as np
from sklearn.linear_model import LinearRegression

# Load the data
X = np.array([[3504, 130, 12], [3693, 165, 11.5], [3436, 150, 11], [3433, 150, 12], [3449, 140, 10.5],
              [4341, 198, 10], [4354, 220, 9], [4312, 215, 8.5], [4425, 225, 10], [3850, 190, 8]])
y = np.array([18, 15, 18, 16, 17, 15, 14, 14, 14, 15])

# Train the linear regression model with all features
model_all = LinearRegression().fit(X, y)

# Compute the explained variance with all features
explained_variance_all = model_all.score(X, y)
print(f'Explained variance with all features: {explained_variance_all:.4f}')

# Train the linear regression model with only weight and horsepower features
model_wh = LinearRegression().fit(X[:, [0, 1]], y)

# Compute the explained variance with only weight and horsepower features
explained_variance_wh = model_wh.score(X[:, [0, 1]], y)
print(f'Explained variance with weight and horsepower features: {explained_variance_wh:.4f}')

# Train the linear regression model with only acceleration feature
model_a = LinearRegression().fit(X[:, [2]], y)

# Compute the explained variance with only acceleration feature
explained_variance_a = model_a.score(X[:, [2]], y)
print(f'Explained variance with acceleration feature: {explained_variance_a:.4f}')

Output:

Explained variance with all features: 0.8419
Explained variance with weight and horsepower features: 0.8351
Explained variance with acceleration feature: 0.4163

The linear regression model with all features has an explained variance of 0.8419, indicating that it explains 84.19% of the variability in the data. The model with only weight and horsepower features has a slightly lower explained variance of 0.8351 (83.51%), so dropping acceleration costs less than one percentage point. The model with only the acceleration feature has a much lower explained variance of 0.4163 (41.63%). Acceleration is therefore informative on its own, but it adds little once weight and horsepower are in the model, which suggests that it is largely redundant with them. Because explained variance on the training data cannot decrease when a feature is added to a least-squares model, the full model will always score highest in this comparison; the useful signal is how small the gain is. On that basis, the acceleration feature is a reasonable candidate for removal, which simplifies the model at little cost in fit, although this should be confirmed on held-out data.
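The comparison can be extended to every feature subset, with a rough out-of-sample check added. The sketch below is one possible approach, not the only one: it uses `cross_val_predict` with `LeaveOneOut` so that each car is predicted by a model trained on the other nine, then scores the pooled predictions with `r2_score`. The feature labels in `names` are introduced here for readability.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Same car data as above; the labels are added here for readability
names = ['weight', 'horsepower', 'acceleration']
X = np.array([[3504, 130, 12], [3693, 165, 11.5], [3436, 150, 11], [3433, 150, 12], [3449, 140, 10.5],
              [4341, 198, 10], [4354, 220, 9], [4312, 215, 8.5], [4425, 225, 10], [3850, 190, 8]])
y = np.array([18, 15, 18, 16, 17, 15, 14, 14, 14, 15], dtype=float)

results = {}
for k in range(1, len(names) + 1):
    for cols in combinations(range(len(names)), k):
        X_sub = X[:, list(cols)]
        # R^2 on the data the model was fitted to
        train_r2 = LinearRegression().fit(X_sub, y).score(X_sub, y)
        # Predict each car from a model trained on the other nine
        loo_pred = cross_val_predict(LinearRegression(), X_sub, y, cv=LeaveOneOut())
        label = tuple(names[c] for c in cols)
        results[label] = (train_r2, r2_score(y, loo_pred))

for label, (train_r2, loo_r2) in results.items():
    print(f'{", ".join(label):<40} train R^2 = {train_r2:.4f}   leave-one-out R^2 = {loo_r2:.4f}')
```

The training column can only go up as features are added, while the leave-one-out column can go down. With ten cars both estimates are noisy, so the output is best read as a rough ranking rather than a precise measurement.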