Box-Cox Transformation.

Ronak Chhatbar
Aug 25, 2019


Real-world data is not always distributed the way we want, that is, normally. It usually follows some distribution we know little about: sometimes it is skewed to the right, other times it has a long tail. That leaves us missing the normal distribution. Why do we miss the normal distribution, you ask?

The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. It is a symmetric distribution where most of the observations cluster around the central peak, and the probabilities for values further away from the mean taper off equally in both directions.

The Box-Cox transformation is a particularly useful family of power transformations. It is defined as:

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln y, & \lambda = 0
\end{cases}
$$

where y is the response variable, which must be strictly positive, and λ is the transformation parameter. For λ = 0, the natural log of the data is taken instead of the power formula. Here λ is a parameter that has to be tuned according to the dataset.
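To make the definition concrete, here is a minimal sketch that applies the formula by hand for a fixed λ and compares it with `scipy.stats.boxcox` called with an explicit `lmbda`. The helper name `box_cox` and the sample values are made up for illustration.

```python
import numpy as np
from scipy import stats


def box_cox(y, lmbda):
    """Apply the Box-Cox formula to strictly positive data for a fixed lambda.

    Hypothetical helper written for this example; it is not part of SciPy.
    """
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox is only defined for strictly positive data")
    if lmbda == 0:
        return np.log(y)  # the lambda = 0 case is the natural log
    return (y ** lmbda - 1.0) / lmbda


sample = np.array([0.5, 1.0, 2.0, 4.0, 8.0])  # made-up positive values

manual_half = box_cox(sample, 0.5)
manual_zero = box_cox(sample, 0.0)

# When lmbda is passed explicitly, scipy.stats.boxcox returns only the
# transformed array, so the two results can be compared directly.
scipy_half = stats.boxcox(sample, lmbda=0.5)
scipy_zero = stats.boxcox(sample, lmbda=0.0)

print("lambda = 0.5 matches SciPy:", np.allclose(manual_half, scipy_half))
print("lambda = 0   matches SciPy:", np.allclose(manual_zero, scipy_zero))
```

Note that a value of 1 maps to 0 for every λ, which is one reason the formula subtracts 1 before dividing.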

Let’s see Box-Cox in action

Import the necessary libraries

from scipy import stats
import pandas as pd
import numpy as np
import pylab
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter magic: drop this line when running as a plain Python script
%matplotlib inline

Let’s create a skewed distribution

skewed_dist = stats.loggamma.rvs(5, size=10000) + 5

Now let’s create a normal distribution

normal_dist = np.random.normal(0, 1, 10000)

Now let’s calculate the skewness of both distributions and look at their plots

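# histplot with kde=True replaces the deprecated distplot
sns.histplot(normal_dist, kde=True)
plt.show()
print("Skewness for the normal distribution:", stats.skew(normal_dist))
sns.histplot(skewed_dist, kde=True)
plt.show()
print("Skewness for the skewed distribution:", stats.skew(skewed_dist))

Figure: Distribution plot for normal_dist
Figure: Distribution plot for skewed_dist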

Here you can see that the second distribution is left-skewed.

skewness = 0 : symmetric, as for normally distributed data.
skewness > 0 : more weight in the right tail of the distribution.
skewness < 0 : more weight in the left tail of the distribution.
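The sign convention can be checked numerically. The sketch below draws an exponential sample (long tail to the right), mirrors it to get a long tail to the left, and compares both with a symmetric normal sample. The seed and sample sizes are arbitrary choices for this example.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

right_tailed = rng.exponential(scale=1.0, size=10_000)  # long tail to the right
left_tailed = -right_tailed                             # mirror image: long tail to the left
symmetric = rng.normal(size=10_000)                     # no preferred tail

skew_right = skew(right_tailed)
skew_left = skew(left_tailed)
skew_symmetric = skew(symmetric)

print("right-tailed sample:", skew_right)     # positive
print("left-tailed sample: ", skew_left)      # negative
print("symmetric sample:   ", skew_symmetric) # close to zero
```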

Pearson’s Coefficient of Skewness

skewness = 3(X − Me) / σ

where X = mean of the distribution, Me = median of the distribution, and σ = standard deviation.
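As a rough sketch, Pearson's coefficient can be computed directly from the mean, median, and standard deviation. Keep in mind that `scipy.stats.skew` computes the moment-based Fisher-Pearson coefficient rather than this median-based one, so the two numbers differ in magnitude even though they agree in sign for this sample. The function name `pearson_median_skewness` is hypothetical, and the seed is arbitrary.

```python
import numpy as np
from scipy import stats


def pearson_median_skewness(x):
    """Pearson's coefficient of skewness: 3 * (mean - median) / std.

    Hypothetical helper for illustration; not a SciPy function.
    """
    x = np.asarray(x, dtype=float)
    return 3.0 * (x.mean() - np.median(x)) / x.std()


# Same kind of left-skewed sample as in the article, with a fixed seed
skewed_sample = stats.loggamma.rvs(5, size=10_000, random_state=0) + 5

pearson_value = pearson_median_skewness(skewed_sample)
moment_value = stats.skew(skewed_sample)

print("Pearson (median-based) skewness:", pearson_value)
print("scipy.stats.skew (moment-based):", moment_value)
```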

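The probability plot is a graphical technique for assessing whether or not a data set follows a normal distribution. It plots the sorted sample values against the quantiles of the normal distribution, so normally distributed data falls close to a straight line.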

stats.probplot(normal_dist, dist="norm", plot=pylab)
pylab.show()
stats.probplot(skewed_dist, dist="norm", plot=pylab)
pylab.show()

Figure: Probability-Plot for normal_dist
Figure: Probability-Plot for skewed_dist

In the probability plot for the skewed data, we can see that the data is not normally distributed, since the points don’t align with the red line.
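Reading the plot by eye can be backed up with a number. Called without a `plot` argument, `stats.probplot` returns the plotted quantiles together with the least-squares fit, including the correlation coefficient `r` of the points with the fitted line. A minimal sketch, with seeds chosen arbitrarily:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed
normal_sample = rng.normal(0, 1, 10_000)
skewed_sample = stats.loggamma.rvs(5, size=10_000, random_state=42) + 5

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r))
_, (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

# The closer r is to 1, the closer the points lie to the straight line
print("r for the normal sample:", r_normal)
print("r for the skewed sample:", r_skewed)
```

A lower `r` for the skewed sample reflects the same bending away from the line that is visible in the plot.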

Now we apply our box-cox Transformation and plot it

skewed_box_cox, lmda = stats.boxcox(skewed_dist)
# histplot with kde=True replaces the deprecated distplot
sns.histplot(skewed_box_cox, kde=True)

Figure: Distribution after applying Box-Cox

Let’s check the probability plot to see whether the transformed data is normally distributed, and print the lambda value that was chosen.

stats.probplot(skewed_box_cox, dist="norm", plot=pylab)
pylab.show()
print("lambda parameter for Box-Cox Transformation is:", lmda)

Figure: lambda parameter and Probability-Plot for Box-Cox Transformation
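If you want more than a point estimate of lambda, `stats.boxcox` also accepts an `alpha` argument and then returns a confidence interval for lambda, while `stats.boxcox_normmax` estimates lambda without transforming the data. The sketch below uses a freshly generated sample with an arbitrary seed, so the numbers will not match the ones printed above.

```python
import numpy as np
from scipy import stats

skewed_sample = stats.loggamma.rvs(5, size=10_000, random_state=0) + 5

# alpha=0.05 asks boxcox to also return a 95% confidence interval for lambda
transformed, lmbda, (ci_low, ci_high) = stats.boxcox(skewed_sample, alpha=0.05)

# boxcox_normmax only estimates lambda; method="mle" maximizes the log-likelihood
lmbda_mle = stats.boxcox_normmax(skewed_sample, method="mle")

print("lambda from boxcox:        ", lmbda)
print("95% confidence interval:   ", (ci_low, ci_high))
print("lambda from boxcox_normmax:", lmbda_mle)
```

If the interval contains a round value such as 0 (log) or 0.5 (square root), some practitioners prefer that simpler transformation for interpretability.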

In conclusion, the Box-Cox transformation attempts to transform a set of data to a normal distribution by finding the value of λ that makes the transformed data as close to normal as possible (scipy.stats.boxcox picks the λ that maximizes the log-likelihood). This allows you to perform calculations that require the data to be normally distributed. However, the Box-Cox transformation does not always convert the data to a normal distribution, so you must check the transformation to ensure it worked.
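One simple way to do that check, sketched here under the assumption that comparing skewness before and after is a good enough first test, is shown below. It also uses `scipy.special.inv_boxcox` to map the transformed values back to the original scale, which is useful when results have to be reported in the original units. The seed is arbitrary.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

skewed_sample = stats.loggamma.rvs(5, size=10_000, random_state=0) + 5

# Box-Cox requires strictly positive input
assert skewed_sample.min() > 0

transformed, lmbda = stats.boxcox(skewed_sample)

skew_before = stats.skew(skewed_sample)
skew_after = stats.skew(transformed)
print("skewness before:", skew_before)
print("skewness after: ", skew_after)

# Undo the transformation with the same lambda
recovered = inv_boxcox(transformed, lmbda)
print("round trip recovers the data:", np.allclose(recovered, skewed_sample))
```

A skewness near zero is necessary but not sufficient for normality, so the probability plot remains a useful second check.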

About me

I am an Artificial Intelligence Developer at Wavelabs.ai. We at Wavelabs help you leverage Artificial Intelligence (AI) to revolutionize user experiences and reduce costs. We uniquely enhance your products using AI to reach your full market potential. We try to bring cutting-edge research into your applications. Have a look at us.

You can reach me on LinkedIn.
