Sachin Naik

I am a Certified Lean Six Sigma practitioner and trainer. I have led / mentored 50+ Black Belt and 200+ Green Belt projects and trained 500+ mid and senior executives on ….

Non Normal Data: How to deal with it?

One of the most common questions I get while mentoring GB/BB projects, as well as while training LSS belts, is “What should I do when I have non-normal data?” Normally distributed data takes center stage in statistics. A large number of statistical tests are based on the assumption of normality, which instills a lot of fear in project leaders when their data is not normally distributed.

Do read about what the Normal Distribution and probability distributions are before you go on.

A few years ago, some statisticians held the belief that when a process produced non-normal data, something was wrong with the process, or that the process was ‘out of control’. In their view, the purpose of a control chart was to determine when processes were non-normal so they could be “corrected” and returned to normality. Fortunately, most statisticians and LSS practitioners today do not adhere to this belief. We now recognize that there is nothing wrong with non-normal data, and that the preferred use of normally distributed data in statistics is due to its simplicity and nothing more.

Many processes naturally follow a Non Normal distribution, often a specific type of it. Cycle time, calls per hour, customer waiting time, shrinkage, etc. are a few examples of such process metrics.

Types of Non Normal Data Distributions

There are many types of Non-Normal distributions that a data set can follow, based on the nature of the process, the data collection methodology used, the sample size, outliers in the data, etc. A few of the major Non-Normal distributions are listed below:

  1. Beta Distribution
  2. Exponential Distribution
  3. Gamma Distribution
  4. Inverse Gamma Distribution
  5. Log Normal Distribution
  6. Logistic Distribution
  7. Maxwell-Boltzmann Distribution
  8. Poisson Distribution
  9. Skewed Distribution
  10. Symmetric Distribution
  11. Uniform Distribution
  12. Unimodal Distribution
  13. Weibull Distribution
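As a quick illustration, a few of the distributions above can be simulated and checked against normality. This is a hedged sketch using SciPy (the article itself uses Minitab); the parameters and sample sizes are made up for the example, not taken from any real process:

```python
# Draw samples from a few non-normal distributions and check each one
# with the Shapiro-Wilk normality test. All parameters are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

samples = {
    "Exponential": rng.exponential(scale=2.0, size=200),
    "Log Normal": rng.lognormal(mean=0.0, sigma=0.8, size=200),
    "Poisson": rng.poisson(lam=4.0, size=200).astype(float),
    "Weibull": rng.weibull(a=1.5, size=200),
    "Uniform": rng.uniform(0.0, 10.0, size=200),
}

for name, data in samples.items():
    stat, p = stats.shapiro(data)
    verdict = "looks non-normal" if p < 0.05 else "cannot reject normality"
    print(f"{name:12s} Shapiro-Wilk p = {p:.4f} -> {verdict}")
```

A low p-value (below 0.05) means the normality assumption is rejected for that sample, which is what you would expect for each of these distribution types.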

Reasons for Non Normal Data Distribution

Before you decide which distribution your data follows, ensure that your data is free of measurement system variation and does not have any data stability issues. Please read my previous posts on What variation is and what Measurement System Variation is.

Many processes or data sets naturally fit a Non Normal distribution. For example, the number of accidents tends to fit a Poisson distribution, and lifetimes of products usually fit a Weibull distribution. However, there may be times when your data is supposed to fit a normal distribution but does not. For example, the time taken to reach the office from home is usually expected to fit a normal distribution. If you face a Non Normal distribution for such data sets, check for the reasons below and correct your data if needed.

Download my latest eBook – Lean Six Sigma Acronyms

Contains 220+ LSS acronyms and abbreviations, a handy reference guide for all LSS Practitioners. And it's FREE!

  1. Outliers / Extreme values: Outliers can skew your distribution. The central tendency of your data set (the Mean) is especially sensitive to outliers and may result in a Non-Normal distribution. Identify all the outliers, which may be extremely high or extremely low values in the data set or special causes in the process, and remove them. Once done, check for normality again. It is important that outliers are identified as truly special causes before they are eliminated. The nature of normally distributed data is that a small percentage of extreme values can be expected; not every outlier is caused by a special reason. Extreme values should be removed from the data only if there are more of them than expected under normal conditions.
  2. Subgroups / Overlap of two or more processes: A data set that combines data from two or more processes can also lead to a non-normal distribution. If you take two data sets that each follow a normal distribution and merge them into one, the result will follow a bimodal distribution. The remedial action in these situations is to determine the reasons that cause the bimodal or multimodal distribution and then stratify the data. Ensure that your data set is coherent and is not a mixture of multiple subgroups.
  3. Insufficient data discrimination: Round-off errors or measurement devices with poor resolution/precision can make truly continuous and normally distributed data look discrete and non-normal. Use a more accurate measurement system or collect more data points to overcome insufficient data discrimination or an insufficient number of distinct values.
  4. Small Sample size: A small sample can make normally distributed data look scattered. For example, if you look at the distribution of the heights of 50 students in a particular class, you will see that it follows a normal distribution. However, if you randomly choose just three students from the same class, the data may follow a uniform or a skewed distribution, depending on which students are chosen. Increasing your sample size until you get a normal distribution usually resolves this issue.
  5. Values Close to Process Boundaries: If a process has many values close to zero or close to a natural process boundary, the data distribution will skew to the right or left. In this case, a transformation, such as the Box-Cox power transformation, may help make the data normal. When comparing transformed data, everything under comparison must be transformed in the same way.
  6. Sorted Data: Data collected from a normally distributed process can also fit a non-normal distribution if it represents just a sample / subset of the total output of the process. This happens when the collected data is sorted and then analyzed. Suppose there is a ring manufacturing process where the target is to produce rings with a diameter of 10 cm, with a USL of 10.25 cm and an LSL of 9.75 cm. If the ring diameter data were collected from such a process and all values outside the specification limits were removed, the remaining data would show a non-normal (truncated) distribution, even though the data as originally collected was normally distributed.
  7. Data Follows a Different Distribution: In addition to the above-mentioned reasons, where data from a normally distributed process can show as non-normal, there are many data types that follow a non-normal distribution by nature. In such cases, the data should be analyzed using tests that do not assume normality.
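Reason 2 (overlap of subgroups) is easy to demonstrate. The sketch below, with made-up means and sigmas, merges two individually normal subgroups and shows that the mixture fails a normality test, which is why stratifying the data is the remedy:

```python
# Two individually normal subgroups, when merged, fail a normality test.
# Subgroup parameters are illustrative (e.g. output from two machines).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
line_a = rng.normal(loc=10.0, scale=0.5, size=150)   # hypothetical machine A
line_b = rng.normal(loc=14.0, scale=0.5, size=150)   # hypothetical machine B
merged = np.concatenate([line_a, line_b])            # bimodal mixture

for label, data in [("Line A", line_a), ("Line B", line_b), ("Merged", merged)]:
    p = stats.shapiro(data).pvalue
    print(f"{label:7s} Shapiro-Wilk p = {p:.4f}")
# Each subgroup alone typically passes the test; the merged set fails,
# so the fix is to stratify and analyze the subgroups separately.
```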

How to deal with Non Normal Distribution

Once you have ensured that your data is non-normal due to the nature of the data / process itself, and not due to any of the above-mentioned reasons, you can proceed with analyzing it. There are two ways to analyze non-normal data: either use non-parametric tests, which do not assume normality, or transform the data using an appropriate function, forcing it to fit a normal distribution.

Several tests, such as the t-test, ANOVA, Regression and DOE, assume normality of the data. Strictly speaking, such tests should be used only for normally distributed data. However, these tests are fairly robust to departures from normality, so you may still be able to run them, with caution, if your sample size is large enough.

If you have a very small sample, a sample that is skewed, or one that naturally fits another distribution type, you should run a non-parametric test. A non-parametric test is one that does not assume that the data fits any specific distribution type. Non-parametric tests include the Wilcoxon test, the Mann-Whitney test, Mood's Median test and the Kruskal-Wallis test. Below is the list of tests that assume normality, alongside their equivalent non-parametric tests.

List of tests that assume normality and their non-parametric equivalents
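To make the pairing concrete, here is a hedged sketch that runs a parametric test next to its non-parametric equivalent on the same simulated, skewed data. The data and parameters are invented for illustration; SciPy is used as a stand-in for Minitab:

```python
# Compare a 2-sample t-test (assumes roughly normal data) with its
# non-parametric equivalent, the Mann-Whitney U test, on skewed data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.exponential(scale=5.0, size=60)   # skewed cycle times (hypothetical)
after = rng.exponential(scale=3.5, size=60)    # hypothetical improved process

# Parametric: 2-sample t-test
t_stat, t_p = stats.ttest_ind(before, after)

# Non-parametric: Mann-Whitney U makes no normality assumption
u_stat, u_p = stats.mannwhitneyu(before, after, alternative="two-sided")

print(f"t-test p       = {t_p:.4f}")
print(f"Mann-Whitney p = {u_p:.4f}")
```

With heavily skewed data like this, the Mann-Whitney result is the more trustworthy of the two, since its assumptions actually hold.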

Generally, data needs to be statistically analyzed for two reasons. The first involves various tests to see if the data is stable and to calculate the process capability / sigma levels: in the Measure phase of the project using pre-improvement project Y data, and in the Control phase using post-improvement project Y data. The second involves hypothesis testing in the Analyze phase and control charts in the Control phase. The equivalent non-parametric tests mentioned above are suitable for hypothesis testing.


Let us also look at how to calculate process capability and sigma level when the data is non-normal.

Process Capability for Non Normal Data

When your data is normally distributed, you perform ‘Capability Analysis > Normal’ in Minitab to calculate process sigma. This capability analysis assumes that the data is normal and accordingly calculates the Process Sigma (short term) and Cpk values.
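The arithmetic behind the normal capability numbers can be sketched roughly as below. This is not a replacement for Minitab's output (Minitab's short-term sigma estimate, for instance, uses within-subgroup variation); the data are simulated, reusing the ring example's specification limits from this article:

```python
# Rough sketch of Cpk and the Z benchmark for normally distributed data.
# Spec limits follow the ring example: target 10 cm, USL 10.25, LSL 9.75.
import numpy as np

rng = np.random.default_rng(1)
diameters = rng.normal(loc=10.0, scale=0.07, size=100)  # simulated ring data

usl, lsl = 10.25, 9.75
mean = diameters.mean()
sigma = diameters.std(ddof=1)  # overall std; Minitab's short-term estimate differs

# Cpk: distance from the mean to the nearer spec limit, in units of 3*sigma
cpk = min(usl - mean, mean - lsl) / (3 * sigma)
z_bench = 3 * cpk  # approximate process sigma (Z) implied by Cpk

print(f"mean = {mean:.4f}, sigma = {sigma:.4f}")
print(f"Cpk = {cpk:.2f}, Z = {z_bench:.2f}")
```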

Step 1 – Capability analysis for non normal data distribution
Step 2 – Capability analysis for non normal data distribution
Step 3 – Capability analysis for non normal data distribution

However, when the data is non-normal, the same test cannot be used. The alternative is ‘Capability Analysis > Non-Normal’. The figure below shows the path for this test.

Step 4 – Capability analysis for non normal data distribution

One of the prerequisites for this test is to know the exact distribution that the data follows, as is evident from the figure below.

Step 5 – Capability analysis for non normal data distribution
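Under the hood, non-normal capability analysis fits the chosen distribution and uses its percentiles in place of the mean ± 3σ of the normal case. The sketch below shows this percentile idea with SciPy for a Weibull fit; the distribution, spec limit, and data are all illustrative assumptions, and Minitab's exact method may differ:

```python
# Percentile-based capability sketch for non-normal data: fit the known
# distribution, then use its 50th and 99.865th percentiles instead of
# the mean and mean + 3*sigma. All numbers here are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
cycle_times = rng.weibull(a=2.0, size=200) * 10.0  # simulated non-normal data

shape, loc, scale = stats.weibull_min.fit(cycle_times, floc=0)
dist = stats.weibull_min(shape, loc=loc, scale=scale)

p_mid, p_hi = dist.ppf([0.5, 0.99865])  # median and upper 3-sigma-equivalent
usl = 25.0  # hypothetical one-sided spec limit

# Upper capability index with percentiles in place of mean + 3*sigma:
ppu = (usl - p_mid) / (p_hi - p_mid)
print(f"median = {p_mid:.2f}, 99.865th pct = {p_hi:.2f}, Ppu = {ppu:.2f}")
```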


Hence, the first task is to identify the distribution of the data using the ‘Individual Distribution Identification’ test in Minitab.

Identify distribution for non normal data – 1
Identify distribution for non normal data – 2

The output of this test is multiple probability plots with p-values for each distribution that it tests.

Identify distribution for non normal data – 3

The distribution with the highest p-value is the best-fit distribution for the data. Select this distribution in the dialog box of the capability analysis test to calculate process sigma.
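The same "fit many candidates, keep the highest p-value" idea can be sketched in code. This is a simplified stand-in for Minitab's Individual Distribution Identification: it uses the Kolmogorov-Smirnov test rather than Minitab's Anderson-Darling based p-values, and p-values computed against fitted parameters are known to be optimistic, so treat it as illustrative only:

```python
# Fit several candidate distributions and rank them by goodness-of-fit
# p-value (Kolmogorov-Smirnov, with parameters estimated from the data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.lognormal(mean=1.0, sigma=0.5, size=200)  # simulated process data

candidates = ["norm", "lognorm", "expon", "weibull_min", "gamma"]
results = {}
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(data)              # maximum-likelihood fit
    _, p = stats.kstest(data, name, args=params)
    results[name] = p

for name, p in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} p = {p:.4f}")
print(f"best fit: {max(results, key=results.get)}")
```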

For some data sets, no distribution yields a p-value greater than 0.05, which suggests that the data does not follow any of the distributions the test checks for. In such scenarios, one of the preferred remedial actions is to transform the non-normal data into normal data using a data transformation method. The Box-Cox power transformation and the Johnson transformation are the most preferred methods for such data transformation. More about data transformation using these methods in the next article.
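As a small preview, a Box-Cox transformation can be sketched with SciPy's `boxcox`, which picks the power parameter lambda by maximum likelihood. The data here are simulated; note that Box-Cox requires strictly positive values:

```python
# Box-Cox sketch: transform skewed data toward normality, then re-check
# with Shapiro-Wilk. Input must be strictly positive for boxcox.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
skewed = rng.exponential(scale=3.0, size=150)  # simulated skewed data

p_before = stats.shapiro(skewed).pvalue
transformed, lam = stats.boxcox(skewed)        # lambda chosen by max likelihood
p_after = stats.shapiro(transformed).pvalue

print(f"lambda = {lam:.3f}")
print(f"Shapiro p before = {p_before:.4f}, after = {p_after:.4f}")
# As noted earlier: when comparing groups, apply the same lambda to all.
```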

If you like what you read, please feel free to share this with your friends and colleagues using the sharing buttons below. Follow my blog. And do subscribe to the blog so you can be notified of new posts.

