
# Non Normal Data: How to deal with it?

One of the most common questions I get while mentoring GB/BB projects, as well as while training LSS belts, is "What should I do when I have non-normal data?" Normally distributed data takes center stage in statistics. A large number of statistical tests are based on the assumption of normality, which instills a lot of fear in project leaders when their data is not normally distributed.

Do read about what Normal Distribution and Probability distributions are before you go on.

A few years ago, some statisticians held the belief that when a process produced non-normally distributed data, there was something wrong with the process, or that the process was 'out of control'. In their view, the purpose of a control chart was to determine when processes were non-normal so they could be "corrected" and returned to normality. Fortunately, most statisticians and LSS practitioners today do not hold this belief. We recognize that there is nothing wrong with non-normal data, and that the preference for normally distributed data in statistics is due only to its simplicity.

Many processes naturally follow a Non Normal distribution of one type or another. Cycle time, calls per hour, customer waiting time, shrinkage, etc. are a few examples of such processes.

### Types of Non Normal Data Distributions

There are many types of Non-Normal distributions that a data set can follow, depending on the nature of the process, the data collection methodology used, the sample size, outliers in the data and so on. A few of the major Non-Normal distributions are listed below:

1. Beta Distribution
2. Exponential Distribution
3. Gamma Distribution
4. Inverse Gamma Distribution
5. Log Normal Distribution
6. Logistic Distribution
7. Maxwell-Boltzmann Distribution
8. Poisson Distribution
9. Skewed Distribution
10. Symmetric Distribution
11. Uniform Distribution
12. Unimodal Distribution
13. Weibull Distribution
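As a quick illustration of how several of these distributions fail a normality check, here is a short sketch using Python's `numpy` and `scipy` (the sample sizes, parameters and seed are my own choices, purely for demonstration; Minitab users would do this via a normality test instead):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated data from a few common distribution types
samples = {
    "exponential": rng.exponential(scale=2.0, size=500),
    "lognormal": rng.lognormal(mean=0.0, sigma=0.8, size=500),
    "uniform": rng.uniform(0, 10, size=500),
    "normal": rng.normal(loc=10, scale=1.5, size=500),
}

for name, data in samples.items():
    stat, p = stats.shapiro(data)  # Shapiro-Wilk normality test
    verdict = "looks normal" if p > 0.05 else "not normal"
    print(f"{name:12s} p = {p:.4f} -> {verdict}")
```

A p-value below 0.05 means the normality hypothesis is rejected for that sample.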

### Reasons for Non Normal Data Distribution

Before you decide which distribution your data follows, ensure that your data is free of measurement system variation and does not have any data stability issues. Please read my previous posts on What variation is and what Measurement System Variation is.

Many processes or data sets naturally fit a Non Normal distribution. For example, the number of accidents will tend to fit a Poisson distribution, and lifetimes of products usually fit a Weibull distribution. However, there may be times when your data is supposed to fit a normal distribution but does not. For example, data on the time taken to travel from home to office is usually expected to fit a normal distribution. If you see a Non Normal distribution for such data sets, it is advisable to check for the reasons below and correct your data if needed.


1. Outliers / Extreme values: Outliers can skew your distribution. The central tendency of your data set (the mean) is especially sensitive to outliers and may result in a Non-Normal distribution. Identify the outliers, which may be extremely high or extremely low values in the data set or special causes in the process, remove them, and then check for normality again. It is important that outliers are confirmed as truly special causes before they are eliminated. By its nature, normally distributed data contains a small percentage of extreme values; not every outlier is caused by a special reason. Extreme values should be removed from the data only if there are more of them than expected under normal conditions.
2. Subgroups / Overlap of two or more processes: A data set that combines data from two or more processes can also lead to a non-normal distribution. If you take two normally distributed data sets and merge them into one, the result will follow a bimodal distribution. The remedy in these situations is to determine the reasons that cause the bimodal or multimodal distribution and then stratify the data. Ensure that your data set is coherent and is not a mixture of multiple subgroups.
3. Insufficient data discrimination: Round-off errors or measurement devices with poor resolution/precision can make truly continuous and normally distributed data look discrete and non-normal. Usage of a more accurate measurement system or collection of more data points should be done to overcome insufficient data discrimination or an insufficient number of different values.
4. Small sample size: A small sample can make normally distributed data look scattered. For example, if you look at the distribution of the heights of 50 students in a particular class, you will see that it follows a normal distribution. However, if you randomly choose just three students from the same class, the data may follow a uniform or a skewed distribution, depending on which students are chosen. Increasing your sample size until you get a normal distribution usually resolves this issue.
5. Values Close to Process Boundaries: If a process has many values close to zero or close to a natural process boundary, the data distribution will skew to the right or left. In this case, a transformation, such as the Box-Cox power transformation, may help make the data normal. When comparing transformed data, everything under comparison must be transformed in the same way.
6. Sorted Data: Data collected from a normally distributed process can also fit a non-normal distribution if it represents just a sample / subset of the total output of the process. This happens when the collected data is sorted and then analyzed. Suppose there is a ring manufacturing process where the target is to produce rings with a diameter of 10 cm, and the USL and LSL are 10.25 cm and 9.75 cm respectively. If the ring diameter data were collected from such a process and all values outside the specification limits were removed, the remaining data would show a non-normal (truncated) distribution, even though the full data set was originally normally distributed.
7. Data Follows a Different Distribution: In addition to the above-mentioned reasons, where a normally distributed process can appear non-normal, there are many data types that follow a non-normal distribution by nature. In such cases, the data should be analyzed using tests that do not assume normality.
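Two of the remedies above can be sketched in code. The following is a minimal Python illustration (the data is simulated, and the 1.5×IQR rule and thresholds are conventional choices, not prescriptions from this article) of screening for outliers before removing anything, and of applying a Box-Cox transformation to right-skewed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.lognormal(mean=1.0, sigma=0.5, size=300)  # right-skewed "cycle time" data

# 1. Screen for outliers with the 1.5 * IQR rule (investigate before deleting!)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers out of {len(data)} points")

# 2. Box-Cox power transformation (requires strictly positive data)
transformed, lam = stats.boxcox(data)
print(f"Box-Cox lambda = {lam:.3f}")
print(f"Shapiro p before: {stats.shapiro(data)[1]:.4f}, after: {stats.shapiro(transformed)[1]:.4f}")
```

The transformed data should pass a normality test far more comfortably than the raw data; remember that any specification limits must be transformed with the same lambda before comparison.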

### How to deal with Non Normal Distribution

Once you have ensured that your data is non-normal due to the nature of the data / process itself, and not due to any of the above-mentioned reasons, you can proceed with the analysis. There are two ways to analyze non-normal data: either use non-parametric tests, which do not assume normality, or transform the data using an appropriate function so that it fits a normal distribution.

Several common tests, such as the t-test, ANOVA, Regression and DOE, assume normality. Strictly speaking, such tests should only be used on normally distributed data. However, these tests are reasonably robust to departures from normality, so you may still be able to run them, with caution, if your sample size is large enough.
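The reason large samples help is the central limit theorem: tests like the t-test compare means, and the sampling distribution of the mean approaches normality as the sample size grows, even when individual observations are skewed. A hedged simulation sketch (sample sizes and parameters are illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Individual observations: heavily right-skewed (exponential)
population = rng.exponential(scale=3.0, size=100_000)

# Skewness of the distribution of sample means shrinks as n grows,
# which is why mean-based tests become robust with larger samples.
skews = {}
for n in (5, 30, 200):
    means = population[: n * 500].reshape(500, n).mean(axis=1)
    skews[n] = stats.skew(means)
    print(f"n={n:4d}  skewness of sample means = {skews[n]:+.3f}")
```

The skewness of the sample means moves toward zero (normality) as n increases from 5 to 200.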

If you have a very small sample, a skewed sample, or one that naturally fits another distribution type, you should run a non-parametric test. A non-parametric test is one that does not assume the data fits any specific distribution type. Non-parametric tests include the Wilcoxon test, the Mann-Whitney test, Mood's Median test and the Kruskal-Wallis test. Below is a list of common tests that assume normality and their non-parametric equivalents:

| Test assuming normality | Non-parametric equivalent |
| --- | --- |
| 1-sample t-test | 1-sample Wilcoxon test |
| 2-sample t-test | Mann-Whitney test |
| One-way ANOVA | Kruskal-Wallis test / Mood's Median test |
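As a sketch of the non-parametric route (using Python's `scipy.stats` in place of Minitab, on simulated cycle-time data of my own invention), the Mann-Whitney test compares two skewed samples without assuming normality:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated cycle times (minutes) before and after an improvement;
# both samples are right-skewed (lognormal), so a 2-sample t-test
# would be questionable at this sample size.
before = rng.lognormal(mean=2.0, sigma=0.4, size=60)
after = rng.lognormal(mean=1.5, sigma=0.4, size=60)

# Mann-Whitney U: non-parametric alternative to the 2-sample t-test
u_stat, p = stats.mannwhitneyu(before, after, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p:.4f}")
```

A small p-value here indicates the two processes differ in location, with no normality assumption required.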

Generally, data needs to be statistically analyzed for two reasons. The first involves various tests to check that the data is stable and to calculate the process capability / sigma level: in the Measure phase of the project using pre-improvement project Y data, and in the Control phase using post-improvement project Y data. The second involves hypothesis testing in the Analyze phase and control charts in the Control phase. The equivalent non-parametric tests mentioned above are suitable for hypothesis testing.


Let us also look at how to calculate process capability and sigma level when the data is non-normal.

### Process Capability for Non Normal Data

When your data is normally distributed, you perform 'Capability Analysis > Normal' in Minitab to calculate the process sigma. This capability analysis assumes that the data is normal and calculates the Process Sigma (short term) and Cpk values accordingly.

However, when the data is non-normal, the same test cannot be used. The alternative is 'Capability Analysis > Non-Normal'. The figure below shows the path for this test.

One of the prerequisites for this test is knowing the exact distribution that the data follows, as is evident from the figure below.
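Under the hood, non-normal capability analysis fits the chosen distribution and uses its percentiles in place of the normal μ ± 3σ. A minimal Python sketch of that percentile method, assuming the data follows a 2-parameter Weibull distribution and using hypothetical spec limits of my own choosing:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = stats.weibull_min.rvs(c=2.0, scale=10.0, size=500, random_state=rng)

LSL, USL = 1.0, 25.0  # hypothetical specification limits

# Fit the assumed distribution (location fixed at 0 for a 2-parameter Weibull)
c, loc, scale = stats.weibull_min.fit(data, floc=0)
dist = stats.weibull_min(c, loc=loc, scale=scale)

# Percentile method: the 0.135th and 99.865th percentiles play the
# role of mu - 3*sigma and mu + 3*sigma in the normal-data formula
x_lo, x_med, x_hi = dist.ppf([0.00135, 0.5, 0.99865])
Ppk = min((USL - x_med) / (x_hi - x_med), (x_med - LSL) / (x_med - x_lo))
print(f"fitted shape = {c:.2f}, scale = {scale:.2f}, Ppk = {Ppk:.3f}")
```

This mirrors what Minitab reports for non-normal capability, which is why the fitted distribution must be identified correctly first: a wrong distribution gives wrong tail percentiles and hence a wrong capability index.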


Hence, the first task is to identify the distribution of the data using the 'Individual Distribution Identification' test in Minitab.

The output of this test is a set of probability plots, with a p-value for each distribution that it tests.
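Outside Minitab, a simplified version of this identification step can be scripted: fit several candidate distributions and compare goodness of fit, here with the Kolmogorov-Smirnov test (a rough sketch on simulated data; Minitab uses Anderson-Darling statistics, and the p-values below are approximate because the parameters are estimated from the same data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.gamma(shape=1.5, scale=3.0, size=400)  # process data of "unknown" shape

candidates = {
    "normal": stats.norm,
    "lognormal": stats.lognorm,
    "exponential": stats.expon,
    "gamma": stats.gamma,
    "weibull": stats.weibull_min,
}

results = {}
for name, dist in candidates.items():
    params = dist.fit(data)                       # maximum-likelihood fit
    stat, p = stats.kstest(data, dist.cdf, args=params)
    results[name] = p
    print(f"{name:12s} KS p-value = {p:.4f}")

best = max(results, key=results.get)
print("Best fitting candidate:", best)
```

The distribution with the highest p-value (and a p-value above 0.05) is the most plausible candidate to feed into the non-normal capability analysis.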