The distribution of a continuous numerical variable tells you how likely each possible outcome, or group of outcomes, is to occur. It can be described as normal or non-normal. A normal distribution is a specific mathematical concept: it follows a symmetric, bell-shaped curve and is completely defined by its mean and standard deviation. Approximately 68% of the observations fall within one standard deviation of the mean, and approximately 95% fall within two standard deviations of the mean. The normal distribution is an important concept in statistics. If your data follow a normal distribution, you can perform a number of statistical tests that you cannot perform if the distribution is non-normal [1].
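The 68% and 95% figures can be checked with a short sketch using Python's standard library. The helper name share_within and the mean of 120 with standard deviation 15 are illustrative assumptions, not values from the study.

```python
from statistics import NormalDist


def share_within(dist, k):
    """Probability that a value lies within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


standard = NormalDist(mu=0, sigma=1)
shifted = NormalDist(mu=120, sigma=15)  # hypothetical mean and standard deviation

for name, dist in [("standard", standard), ("shifted", shifted)]:
    # The two proportions are the same for every normal distribution,
    # because the curve is fully determined by its mean and standard deviation.
    print(name, round(share_within(dist, 1), 2), round(share_within(dist, 2), 2))
```

Both distributions give about 0.68 and 0.95, whatever mean and standard deviation are chosen.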
I am currently conducting a study on the effects of medication on people with Alzheimer’s disease. Part of this study involves collecting a number of numerical values. I wanted to perform a multiple linear regression with these data, but one variable is not normally distributed. How do I solve this problem? Is it still possible to perform the statistical test that I want?
Normally the advice would be to use statistical tests designed for data with a non-normal distribution, but since you want to perform a multiple linear regression, that is less straightforward. However, linear regression does not actually assume that the variables themselves are normally distributed; the assumption is that the residuals follow a normal distribution. Even if your variables do not follow a normal distribution, the residuals may do so. Hence, you can perform the linear regression analysis and then examine whether the residuals follow a normal distribution.
If there are extreme values, known as outliers, in your data, they can skew the distribution to the left or right. Incorrectly recorded values can have a similarly large effect. You can examine what the distribution looks like when these values are removed. However, remember that you have to justify why it is acceptable to delete a value from your data. You should not delete values just because it makes your analysis easier!
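One way to see how much a single extreme value matters is to summarise the data with and without it. The sketch below uses invented measurements and flags values with the common 1.5 × interquartile range rule. That rule is an assumption made for illustration and is not prescribed by the text above; a flagged value still needs a substantive justification before it is removed.

```python
from statistics import mean, median, quantiles


def flag_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles (assumed rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements; 95 might be a data entry error or a genuine extreme.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

flagged = flag_outliers(values)
kept = [v for v in values if v not in flagged]

print("flagged for review:", flagged)
print("mean and median, all values:", mean(values), median(values))
print("mean and median, flagged values removed:", round(mean(kept), 2), median(kept))
```

The mean changes considerably when the flagged value is left out, while the median barely moves, which shows how strongly one value can pull on the distribution.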
A second way to solve your problem is to transform your data. A non-normal distribution may become closer to a normal distribution if the values are transformed using a mathematical formula. The logarithmic transformation is often used. For example, if your values are 12, 18, 27, 35 and 85, their natural logarithms are 2.48, 2.89, 3.30, 3.56 and 4.44. As you can see, these numbers are much closer together. Figure 1 shows data that are non-normally distributed on the original scale; after a natural logarithmic transformation, the distribution follows the symmetric bell curve much more closely.
Figure 1. Data before and after a natural logarithmic transformation
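The arithmetic in the example can be reproduced with a few lines of Python. This is a minimal sketch using the five example values from the text; the variable names are illustrative.

```python
import math

values = [12, 18, 27, 35, 85]

# math.log with a single argument returns the natural logarithm.
log_values = [math.log(v) for v in values]

print([round(v, 2) for v in log_values])  # [2.48, 2.89, 3.3, 3.56, 4.44]
```

Note that the natural logarithm is only defined for values greater than zero, so this transformation cannot be applied directly to data containing zeros or negative values.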
Figure 2 shows the relationship between an outcome and its natural logarithm. Notice that the larger an outcome gets, the less impact a further increase has on its natural logarithm. This pulls in large values and makes the data set more workable.
The observed outcome is on the x (horizontal) axis and the natural logarithm of the outcome is on the y (vertical) axis. An original value of 2 is transformed to 0.69 and an original value of 8 is transformed to 2.08. You can also use the square root (√) transformation.
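The flattening of the curve can also be seen numerically. The sketch below tabulates the natural logarithm and the square root for a set of outcomes that double at each step; the outcomes are chosen purely for illustration.

```python
import math

outcomes = [2, 4, 8, 16, 32, 64]

# Each row holds the original value, its natural logarithm and its square root.
rows = [(v, math.log(v), math.sqrt(v)) for v in outcomes]

for value, log_value, root_value in rows:
    print(value, round(log_value, 2), round(root_value, 2))
```

Every doubling of the outcome adds the same amount (about 0.69) to the natural logarithm, so the step from 32 to 64 counts for no more than the step from 2 to 4. The square root also compresses large values, but less strongly.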
If you use a transformation, you may have to perform a back transformation to enable you to interpret the results of your statistical analysis.
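As an illustration of a back transformation, the sketch below takes the mean of the log-transformed example values 12, 18, 27, 35 and 85 and converts it back to the original scale with the exponential function. The variable names are illustrative.

```python
import math
from statistics import geometric_mean, mean

values = [12, 18, 27, 35, 85]

# Analyse on the log scale: here, simply take the mean of the logarithms.
mean_log = mean(math.log(v) for v in values)

# Back transform with the exponential function to return to the original units.
back_transformed = math.exp(mean_log)

print("mean on log scale:", round(mean_log, 2))
print("back-transformed mean:", round(back_transformed, 2))
print("arithmetic mean of original values:", mean(values))
```

The back-transformed value is the geometric mean of the original data, not the arithmetic mean, and it is smaller than the arithmetic mean of 35.4. This matters when you report and interpret results obtained on a transformed scale.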
Figure 2. The natural logarithmic transformation of the outcome
N. van de Klundert, Master student, Academic Medical Center, Amsterdam, the Netherlands
R. Holman, Clinical Research Unit, Academic Medical Center, Amsterdam, the Netherlands
1. Petrie A, Sabin C. Medical Statistics at a Glance. West Sussex, UK: Wiley-Blackwell; 2009.