Solving Statistics: Should I use Pearson’s or Spearman’s correlation coefficient?

Background

A correlation coefficient is a number that indicates the strength and direction of the association between two variables, X and Y (1). Two types of correlation coefficient are commonly used for continuous, numerical data: Pearson’s correlation coefficient and Spearman’s rank correlation coefficient.

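Pearson’s coefficient is only suitable for describing the association between X and Y if the relationship between them is linear. Spearman’s coefficient is calculated from the ranks of the values, so strictly it only requires the relationship to be monotonic (consistently increasing or consistently decreasing); the examples in this paper all concern linear relationships. A linear relationship means that if you were to plot the values of X against Y in a scatterplot, the points representing the data would fall around a straight line. If they fall around a curve, the relationship is non-linear.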

There are a number of differences between the two coefficients. In this paper, we focus on outliers. Outliers are values in a variable that, in some way, fall a substantial distance from the other values.

Pearson’s correlation coefficient should only be used if neither X nor Y has any outliers. Spearman’s rank correlation coefficient can be used whether or not X, Y or both have outliers.
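To make the difference concrete, here is a minimal Python sketch that computes both coefficients using only the standard library. The body mass index and blood pressure values are invented for illustration, and the function names (pearson, average_ranks, spearman) are our own rather than part of any library. Spearman’s coefficient is computed here as Pearson’s coefficient applied to the ranks of the values, with tied values sharing their average rank.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either variable is constant.
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 upwards; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: the last patient has a very high body mass index.
bmi = [20, 22, 24, 26, 28, 60]
sbp = [110, 115, 120, 125, 130, 132]

print("Pearson, all patients:   ", round(pearson(bmi, sbp), 2))
print("Spearman, all patients:  ", round(spearman(bmi, sbp), 2))
print("Pearson, outlier removed:", round(pearson(bmi[:-1], sbp[:-1]), 2))
```

In this invented example, the single outlier pulls Pearson’s coefficient well below 1, while Spearman’s coefficient is unaffected because the outlier is still simply the highest-ranked value.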

Question

I want to determine whether patients who have a higher body mass index tend to have a higher systolic blood pressure than those who have a lower body mass index. I know I have to examine the correlation between blood pressure and body mass index. The problem is that I do not know which correlation coefficient I should calculate. I need to choose between Pearson’s correlation coefficient and Spearman’s rank correlation coefficient. Can you help me figure out which one I should use?

Answer

Let us look at some examples. In Figure 1, we present histograms of body mass index for healthy subjects and patients. The histogram for healthy subjects shows that all of the values are fairly close together. We can conclude that body mass index for healthy subjects does not have any outliers. Hence, Pearson’s correlation coefficient is suitable for representing the association between body mass index and systolic blood pressure for healthy subjects. However, keep in mind that there also needs to be a linear relationship between body mass index and systolic blood pressure.

The second histogram, which shows body mass index for patients, contains three values that are substantially larger than the others. These values are outliers. Hence, we can conclude that body mass index for patients does not follow a normal distribution and that Pearson’s correlation coefficient is not suitable for representing this association for patients. Spearman’s correlation coefficient is suitable if there is a linear relationship between body mass index and systolic blood pressure for patients.
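Outliers in the histograms are judged by eye. If a numerical check is wanted as well, one common convention (not the method used in this paper) is to flag values lying more than 1.5 times the interquartile range below the first quartile or above the third quartile. The sketch below applies that rule to invented body mass index values; the function name flag_outliers and the data are hypothetical.

```python
from statistics import quantiles


def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Invented body mass index values (kg/m^2), for illustration only.
healthy = [20, 21, 21, 22, 22, 23, 23, 24, 24, 25, 25, 26, 26, 27, 27, 28]
patients = healthy + [41, 44, 47]  # three unusually large values

print(flag_outliers(healthy))
print(flag_outliers(patients))
```

A rule like this is only a screening aid. Whether a flagged value should be treated as an outlier remains a judgement about the data.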

In Figure 2, we present scatterplots of body mass index against systolic blood pressure for students and lecturers. For students, there is a clear linear relationship between these variables. This suggests that presenting the value of a correlation coefficient is appropriate. Because there are no substantial outliers, both Pearson’s and Spearman’s correlation coefficients would be appropriate to use.

You may hope to move on from correlation to linear regression, perhaps with more explanatory variables. In this case, Pearson’s correlation coefficient may be more appropriate, as it is mathematically closely linked to linear regression.
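The link can be written out explicitly. For a simple linear regression of Y on X with a single explanatory variable, the standard formulas are given below. The symbols b (the fitted slope), s_X and s_Y (the sample standard deviations of X and Y) and R^2 (the proportion of variance explained by the regression) are our notation, not the paper’s.

```latex
r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i}(x_i - \bar{x})^2 \, \sum_{i}(y_i - \bar{y})^2}},
\qquad
b = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2}
  = r \, \frac{s_Y}{s_X},
\qquad
R^2 = r^2
```

So the slope always has the same sign as Pearson’s r, and the squared coefficient gives the share of the variation in Y that is accounted for by the fitted straight line.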

Whichever correlation coefficient you calculate, its value will lie between -1 and +1. If it is between 0 and 1, then when one variable increases, the other tends to increase as well: your variables are positively correlated. If it is between -1 and 0, then when one variable increases, the other tends to decrease: your variables are negatively correlated. If your correlation coefficient equals 1 or -1, your variables are perfectly correlated; for Pearson’s coefficient, this means that all points in a scatterplot of your variables fall exactly on a straight line. This is almost never seen in real-life data.

Figure 1: Histograms of body mass index for healthy subjects and patients

Figure 2: The relationship between body mass index and systolic blood pressure for students and lecturers

N. van de Klundert 1 & R. Holman 2

1 Medical student, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands

2 Clinical Research Unit, Academic Medical Center, Amsterdam, The Netherlands

References

  1. Petrie A, Sabin C. Medical Statistics at a Glance. West Sussex, UK: Wiley-Blackwell; 2009.
