Scatter plots

Background

This time there was no question about the statistics of a particular paper. Therefore, the editor asked me if I could elaborate on a specific type of figure for this edition of the AMSj: the scatter plot.

Question

When can a scatter plot be used and when not? How should the plot be interpreted, and are there any alternatives? What kinds of variables can be used, and can the plot include more than two variables?

Figure 2 Scatter plot of cholesterol by sex (males on the left, females on the right) for 182 fictitious subjects. A) Without jitter. B) With random jitter.

In a scatter plot, the values of two variables are plotted for all subjects in the study: one variable on the x-axis, the other on the y-axis. FIGURE 1A shows a scatter plot of the age and cholesterol of 182 subjects. The data are fictitious, that is, simulated for educational purposes only. The scatter plot gives some insight into the relation between the two variables, which is usually tested statistically with either a correlation coefficient or a linear regression model. The estimated regression line is sometimes added to the scatter plot. This is usually done when the purpose of the study is to predict the outcome (or dependent variable, on the y-axis) from the determinant (or independent variable, on the x-axis).
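As a minimal sketch of how such a regression line could be estimated, the Python code below fits an ordinary least-squares line to simulated age and cholesterol values. The names `fit_line` and `predict`, the simulation parameters (intercept 4.0, slope 0.03, noise level) and the seed are assumptions made for illustration only; they are not the data behind FIGURE 1A.

```python
import random


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical simulation: 182 subjects, cholesterol rising slightly with age.
rng = random.Random(1)
age = [rng.uniform(20, 80) for _ in range(182)]
cholesterol = [4.0 + 0.03 * a + rng.gauss(0, 0.8) for a in age]

slope, intercept = fit_line(age, cholesterol)


def predict(age_value):
    """Predicted cholesterol (outcome) for a given age (determinant)."""
    return intercept + slope * age_value


print(f"cholesterol = {intercept:.2f} + {slope:.3f} * age")
print(f"predicted cholesterol at age 50: {predict(50):.2f}")
```

Drawing the line from `predict(min(age))` to `predict(max(age))` on top of the points would give a plot of the kind described above.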

The closer the points are to an imaginary line, the stronger the relation. The relation can be either positive or negative. If the points show an upward trend, the relation between the two variables is positive (at least if larger values mean higher/better for both variables); if the points show a downward trend, the relation is negative. If there is neither an upward nor a downward trend, there is no (linear) relation between the two variables.
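The strength and direction described above are what a correlation coefficient summarises. The sketch below computes Pearson's r by hand for three tiny, made-up sets of points; the function name `pearson_r` and all numbers are illustrative assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
patterns = {
    "upward": [3, 5, 7, 9, 11],    # points exactly on a rising line
    "downward": [9, 7, 6, 4, 1],   # falling, close to a line
    "no trend": [3, 1, 4, 1, 3],   # neither rising nor falling
}

for label, y in patterns.items():
    print(f"{label}: r = {pearson_r(x, y):.2f}")
```

Here r equals 1 for the points that lie exactly on a rising line, is about -0.99 for the downward pattern, and is 0 for the pattern without a trend.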

To investigate and/or illustrate confounding or effect modification of the association between the outcome variable and the determinant, the points in the scatter plot are sometimes marked by a third (categorical) variable (the confounder or effect modifier). FIGURE 1B illustrates the relation between cholesterol and age for the same 182 subjects, but males and females have different markers: blue dots versus red daggers, respectively. Looking only at the blue dots for the males, there seems to be hardly any relation between age and cholesterol, while the red daggers for the females do suggest a relation between the two. The blue and red regression lines confirm this effect modification by sex. If a systematic difference between the blue dots and red daggers were present, this would indicate confounding by sex. A fourth (categorical) variable could be added by letting the marker shape indicate the groups of the third variable and the colour indicate the groups of the fourth variable. But this complicates the plot, making it more difficult to interpret at a glance, so I would advise including no more than three variables in a scatter plot.
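The idea of separate regression lines per group can be sketched as follows: fit one slope for the males and one for the females and compare them. The data are simulated under assumed parameters (91 subjects per sex, a true slope of 0 for males and 0.05 for females), and the helper `slope` is a hypothetical name; none of this reproduces the data in FIGURE 1B.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Hypothetical simulation: cholesterol depends on age in females only.
rng = random.Random(7)
subjects = []
for sex, true_slope in (("male", 0.0), ("female", 0.05)):
    for _ in range(91):
        age = rng.uniform(20, 80)
        chol = 5.0 + true_slope * (age - 50) + rng.gauss(0, 0.6)
        subjects.append((sex, age, chol))

# One regression slope per group, as with the blue and red lines.
slopes = {}
for sex in ("male", "female"):
    ages = [a for s, a, c in subjects if s == sex]
    chols = [c for s, a, c in subjects if s == sex]
    slopes[sex] = slope(ages, chols)

for sex, value in slopes.items():
    print(f"{sex}: slope = {value:.3f}")
```

Clearly different slopes between the groups point towards effect modification. A formal assessment would typically add an interaction term to the regression model rather than compare the two slopes by eye.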

A scatter plot can also be used to describe the relation between three continuous variables, resulting in a three-dimensional picture. SPSS can create such a plot, but only from a fixed angle, which does not provide a complete picture of the relation between the variables.

In general, scatter plots are used to graphically display the relation between two continuous variables, but they can also be used to describe the relation between a categorical variable and a continuous outcome. FIGURE 2A shows a scatter plot of cholesterol by sex. The points now form two vertical lines, and because many points overlap, the plot is difficult to interpret. Therefore, random jitter is sometimes added to the categories of the independent variable to distinguish individual points. This is illustrated in FIGURE 2B.

Finally, is there an alternative to the scatter plot? The answer to that question is short and simple: no, there is no alternative. The relation between two continuous variables cannot be visualized with another type of plot.
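The random jitter used in FIGURE 2B can be sketched as follows: each subject's category code gets a small random horizontal offset, while the outcome on the y-axis is left untouched. The function name `add_jitter`, the 0/1 coding of sex, the group sizes and the jitter width of 0.2 are assumptions chosen for illustration.

```python
import random


def add_jitter(codes, width=0.2, seed=0):
    """Return the category codes shifted by a uniform random offset in [-width, width]."""
    rng = random.Random(seed)
    return [c + rng.uniform(-width, width) for c in codes]


# Hypothetical coding: 0 = male, 1 = female, 91 subjects each.
sex_code = [0] * 91 + [1] * 91

x_plain = sex_code                  # all points on two vertical lines
x_jittered = add_jitter(sex_code)   # points spread around 0 and around 1

print(sorted(set(x_plain)))
print(min(x_jittered), max(x_jittered))
```

Keeping the width well below half the distance between categories ensures that the two groups stay visibly separated on the x-axis.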