A hypothesis test is a method of statistical inference on sets of random variables, such as Hand Eczema Severity Index (HECSI) scores obtained from patients with hand eczema following treatment with the hand creams “handy help” or “silky smooth”. If the HECSI values are normally distributed, a researcher can use Student’s t-test to compare the mean HECSI scores of the two groups. The null hypothesis is that the mean scores are equal in the two groups; the alternative hypothesis is that they are not.
The researcher may be particularly interested in whether the p-value of the hypothesis test is less than 0.05. A p-value below 0.05 means that, if there were no true difference in the severity of hand eczema following treatment with “handy help” or “silky smooth”, the probability of observing a difference between the mean HECSI scores at least as large as the one found would be less than 0.05. If the researcher finds a p-value below 0.05, he or she will probably declare that there is a significant difference between the mean HECSI scores following treatment with “handy help” and “silky smooth”.
However, imagine that the researcher has data on not one, but six measures of hand eczema severity. Then the probability of observing at least one significant difference between the two groups rises substantially, even if the underlying distributions of HECSI scores are identical. Statisticians have developed methods, including the Bonferroni correction, to correct for this.
I am analyzing a randomized controlled trial for two new treatments for hand eczema, “handy help” and “silky smooth”. I have six primary endpoints: 1) the Hand Eczema Severity Index; 2) the Physician Global Assessment; 3) the Dermatology Life Quality Index; 4) the Photographic Guide; 5) the Osnabrueck Hand Eczema Severity Index; and 6) the Investigators’ Global Assessment. I have heard that I need to correct for multiple testing. What does this mean and how should I do this?
Imagine that you have a box of coins, each of which has probability 0.5 of landing on heads if tossed. If you toss one coin 10 times, the probability that the coin will land on heads all 10 times is 0.5¹⁰, which is approximately one in a thousand. However, if you toss 100 coins 10 times each, the probability that at least one coin will land on heads 10 times is 1 − (1 − 0.5¹⁰)¹⁰⁰ ≈ 0.0931, which is nearly one in ten. Hence, if you carry out an experiment multiple times, or use multiple measures in a single experiment, a rare outcome becomes much more common.
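The coin-tossing arithmetic above can be sketched in a few lines of Python (the variable names are ours, chosen for illustration):

```python
# Probability that a single fair coin lands on heads in all 10 tosses.
p_one_coin = 0.5 ** 10  # approximately 0.001, about one in a thousand

# Probability that at least one of 100 such coins lands on heads 10 times:
# the complement of "no coin manages it".
p_at_least_one = 1 - (1 - 0.5 ** 10) ** 100  # approximately 0.093

print(f"one coin:     {p_one_coin:.6f}")
print(f"at least one: {p_at_least_one:.4f}")
```

The second calculation uses the complement rule: it is far easier to compute the probability that no coin shows 10 heads and subtract it from 1 than to enumerate all the ways at least one coin could.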
The same effect can be seen in multiple testing in clinical research. If you have six measures of hand eczema severity as primary endpoints in your trial, the probability that at least one difference between the “handy help” and “silky smooth” groups will be significant, even if there is really no difference between the groups, will be 1 − (1 − 0.05)⁶ ≈ 0.2649. This is more than one in four and substantially more than the nominal value of 0.05.
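This family-wise error rate for independent tests can be written as a small helper function; a minimal sketch, assuming the six endpoints are tested independently at the same level:

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false-positive result among
    n_tests independent tests, each performed at level alpha."""
    return 1 - (1 - alpha) ** n_tests

# Six independent endpoints, each tested at alpha = 0.05:
print(round(family_wise_error_rate(0.05, 6), 4))  # → 0.2649
```

In a real trial the endpoints are usually correlated, so this figure is an upper bound rather than the exact error rate.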
One way of correcting for multiple testing is the Bonferroni correction. When performing a Bonferroni correction, a researcher divides the critical value, or α, by the number of tests, n, that he or she wishes to perform. This division results in a new cut-off for significance, α/n. In our example the researcher wishes to examine differences on six measures of hand eczema. Hence, following the Bonferroni correction, the critical value will be 0.05/6 ≈ 0.0083. We can use the formula above to evaluate the probability of observing a significant difference even if there is no underlying difference between the two groups of patients. Here the probability of finding at least one significant result is 1 − (1 − 0.05/6)⁶ ≈ 0.0490, which is slightly less than the nominal value of 0.05. This indicates that the Bonferroni correction is conservative: it corrects more strictly than is actually necessary, and it becomes more conservative as the number of tests on the same data increases.
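The Bonferroni calculation above is simple enough to verify directly; a minimal sketch, with the function name our own:

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance cut-off after a Bonferroni correction."""
    return alpha / n_tests

alpha, n = 0.05, 6
cutoff = bonferroni_threshold(alpha, n)  # 0.05 / 6, approximately 0.0083

# Resulting probability of at least one false-positive across all six tests,
# assuming independence: just under the nominal 0.05.
fwer = 1 - (1 - cutoff) ** n
print(f"per-test cut-off: {cutoff:.4f}, family-wise error rate: {fwer:.4f}")
```

Each individual p-value is now compared against `cutoff` rather than 0.05, which is what keeps the overall error rate at or below the nominal level.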
There are also other methods of dealing with multiple endpoints. Three options are: 1) defining a single outcome, such as the HECSI, as the primary endpoint, eliminating the need to correct for multiple testing; 2) combining the outcomes into a single aggregated endpoint, such as the average of the six hand eczema severity scales for each patient; or 3) using statistical methods specifically developed for testing multiple endpoints, such as multivariate analysis of variance (MANOVA), Hotelling’s T² test, or global or exact test statistics. The problem of multiple testing can also arise when researchers wish to compare more than two groups, perform subgroup or interim analyses, or have repeated measurements on individual patients, but these situations are beyond the scope of this article.
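Option 2 above, an aggregated endpoint, amounts to a single number per patient; a minimal sketch with made-up scores (in practice the six scales would first need to be put on a comparable footing, e.g. by standardization):

```python
# Hypothetical scores for one patient on the six severity scales,
# assumed here to already be on comparable 0-100 ranges.
patient_scores = [42.0, 37.5, 51.0, 44.5, 39.0, 47.0]

# The aggregated endpoint is simply the patient's mean score;
# only this single value is then tested between treatment groups.
aggregate = sum(patient_scores) / len(patient_scores)
print(f"aggregated endpoint: {aggregate:.2f}")  # → 43.50
```

Because only one endpoint is tested, no multiplicity correction is needed, at the cost of losing the ability to say which individual scale drove any difference.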
I. Vogel¹ & R. Holman²
¹ Medical student, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands
² Clinical Research Unit, Academic Medical Center, Amsterdam, The Netherlands
- Duman N, Uzunali E. Clinical assessment of the severity of chronic hand eczema: correlations between six assessment methods. Eur Res J 2015;1(2):44-49.
- Dmitrienko A, D’Agostino R. Traditional multiplicity adjustment methods in clinical trials. Stat Med 2013;32:5172-5218.
- Bender R, Lange S. Adjusting for multiple testing: when and how? J Clin Epidemiol 2001;54:323-349.