No statistical question this time. Partly because I wasn’t asked a question, but also because of a recent Science Cafe organised by the Young Statisticians of the Netherlands Society for Statistics and Operations Research. The subject of the evening was ‘To p or not to p?’, referring to the p-value that (medical) researchers love to report: widely reported, but also often misinterpreted and even mistrusted. Unfortunately I was unable to attend; it would have been an interesting evening. A discussion with medical students at the VUmc during a statistics course made me wonder: “Shouldn’t we actually start with explaining basic statistical terminology before we start performing analyses?” Therefore, this ‘statistical problem’ is about p-values (and a little bit about confidence intervals).
The p-value comes into play when testing a null hypothesis against an alternative hypothesis and is formally defined as the probability of the observed result or even more extreme results, if the null hypothesis were true. This observed result could for example be the mean difference when comparing a continuous outcome between two groups or the odds ratio of one group compared to another group for a dichotomous outcome. So we’ve defined the p-value, but what does it tell us? And more importantly what does it not tell us?
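To make the definition concrete, here is a minimal sketch of one way to obtain a p-value for a mean difference between two small groups: an exact permutation test. This is only an illustration, not the test you would necessarily use in practice; the function name `permutation_p_value` and the measurements are made up for this example. Under the null hypothesis the group labels are interchangeable, so the p-value is the share of all possible relabelings that give a mean difference at least as extreme as the observed one.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means.

    Hypothetical helper for illustration; only feasible for small groups,
    because every possible split of the pooled data is enumerated.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    as_extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        n_splits += 1
        # Small tolerance so that ties are not lost to rounding error.
        if diff >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / n_splits


# Made-up measurements for two groups of five patients.
treated = [5.1, 6.3, 5.8, 7.0, 6.1]
control = [4.9, 5.2, 5.5, 4.4, 5.0]
print(f"p-value: {permutation_p_value(treated, control):.3f}")
```

Note that the code never asks whether the null hypothesis is true; it simply assumes it and counts how unusual the observed difference would then be.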
The first misconception about the p-value is that it equals the probability that the null hypothesis is true. This is not the case: the p-value is calculated under the assumption that the null hypothesis is true, hence it cannot be the probability that it is true.
Based on the p-value, the null hypothesis is either rejected or not at a certain significance level α, usually 5%. Rejecting the null hypothesis means that the alternative hypothesis is accepted, whereas not rejecting the null hypothesis does not mean it is accepted. As in court, statistics works under the assumption ‘innocent until proven guilty’. But a lack of evidence doesn’t mean a suspect really is innocent; for now there is just not enough evidence that he is guilty! In statistics, the p-value is a measure of the evidence against the null hypothesis, and a non-significant p-value only implies that there is not enough evidence to reject the null hypothesis, at least based on the sample under consideration. On the other hand, there could be enough evidence to reject the null hypothesis (and accept the alternative hypothesis) even though the null hypothesis is in fact true. If the null hypothesis is true, the probability that it is falsely rejected equals the significance level α.
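The link between α and false rejections can be checked with a small simulation. The sketch below is an illustration under simplifying assumptions that are mine: normally distributed outcomes, a known standard deviation (so a simple z-test applies), and made-up names and settings. Both groups are drawn from the same population, so the null hypothesis is true in every simulated study, and roughly 5% of the studies should still reject it.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(group_a, group_b, sigma):
    """Two-sided p-value for a mean difference, assuming a known SD sigma."""
    se = sigma * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_rejection_rate(n_studies=4000, n_per_group=20, alpha=0.05, seed=1):
    """Share of simulated studies that reject a null hypothesis that is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        # Both groups come from the same distribution: no true effect.
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if z_test_p_value(a, b, sigma=1.0) < alpha:
            rejections += 1
    return rejections / n_studies


rate = false_rejection_rate()
print(f"share of false rejections: {rate:.3f}")
```

The simulated share fluctuates around α from run to run; it is a long-run frequency, not a statement about any single study.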
The standard significance level of 5% is, however, quite arbitrary. Imagine two different samples obtained under exactly the same conditions, where a p-value of 0.049 was observed for the first sample and one of 0.051 for the second. Based on the first sample the null hypothesis would be rejected; based on the second it would not. If the two samples were of equal size, this difference can only be explained by chance. Repeating the study on a new sample over and over will give you some idea whether you can reject the null hypothesis or not, but you never know for sure. So when you come across a p-value close to the significance level, be cautious with your conclusion: it’s not as black and white as it might look!
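How much p-values can vary between samples drawn under identical conditions is easy to see in a simulation. The sketch below again assumes normal outcomes with a known standard deviation and uses invented settings (a true difference of half a standard deviation, 30 subjects per group); none of these numbers come from a real study.

```python
import random
from statistics import NormalDist, mean


def simulate_p_values(n_studies=1000, n_per_group=30, true_diff=0.5,
                      sigma=1.0, seed=7):
    """p-values of repeated studies that all share the same true effect."""
    rng = random.Random(seed)
    se = sigma * (2 / n_per_group) ** 0.5
    p_values = []
    for _ in range(n_studies):
        a = [rng.gauss(true_diff, sigma) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        z = (mean(a) - mean(b)) / se
        p_values.append(2 * (1 - NormalDist().cdf(abs(z))))
    return p_values


p_values = simulate_p_values()
significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"smallest p: {min(p_values):.5f}, largest p: {max(p_values):.3f}")
print(f"share of studies with p < 0.05: {significant:.2f}")
```

With these settings roughly half of the studies end up on either side of the 5% threshold, although the underlying effect is exactly the same in all of them.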
Furthermore, p-values do not give any information about the size of the effect: two different samples may yield the same effect but not the same p-value (for example because the sample sizes differ), or vice versa. Moreover, a statistically significant effect is not necessarily a clinically relevant effect. In general, p-values are therefore reported together with estimated effect sizes and corresponding confidence intervals. Hypotheses can also be rejected solely based on confidence intervals: if the null effect is not contained in the 95% confidence interval, the null hypothesis can be rejected at the 5% significance level. Why then do we report both the p-value and the confidence interval in a research paper? Well, the confidence interval gives us an idea about the magnitude of the effect, but not about the strength of the evidence. Of course it holds that the narrower the interval, the more certain we are about the estimated effect. But what is narrow? In any case, a p-value of 0.045 is much larger than one of 0.004…
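Both points, the same effect with different p-values and the agreement between the confidence interval and the test, can be illustrated with a few lines of code. This is a sketch under the assumption of a known standard deviation (a z-test); the function name `z_summary` and all numbers are hypothetical.

```python
from statistics import NormalDist


def z_summary(diff, sigma, n_per_group, alpha=0.05):
    """p-value and (1 - alpha) confidence interval for a mean difference.

    Assumes two groups of equal size and a known SD sigma.
    """
    se = sigma * (2 / n_per_group) ** 0.5
    z = diff / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    return p, (diff - crit * se, diff + crit * se)


# Same observed mean difference, two different sample sizes.
small = z_summary(diff=2.0, sigma=5.0, n_per_group=20)
large = z_summary(diff=2.0, sigma=5.0, n_per_group=200)

for label, (p, (lo, hi)) in (("n = 20 per group", small),
                             ("n = 200 per group", large)):
    excludes_null = lo > 0 or hi < 0
    print(f"{label}: p = {p:.5f}, 95% CI = ({lo:.2f}, {hi:.2f}), "
          f"CI excludes 0: {excludes_null}")
```

The estimated effect is identical in both cases; only the precision, and with it the p-value, changes. In both cases the interval excludes the null effect of 0 exactly when the p-value is below 5%.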
Unfortunately, confidence intervals are also misinterpreted quite often. The definition of a 95% confidence interval is given in the ‘Statistical terminology highlighted’-box. Note that it does not mean that there is a 95% probability that a particular computed interval covers the true (i.e. population) effect, nor that 95% of the sample data lie within the interval. It may however be understood as an interval estimate of plausible values for the true effect.
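The repeated-sampling meaning of a 95% confidence interval can also be checked by simulation. The sketch below assumes normally distributed data with a known standard deviation and uses an invented population (true mean 10, standard deviation 2); the function name `coverage` is mine. Each simulated study computes its own interval, and we count how often that interval contains the true mean.

```python
import random
from statistics import NormalDist, mean


def coverage(n_studies=4000, n=25, true_mean=10.0, sigma=2.0,
             level=0.95, seed=3):
    """Share of simulated confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = crit * sigma / n ** 0.5
    hits = 0
    for _ in range(n_studies):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        # The interval is sample mean +/- half_width; check whether it
        # contains the true mean.
        if abs(mean(sample) - true_mean) <= half_width:
            hits += 1
    return hits / n_studies


cov = coverage()
print(f"share of intervals containing the true mean: {cov:.3f}")
```

The 95% describes the procedure over many repetitions. Any single interval either contains the true value or it does not, and in a real study we cannot tell which.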
In conclusion, p-values (and to a lesser extent confidence intervals) are complex, sometimes counterintuitive measures. But as long as journals insist on reporting them, and other methods such as Bayesian methods are also mistrusted, they will persist in (clinical) research. Hopefully, this ‘statistical problem’ can contribute to their understanding.
Statistical terminology highlighted
p-value: the probability of the observed effect or even more extreme effects if the null hypothesis were true.
95% confidence interval: if the study were repeated many times and an interval computed each time, 95% of those intervals would contain the true value of the population effect/parameter.
Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The Netherlands