# Solving Statistics: Overpowered multivariable linear regression?


## Question

In order to identify parameters that influence myocardial perfusion, a total of 70 patients were included in a multivariable linear regression model. A total of 19 different predictors were considered, of which 18 were measured on patient level (for example gender, age, BMI, smoking, medication, blood pressure) and 1 on coronary artery level (diameter of the coronary artery at stenosis).

The rule of thumb is that investigators may include no more than 1 variable in a multivariable regression model for every N=10 patients to have high enough power. Since all patients have three coronary arteries, a total of N=210 outcomes were measured. Nineteen is less than 21, so by the rule of thumb one might say that one has enough power. But what about the variables that do not vary within a patient, can we just ignore that? Did the authors apply the rule of thumb correctly? If not, how could they have combined the three different measurements in coronary arteries with all other variables in a correct manner?

When I read about this analysis, what struck me the most was not the correct or incorrect application of the rule of thumb. No, the design of this study required a different type of analysis. Why? Well, a multiple linear regression model assumes independence of all “observations”. This does not hold for the 210 coronary artery measurements, because the three arteries of one patient are more alike than arteries of different patients. In case of so-called multiple (or repeated) measurements within subjects, like here when all three coronary arteries are measured, you need more advanced statistical models. Mixed models are needed in this specific study, and if you ever come across repeated measures: consult a statistician beforehand. Of note, multiple or repeated measures also occur in longitudinal studies, when subjects are measured repeatedly over time.
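To make the idea of a mixed model a bit more concrete, here is a sketch of its simplest form, the random-intercept model. This is an illustration only: the notation (patient index i, artery index j, random intercept u_i, residual e_ij) is assumed here, and the exact specification in the original paper may well have been different.

```latex
% Random-intercept model: patient i = 1, ..., 70; artery j = 1, 2, 3
Y_{ij} = b_0 + b_1 X_{1,ij} + \dots + b_k X_{k,ij} + u_i + e_{ij},
\qquad u_i \sim N(0, \sigma_u^2), \quad e_{ij} \sim N(0, \sigma_e^2)
```

The term u_i is shared by the three arteries of patient i, which is how the model accounts for the dependence between measurements within the same patient. For a patient-level predictor such as age, the value X_ij is simply the same for j = 1, 2, 3.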

In order for me to understand the researchers’ aim, I looked up the original paper of this study. I was relieved to see that the investigators did in fact use a mixed model. So my main concern was taken care of and the student’s second question (at least) partly answered.

Now what about the student’s first question: did the investigators apply the rule of thumb for including 19 variables correctly? Well, the answer to that question is simple: no, they did not! The total number of “observations” of the outcome measure was indeed 210, but only 70 patients were included. And that 70 is the number we should use, allowing a maximum of only 7 independent variables in this model. How could the researchers then identify the factors that influence myocardial perfusion? They could have done this, for example, via a forward selection procedure, adding (one by one) only variables significantly related to perfusion. Now, the multiple regression model presented by the researchers also included variables that were non-significant. But do you, as researcher and as clinician, really want a model containing these non-significant predictors? To be honest, I don’t think so…
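The arithmetic behind this answer can be written out in a few lines of Python. This is only a sketch of the rule of thumb as stated above (1 variable per 10 patients); the function name `max_predictors` is made up for this illustration.

```python
def max_predictors(n_independent_units: int, units_per_predictor: int = 10) -> int:
    """Rule of thumb: at most 1 predictor per 10 independent units."""
    return n_independent_units // units_per_predictor


n_patients = 70
arteries_per_patient = 3
n_measurements = n_patients * arteries_per_patient  # 210 outcome measurements
n_candidates = 19  # predictors considered in the study

# Counting arteries treats dependent measurements as if they were independent.
by_measurements = max_predictors(n_measurements)
# Counting patients uses the number of independent units.
by_patients = max_predictors(n_patients)

print(f"Counting arteries: up to {by_measurements} predictors")
print(f"Counting patients: up to {by_patients} predictors")
print(f"Do {n_candidates} predictors fit the patient-based limit? {n_candidates <= by_patients}")
```

Counting the 210 artery measurements makes 19 predictors look acceptable, whereas counting the 70 patients shows that the model holds far more predictors than the rule of thumb allows.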

## Statistical analysis highlighted

In a simple linear regression model, we model the relationship between our outcome (or dependent) variable Y and a single predictor, or independent variable, X by a straight line: Y = b0 + b1X. Note the similarity with the mathematical formula y = ax + b you remember from high school. We can also model the relation between outcome Y and multiple different variables X1, X2, …, Xk simultaneously. The model then becomes: Y = b0 + b1X1 + b2X2 + … + bkXk. Since this model contains multiple independent variables, we refer to it as a multiple linear regression model, or a multivariable model. You also see the term multivariate regression analysis used regularly in this setting. That is, however, incorrect: multivariate models refer to models with multiple different outcome measures.
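For readers who like to see the formula at work, the sketch below estimates b0, b1 and b2 by ordinary least squares, using only the Python standard library. The data are made up and generated exactly from Y = 1 + 2·X1 + 3·X2, and the function name `fit_linear_regression` is an assumption for this illustration; in practice you would use a statistical package instead.

```python
def fit_linear_regression(X, y):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    # Put a leading 1 in every row so that b0 acts as the intercept.
    rows = [[1.0] + [float(v) for v in r] for r in X]
    p = len(rows[0])
    # Build the augmented system (A'A | A'y), where A is the matrix of rows.
    system = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    # Solve it by Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(system[r][col]))
        system[col], system[pivot] = system[pivot], system[col]
        pivot_value = system[col][col]
        system[col] = [v / pivot_value for v in system[col]]
        for r in range(p):
            if r != col:
                factor = system[r][col]
                system[r] = [v - factor * w for v, w in zip(system[r], system[col])]
    # The last column now holds the solution.
    return [system[i][p] for i in range(p)]


# Made-up data that follow Y = 1 + 2*X1 + 3*X2 exactly (no noise).
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]

b0, b1, b2 = fit_linear_regression(X, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

Because the made-up outcome contains no noise, the estimates recover the coefficients used to generate the data. Note that this sketch, like the formula itself, treats every row as an independent observation.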

### B.I. Witte

Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The Netherlands