Journal of Clinical and Diagnostic Research (Jul 2022)

Application of Principal Component Analysis in Dealing with Multicollinearity in Modelling Clinical Data

  • Akash Mishra,
  • N Sreekumaran Nair,
  • KT Harichandrakumar,
  • VS Binu,
  • Santhosh Satheesh

DOI
https://doi.org/10.7860/JCDR/2022/55379.16629
Journal volume & issue
Vol. 16, no. 7
pp. YC15 – YC19

Abstract

Introduction: One of the stringent assumptions about covariates in Cox regression and logistic regression modelling is that they should be independent. Incorporating correlated covariates as such into the model may distort the precision of the estimates because of multicollinearity. One way to deal with multicollinearity is Principal Component Analysis (PCA). Aim: To demonstrate the application of PCA in dealing with correlated covariates while modelling time-to-event and case-control study data. Materials and Methods: This study was conducted at Jawaharlal Institute of Postgraduate Medical Education and Research, Puducherry, India, from February 2021 to January 2022. Two datasets in which lipids were the correlated covariates were used for the demonstration: one with a time-to-event outcome and one from a case-control study with a binary outcome. Three sets of Cox regression models were used to demonstrate the change in hazard ratios, with 95% Confidence Intervals (CI), when evaluating the effect of the intervention at different times of lipid measurement. Model I evaluated the treatment/Body Mass Index (BMI) effect on the outcome while ignoring the lipid parameters. Model II evaluated the treatment/BMI effect on the outcome by incorporating the lipid variables but ignoring multicollinearity. Model III evaluated the treatment/BMI effect on the outcome by incorporating the lipid variables through principal component analysis, thus adjusting for multicollinearity. Similarly, logistic regression was fitted under the same three model specifications to evaluate the effect of the exposure (BMI). The comparability of lipids between the two groups in each dataset was tested using Hotelling’s T-squared statistic. Results: The lipids measured at the 12th, 24th and 36th months differed significantly between the two groups in the first dataset, as did the lipids between cases and controls in the second dataset. In the first dataset, at baseline, the Hazard Ratios (HRs) were statistically similar irrespective of the model used, whereas for the lipids measured at the 12th, 24th and 36th months the HRs decreased successively, with narrowing 95% CIs, from model I to model III. Further, at the 24th and 36th months, the HR in model III was found to be significant. In the second dataset, the Odds Ratios (ORs) were significant for all three models; they were almost similar for models I and II but elevated in model III. Conclusion: Multicollinearity should be properly addressed before correlated covariates are included in Cox regression and logistic regression models. PCA would be a favourable method for this.
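
The core idea behind Model III, replacing correlated lipid covariates with principal component scores, can be illustrated with a minimal Python sketch. It is not the authors' code: the data are simulated, the four lipid columns and their noise levels are hypothetical stand-ins, and the 80% explained-variance cut-off is an assumption chosen only for illustration. The sketch compares variance inflation factors (VIFs) of the raw lipids with those of the component scores.

```python
import numpy as np

# Simulated (hypothetical) lipid panel: four measurements driven by one shared
# latent factor, so the columns are strongly correlated with each other.
rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=n)
lipids = np.column_stack([
    latent + 0.3 * rng.normal(size=n),    # stand-in for total cholesterol
    latent + 0.3 * rng.normal(size=n),    # stand-in for LDL
    -latent + 0.5 * rng.normal(size=n),   # stand-in for HDL
    latent + 0.6 * rng.normal(size=n),    # stand-in for triglycerides
])


def vif(X):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))


# Standardise the lipids, then eigendecompose their correlation matrix.
Z = (lipids - lipids.mean(axis=0)) / lipids.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]          # sort components by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                       # principal component scores
explained = eigvals / eigvals.sum()        # share of variance per component

# Keep the fewest components whose cumulative share reaches 80%
# (an arbitrary threshold assumed for this sketch).
k = int(np.searchsorted(np.cumsum(explained), 0.80) + 1)
pc_covariates = scores[:, :k]

print("VIF, raw lipids:         ", vif(lipids).round(2))
print("VIF, component scores:   ", vif(scores).round(2))
print("Explained variance share:", explained.round(3), "| components kept:", k)
```

Because the component scores are mutually uncorrelated by construction, their VIFs equal 1, whereas the raw lipids show inflated values. In a Model III style analysis, `pc_covariates` would enter the Cox or logistic regression in place of the raw lipid variables, alongside the treatment or BMI term.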

Keywords