BMC Medical Research Methodology (Sep 2024)

Handling missing data and measurement error for early-onset myopia risk prediction models

  • Hongyu Lai,
  • Kaiye Gao,
  • Meiyan Li,
  • Tao Li,
  • Xiaodong Zhou,
  • Xingtao Zhou,
  • Hui Guo,
  • Bo Fu

DOI
https://doi.org/10.1186/s12874-024-02319-x
Journal volume & issue
Vol. 24, no. 1
pp. 1 – 16

Abstract

Background: Early identification of children at high risk of developing myopia is essential so that timely interventions can be introduced to prevent myopia progression. However, missing data and measurement error (ME) are common challenges in risk prediction modelling and can introduce bias into myopia prediction.

Methods: We explore four imputation methods to address missing data and ME: single imputation (SI), multiple imputation under missing at random (MI-MAR), multiple imputation with a calibration procedure (MI-ME), and multiple imputation under missing not at random (MI-MNAR). We compare four machine-learning models (Decision Tree, Naive Bayes, Random Forest, and XGBoost) and three statistical models (logistic regression, stepwise logistic regression, and least absolute shrinkage and selection operator (LASSO) logistic regression) for myopia risk prediction. We apply these models to the Shanghai Jinshan Myopia Cohort Study and also conduct a simulation study to investigate how the missingness mechanism, the degree of ME, and the importance of predictors affect model performance. Model performance is evaluated using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC).

Results: Our findings indicate that in scenarios with both missing data and ME, MI-ME combined with logistic regression yields the best prediction results. In scenarios without ME, using MI-MAR to handle missing data outperforms SI regardless of the missingness mechanism. When ME has a greater impact on prediction than missing data, the relative advantage of MI-MAR diminishes and MI-ME becomes the superior approach. Furthermore, our results show that the statistical models achieve better prediction performance than the machine-learning models.

Conclusion: MI-ME emerges as a reliable method for handling missing data and ME in important predictors for early-onset myopia risk prediction.
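The comparison of SI with multiple imputation described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses synthetic data rather than the Shanghai Jinshan cohort, scikit-learn's `SimpleImputer` and `IterativeImputer` as stand-ins for SI and MI-MAR, and average precision as the AUPRC estimate. The variable names, the coefficients, the five imputations, and the choice to pool by averaging predicted probabilities are all assumptions made for illustration. The imputation model here also ignores the outcome and measurement error, so it does not represent MI-ME or MI-MNAR.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic predictors and a binary outcome (hypothetical, not cohort data).
n = 2000
X = rng.normal(size=(n, 3))
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1] - 1.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Missing at random: missingness in predictor 0 depends on the
# fully observed predictor 1, not on the missing values themselves.
X_miss = X.copy()
p_miss = 1.0 / (1.0 + np.exp(-(X[:, 1] - 0.5)))
X_miss[rng.random(n) < p_miss, 0] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_miss, y, test_size=0.3, random_state=0, stratify=y
)


def evaluate(imputers):
    """Fit one logistic regression per imputer and average the test-set
    predicted probabilities (an assumed pooling rule for this sketch)."""
    probs = []
    for imputer in imputers:
        model = LogisticRegression().fit(imputer.fit_transform(X_tr), y_tr)
        probs.append(model.predict_proba(imputer.transform(X_te))[:, 1])
    pooled = np.mean(probs, axis=0)
    return roc_auc_score(y_te, pooled), average_precision_score(y_te, pooled)


# Single imputation: replace each missing value with the column mean.
si_auroc, si_auprc = evaluate([SimpleImputer(strategy="mean")])

# Multiple imputation: five stochastic draws from the imputation model.
mi_auroc, mi_auprc = evaluate(
    [IterativeImputer(sample_posterior=True, random_state=s) for s in range(5)]
)

print(f"SI  AUROC={si_auroc:.3f}  AUPRC={si_auprc:.3f}")
print(f"MI  AUROC={mi_auroc:.3f}  AUPRC={mi_auprc:.3f}")
```

The two printed rows give a like-for-like comparison on the same train/test split. Which method scores higher depends on the simulated missingness and the random seed, so the sketch should not be read as reproducing the paper's findings.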

Keywords