Scientific Reports (Sep 2024)

Using random forest and biomarkers for differentiating COVID-19 and Mycoplasma pneumoniae infections

  • Xun Zhou,
  • Jie Zhang,
  • Xiu-Mei Deng,
  • Fang-Mei Fu,
  • Juan-Min Wang,
  • Zhong-Yuan Zhang,
  • Xian-Qiang Zhang,
  • Yue-Xing Luo,
  • Shi-Yan Zhang

DOI
https://doi.org/10.1038/s41598-024-74057-5
Journal volume & issue
Vol. 14, no. 1
pp. 1 – 14

Abstract

The COVID-19 pandemic has underscored the critical need for precise diagnostic methods to distinguish between similar respiratory infections, such as COVID-19 and Mycoplasma pneumoniae (MP). Identifying key biomarkers and utilizing machine learning techniques, such as random forest analysis, can significantly improve diagnostic accuracy. We conducted a retrospective analysis of clinical and laboratory data from 214 patients with acute respiratory infections, collected between October 2022 and October 2023 at the Second Hospital of Nanping. The study population was categorized into three groups: COVID-19 positive (n = 52), MP positive (n = 140), and co-infected (n = 22). Key biomarkers, including C-reactive protein (CRP), procalcitonin (PCT), interleukin-6 (IL-6), and white blood cell (WBC) counts, were evaluated. Correlation analyses were conducted to assess relationships between biomarkers within each group. Random forest analysis was applied to evaluate the discriminative power of these biomarkers. The random forest model demonstrated high classification performance, with area under the ROC curve (AUC) scores of 0.86 (95% CI: 0.70–0.97) for COVID-19, 0.79 (95% CI: 0.64–0.92) for MP, 0.69 (95% CI: 0.50–0.87) for co-infections, and 0.90 (95% CI: 0.83–0.95) for the micro-average ROC. Additionally, the precision-recall curve for the random forest classifier showed a micro-average AUC of 0.80 (95% CI: 0.69–0.91). Confusion matrices highlighted the model’s accuracy (0.77) and biomarker relationships. The SHAP feature importance analysis indicated that age (0.27), CRP (0.25), IL-6 (0.14), and PCT (0.14) were the most significant predictors. The integration of computational methods, particularly random forest analysis, in evaluating clinical and biomarker data presents a promising approach for enhancing diagnostic processes for infectious diseases.
Our findings support the use of specific biomarkers in differentiating between COVID-19 and MP, potentially leading to more targeted and effective diagnostic strategies. This study underscores the potential of machine learning techniques in improving disease classification in the era of precision medicine.
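
To illustrate how per-class and micro-average AUC figures of the kind reported above are defined, the following minimal Python sketch computes them from class-probability outputs. It is not the authors' code. The probabilities are made-up values standing in for a classifier's output, the function names are hypothetical, and the confidence intervals reported in the abstract are not reproduced because the abstract does not say how they were obtained.

```python
from typing import Dict, Sequence

# Hypothetical class labels mirroring the three study groups.
CLASSES = ["COVID-19", "MP", "co-infection"]


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random
    negative, with ties counted as one half (Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true: Sequence[str],
                    proba: Sequence[Sequence[float]],
                    classes: Sequence[str]) -> Dict[str, float]:
    """Per-class AUC: each class in turn is the positive class."""
    result = {}
    for k, name in enumerate(classes):
        labels = [1 if y == name else 0 for y in y_true]
        scores = [row[k] for row in proba]
        result[name] = binary_auc(labels, scores)
    return result


def micro_average_auc(y_true: Sequence[str],
                      proba: Sequence[Sequence[float]],
                      classes: Sequence[str]) -> float:
    """Micro-average AUC: pool every (sample, class) pair into a single
    binary problem before ranking."""
    labels = [1 if y == name else 0 for y in y_true for name in classes]
    scores = [p for row in proba for p in row]
    return binary_auc(labels, scores)


# Made-up true labels and predicted probabilities (columns follow CLASSES).
y_true = ["COVID-19", "MP", "MP", "co-infection", "COVID-19", "MP"]
proba = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.8, 0.1],
    [0.3, 0.5, 0.2],
    [0.5, 0.3, 0.2],
    [0.4, 0.5, 0.1],
]

per_class = one_vs_rest_auc(y_true, proba, CLASSES)
micro = micro_average_auc(y_true, proba, CLASSES)
print(per_class)
print(micro)
```

Because the micro-average pools all sample-class pairs, it is weighted toward the larger classes, which is one reason a micro-average AUC can exceed every per-class AUC when class sizes are imbalanced.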

Keywords