IEEE Access (Jan 2022)

A Calibrated Ensemble Algorithm to Address Data Heterogeneity in Machine Learning: An Application to Identify Severe SLE Flares in Lupus Patients

  • Yijun Zhao,
  • Man Qin,
  • April Jorge

DOI
https://doi.org/10.1109/ACCESS.2022.3149477
Journal volume & issue
Vol. 10
pp. 18720 – 18729

Abstract

Read online

Motivated to address the inconsistency between the essential i.i.d. assumption in machine learning theory and the data heterogeneity in real-world applications, we propose a novel calibrated ensemble (CE) algorithm to facilitate learning with diverse data subgroups. Unlike the traditional ensemble framework in which each learner is trained independently using the entire dataset, our method exploits the strengths of various machine learning models by training them simultaneously and forming model-ergonomic data subgroups as part of the training process. Consequently, each learner is calibrated to a unique subset of data based on their individualized predictive strength. Clinically, we can interpret each model as an expert specializing in treating patients with particular disease manifestations. We evaluate the CE model in our motivating domain of identifying lupus patients with severe SLE flares using 1541 clinical encounters in the Mass General Brigham (MGB) Lupus Cohort. Our experimental results demonstrate the efficacy of our CE model across seven performance evaluation metrics compared to five individual machine learning models and regular ensemble approaches. We further utilize ANOVA and Tukey HSD post-hoc statistical analysis to discover characteristic features of individual model clusters for clinical interpretations.

Keywords