Results in Engineering (Dec 2024)
Knowledge distillation with resampling for imbalanced data classification: Enhancing predictive performance and explainability stability
Abstract
Machine learning classification models often struggle with imbalanced datasets, leading to poor performance on minority classes. While preprocessing approaches such as resampling can improve minority-class detection, they may introduce sampling bias and reduce model explainability. This study proposes a novel method combining random undersampling (RUS) with knowledge distillation (KD) to enhance both predictive performance and explainability stability for imbalanced data classification. Our approach employs a two-step learning process: (1) training a balanced teacher model using RUS and (2) training a student model on the original imbalanced data through response-based KD, utilizing both soft and hard targets. We hypothesize that this method mitigates class imbalance while preserving important information from the original dataset. We evaluated the proposed model against baseline and RUS-only models on five diverse imbalanced datasets from various domains. Predictive performance was assessed using stratified 10-fold cross-validation with ROC-AUC and PR-AUC scores, and explainability stability was measured by the cosine similarity of SHAP values across cross-validation folds. Results demonstrate that the proposed model consistently outperforms both baseline and RUS-only models in ROC-AUC and PR-AUC across all datasets. Moreover, it exhibits superior explainability stability in the majority of cases, addressing the sampling bias associated with traditional resampling methods. This research contributes to the field of machine learning by offering an approach that simultaneously improves predictive performance and maintains explainability stability for imbalanced data classification.
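
The two-step procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' reported implementation: the PyTorch MLP architecture, the distillation temperature T, the soft/hard mixing weight ALPHA, and the optimizer settings are all assumed for the sake of the example.

```python
# Sketch of the two-step RUS + response-based KD procedure from the abstract.
# Model choice (small PyTorch MLPs), T, ALPHA, and training settings are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F
from imblearn.under_sampling import RandomUnderSampler

T, ALPHA = 2.0, 0.5  # assumed distillation temperature and hard/soft mix


def mlp(n_features, n_classes):
    return nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                         nn.Linear(64, n_classes))


def train(model, X, y, soft_targets=None, epochs=50, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    X = torch.as_tensor(X, dtype=torch.float32)
    y = torch.as_tensor(y, dtype=torch.long)
    for _ in range(epochs):
        opt.zero_grad()
        logits = model(X)
        loss = F.cross_entropy(logits, y)  # hard-target loss (true labels)
        if soft_targets is not None:
            # Response-based KD: match the teacher's temperature-softened
            # class probabilities via KL divergence.
            kd = F.kl_div(F.log_softmax(logits / T, dim=1),
                          soft_targets, reduction="batchmean") * T * T
            loss = ALPHA * loss + (1 - ALPHA) * kd
        loss.backward()
        opt.step()
    return model


def fit_rus_kd(X_train, y_train, n_classes=2):
    # Step 1: balanced teacher trained on a random undersample of the data.
    X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)
    teacher = train(mlp(X_train.shape[1], n_classes), X_bal, y_bal)

    # Step 2: student trained on the full imbalanced data, guided by the
    # teacher's softened predictions (soft targets) plus the true labels.
    with torch.no_grad():
        soft = F.softmax(
            teacher(torch.as_tensor(X_train, dtype=torch.float32)) / T, dim=1)
    student = train(mlp(X_train.shape[1], n_classes), X_train, y_train,
                    soft_targets=soft)
    return student
```

The evaluation protocol reported in the abstract (stratified 10-fold cross-validation with ROC-AUC and PR-AUC, and cosine similarity of per-fold SHAP values as the stability measure) is omitted from this sketch and would wrap `fit_rus_kd` in an outer cross-validation loop.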