IEEE Access (Jan 2025)

An Improved Ensemble Method With Data Resampling for Credit Risk Prediction

  • Idowu Aruleba,
  • Yanxia Sun

DOI
https://doi.org/10.1109/ACCESS.2025.3563432
Journal volume & issue
Vol. 13
pp. 71275 – 71287

Abstract

The increasing complexity and dynamic nature of financial data make it difficult to predict credit risk accurately, a critical task in the banking and finance sector. The application of machine learning (ML) to credit risk prediction has been hindered by the imbalanced nature of credit datasets. This study proposes an improved approach to credit risk prediction that combines a stacked ensemble with a hybrid data resampling technique. The ensemble uses random forest, logistic regression, and a convolutional neural network (CNN) as base learners, with a multilayer perceptron (MLP) as the meta-learner. To address the class imbalance, the hybrid Synthetic Minority Over-sampling Technique and Edited Nearest Neighbors (SMOTE-ENN) technique was applied. The proposed approach is benchmarked against other well-performing classifiers, including random forest, logistic regression, MLP, and CNN. Integrating hybrid data resampling with the stacking ensemble significantly enhanced credit risk prediction: the proposed approach achieved sensitivity and specificity of 0.921 and 0.946 on the Australian dataset and 0.928 and 0.891 on the German dataset. On the Credit Risk Classification dataset, the stacked classifier achieved an accuracy of 0.7644 with sensitivity and specificity of 0.000 and 1.000 before data resampling; after resampling, the accuracy, sensitivity, and specificity were 0.8056, 0.7989, and 0.8125, respectively. On the credit risk analysis for extended banking loans dataset, the stacked classifier achieved accuracy, sensitivity, and specificity of 0.8429, 0.6316, and 0.9216 before data resampling, and 0.9632, 1.0000, and 0.9242 after resampling.
These results show that, after data resampling, the stacked classifier trained on the credit risk analysis for extended banking loans dataset outperformed the other models.
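
The sensitivity of 0.000 and specificity of 1.000 reported before resampling are what a classifier produces when it assigns every sample to the majority class; its accuracy then simply equals the majority-class share of the data. The following minimal sketch illustrates this with hypothetical toy labels, not the paper's data. The function name, the 76/24 class split, and the choice of 1 as the positive (minority) class are assumptions made for illustration only.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity, accuracy) for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # positive correctly flagged
            else:
                fn += 1  # positive missed
        else:
            if pred == positive:
                fp += 1  # negative wrongly flagged
            else:
                tn += 1  # negative correctly passed
    total = tp + fn + tn + fp
    # Guard against empty classes so the ratios are always defined.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return sensitivity, specificity, accuracy


# Hypothetical imbalanced sample: 76 majority (0) and 24 minority (1) cases.
y_true = [0] * 76 + [1] * 24
# A degenerate model that predicts the majority class for every sample.
y_all_negative = [0] * 100

sens, spec, acc = sensitivity_specificity(y_true, y_all_negative)
print(sens, spec, acc)
```

Under these assumptions the degenerate model scores perfect specificity and zero sensitivity, which is why accuracy alone can be misleading on imbalanced credit data and why the abstract reports all three metrics.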

Keywords