Journal of Applied Science and Engineering (Jan 2024)
A Novel Clustering-Based Three Level Under-Sampling Algorithm for Class Imbalance Problem
Abstract
Class imbalance is an important research topic because it arises in many applications where samples of one class significantly outnumber those of another. To address the binary class imbalance problem, a hybrid under-sampling approach based on k-means clustering and pseudo-oversampling is proposed. Random Over-Sampling Examples (ROSE) helps re-balance an imbalanced dataset by creating minority samples with a smoothed bootstrap method, while k-means clustering improves sample selection because each cluster contains examples with similar characteristics. This reduces the chance of eliminating useful majority-class samples. For performance evaluation, 25 publicly available imbalanced datasets were collected from the KEEL repository. The proposed method improves classification results in terms of sensitivity, specificity, G-mean, F-measure, balanced accuracy, and accuracy compared with three state-of-the-art clustering-based under-sampling methods: SBC, KMUS, and OBU. The findings are applicable to classification tasks in domains such as medical diagnosis, banking fraud detection, and anomaly detection, where the data are generally imbalanced.
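As a reading aid, the six evaluation measures named above can be computed from the counts of a binary confusion matrix. The following is a minimal sketch and not the authors' evaluation code: it assumes the minority class is treated as the positive class, that the F-measure uses equal weighting of precision and recall, and that no denominator is zero. The function name `imbalance_metrics` and the example counts are hypothetical.

```python
import math


def imbalance_metrics(tp, fn, tn, fp):
    """Compute six measures from binary confusion-matrix counts.

    Assumption of this sketch: the minority class is the positive class,
    and every denominator below is non-zero.
    """
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Geometric mean of the two class-wise rates.
        "g_mean": math.sqrt(sensitivity * specificity),
        # F-measure with equal weight on precision and recall (assumed).
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
        # Arithmetic mean of the two class-wise rates.
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical counts: 10 minority and 90 majority test samples.
scores = imbalance_metrics(tp=8, fn=2, tn=81, fp=9)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

With these hypothetical counts, accuracy (0.89) is noticeably higher than the F-measure (about 0.59), which illustrates why accuracy alone can be misleading on imbalanced data and why measures such as G-mean and balanced accuracy are reported alongside it.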
Keywords