IEEE Access (Jan 2023)
A Combined Priori and Purity Gaussian OverSampling Algorithm for Imbalanced Data Classification
Abstract
Imbalanced data classification is a pervasive challenge in real-world data mining. Sampling techniques have emerged as an effective way to address it. However, the prevailing technique, SMOTE (Synthetic Minority Over-sampling Technique), and its derivatives assume that every minority class observation carries the same amount of information, neglecting both the distribution of minority class observations and their relationship with neighboring majority class observations. Consequently, the synthetic samples these methods generate deviate from the original data distribution and overlap more with the majority samples. To address this limitation, we introduce a novel sampling technique called Combined Priori and Purity Gaussian OverSampling (PPGO). The proposed method uses prior probabilities and sample purity to calculate a weight for each minority class sample. This weight determines both the number of synthetic samples generated from each minority class sample and the level of dispersion used in the Gaussian sampling process. The approach aims to restore the original distribution of the observations and to minimize overlap with majority class regions. Experiments on 32 datasets from the KEEL repository demonstrate a significant improvement in the G-mean and AUC measures for the proposed method compared to conventional approaches.
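The weighting idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and the abstract does not give the formulas, so several details here are assumptions: purity is taken to be the fraction of minority samples among the k nearest neighbors, the prior is a user-supplied array that defaults to uniform, the weight is their normalized product, and the Gaussian spread grows with the weight so that low-purity samples are sampled tightly. The names `ppgo_sketch`, `prior`, and `sigma_max` are hypothetical.

```python
import numpy as np

def ppgo_sketch(X_min, X_maj, n_new, k=5, prior=None, sigma_max=0.1, seed=0):
    """Weighted Gaussian oversampling sketch (assumed formulas, not the paper's)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    X_maj = np.asarray(X_maj, dtype=float)
    n_min = len(X_min)
    X_all = np.vstack([X_min, X_maj])
    is_min = np.arange(len(X_all)) < n_min  # True for minority rows

    # Distances from every minority sample to every sample, self excluded.
    d = np.linalg.norm(X_min[:, None, :] - X_all[None, :, :], axis=2)
    d[np.arange(n_min), np.arange(n_min)] = np.inf
    nn = np.argsort(d, axis=1)[:, :k]

    # Assumption: purity = share of minority samples among the k neighbors.
    purity = is_min[nn].mean(axis=1)

    # Assumption: the prior is supplied by the caller; uniform by default.
    if prior is None:
        prior = np.full(n_min, 1.0 / n_min)
    w = np.asarray(prior, dtype=float) * purity
    if w.sum() == 0:  # every sample is surrounded by majority neighbors
        w = np.ones(n_min)
    w = w / w.sum()

    # The weight sets how many synthetic points each sample seeds ...
    counts = rng.multinomial(n_new, w)
    # ... and (assumption) how widely they are spread around it.
    sigma = sigma_max * w / w.max()

    centers = np.repeat(X_min, counts, axis=0)
    scales = np.repeat(sigma, counts)[:, None]
    return centers + rng.normal(size=centers.shape) * scales
```

Calling `ppgo_sketch(X_min, X_maj, n_new=len(X_maj) - len(X_min))` would balance the two classes. Drawing the per-sample counts from a multinomial is one simple way to make them sum exactly to `n_new`; a deterministic rounding scheme would serve equally well.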
Keywords