Improved Oversampling Algorithm for Imbalanced Data Based on K-Nearest Neighbor and Interpolation Process Optimization

Yiheng Chen; Jinbai Zou; Lihai Liu; Chuanbo Hu

doi:10.3390/sym16030273

Symmetry (Feb 2024)

Affiliations

Yiheng Chen: School of Railway Transportation, Shanghai Institute of Technology, Shanghai 201418, China
Jinbai Zou: School of Railway Transportation, Shanghai Institute of Technology, Shanghai 201418, China
Lihai Liu: China Railway Siyuan Survey and Design Group Co., Ltd., Wuhan 430063, China
Chuanbo Hu: School of Railway Transportation, Shanghai Institute of Technology, Shanghai 201418, China

Journal volume & issue: Vol. 16, no. 3, p. 273

Abstract

Class imbalance in datasets is generally regarded as an asymmetry problem: artificial intelligence models may exhibit different biases or preferences when dealing with different classes. In class-imbalanced learning, a classification model tends to pay too much attention to the majority class samples and cannot guarantee classification performance on the minority class samples, which may be the more valuable ones. Synthesizing minority class samples changes the data distribution and can thereby rebalance the dataset. Traditional oversampling algorithms, however, suffer from blindness and boundary ambiguity when synthesizing new samples. This paper proposes a modified reclassification oversampling algorithm based on the Gaussian distribution. First, the minority class samples are reclassified using the KNN algorithm. Then, a synthesis strategy is selected according to the combination of the minority class samples, and under certain classification conditions the Gaussian distribution replaces the uniform random distribution in the interpolation step to reduce the likelihood of generating noise samples. The experimental results indicate that the proposed oversampling algorithm improves evaluation metrics, including G-mean, F-measure, and AUC, by 2% to 8% compared with traditional oversampling algorithms.
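The two steps described in the abstract (KNN-based reclassification of minority samples, then interpolation with a Gaussian rather than uniform weight) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the category thresholds, the conditions under which the Gaussian weight applies, or its parameters. The category names ("noise", "boundary", "safe"), the half-of-k threshold, the rule that a pair containing a boundary sample uses the Gaussian weight, and the values `mu=0.5`, `sigma=0.15` are all assumptions made here for illustration, loosely following Borderline-SMOTE conventions.

```python
import math
import random


def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i] (Euclidean), excluding i."""
    others = (j for j in range(len(points)) if j != i)
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def categorize_minority(X, y, minority=1, k=5):
    """Tag each minority sample by the number of majority samples among its k neighbours.

    Assumed thresholds (not from the paper): all k majority -> "noise",
    at least half majority -> "boundary", otherwise "safe".
    """
    cats = {}
    for i, label in enumerate(y):
        if label != minority:
            continue
        n_major = sum(1 for j in knn_indices(X, i, k) if y[j] != minority)
        if n_major == k:
            cats[i] = "noise"
        elif n_major >= k / 2:
            cats[i] = "boundary"
        else:
            cats[i] = "safe"
    return cats


def interpolation_weight(rng, gaussian, mu=0.5, sigma=0.15):
    """Weight in [0, 1]: uniform, or Gaussian clipped to [0, 1] (mu, sigma assumed)."""
    if not gaussian:
        return rng.random()
    return min(1.0, max(0.0, rng.gauss(mu, sigma)))


def synthesize(X, y, n_new, minority=1, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating between neighbours."""
    rng = random.Random(seed)
    cats = categorize_minority(X, y, minority, k)
    seeds = [i for i, c in cats.items() if c != "noise"]  # noise samples are skipped
    if len(seeds) < 2:
        return []
    pts = [X[i] for i in seeds]
    new_points = []
    for _ in range(n_new):
        a = rng.randrange(len(pts))
        b = rng.choice(knn_indices(pts, a, min(k, len(pts) - 1)))
        # Assumed rule: pairs involving a boundary sample use the Gaussian weight.
        gaussian = "boundary" in (cats[seeds[a]], cats[seeds[b]])
        w = interpolation_weight(rng, gaussian)
        new_points.append([p + w * (q - p) for p, q in zip(pts[a], pts[b])])
    return new_points
```

Calling `synthesize(X, y, n_new=10, k=3)` on a list of feature vectors `X` and labels `y` returns ten synthetic minority points. A Gaussian weight centred at 0.5 concentrates new points near the midpoint of each pair, whereas a uniform weight spreads them along the whole segment; that concentration is one plausible reading of how the Gaussian variant reduces noisy samples near class boundaries.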

Published in Symmetry

ISSN: 2073-8994 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Mathematics
Website: http://www.mdpi.com/journal/symmetry/
