Resampling algorithm for imbalanced data based on their neighbor relationship

Rui-feng LI; Wen-hai LI; Yan-li SUN; Yang-yong WU

doi:10.13374/j.issn2095-9389.2020.04.05.002

工程科学学报 (Jun 2021)

Affiliations

Rui-feng LI: Naval Aviation University, Yantai 264001, China
Wen-hai LI: Naval Aviation University, Yantai 264001, China
Yan-li SUN: Naval Aviation University, Yantai 264001, China
Yang-yong WU: Naval Aviation University, Yantai 264001, China

DOI: https://doi.org/10.13374/j.issn2095-9389.2020.04.05.002
Journal volume & issue: Vol. 43, no. 6
pp. 862 – 869

Abstract

The classification of imbalanced data has become a significant research issue in many data-intensive applications. The minority samples in such applications usually carry important information that plays a key role in data analysis. At present, two approaches are used in machine learning and data mining to address data set imbalance: improving the classification algorithm and reconstructing the data set. Data set reconstruction, also known as resampling, modifies the proportion of each class in the training data set without modifying the classification algorithm and has been widely used. However, artificially adding or removing samples inevitably introduces noise and loses original data information, which reduces classification accuracy. Reasonable oversampling and undersampling algorithms are therefore the core of the resampling method. To improve the classification accuracy on imbalanced data sets, a resampling algorithm based on the neighbor relationship of the sample space was proposed. The method first evaluated the security level of each minority sample according to its spatial neighbor relations and oversampled the minority class through the synthetic minority oversampling technique (SMOTE) guided by these security levels. Then, the local density of the majority samples was calculated from their spatial neighbor relations, and the majority samples in sample-intensive areas were undersampled. Together, these two steps balance the data set and control its size, which prevents overfitting and equalizes classification performance across the two classes. The training and test sets were generated via 5 × 10-fold cross-validation. After the training set was resampled, a kernel extreme learning machine (KELM) was trained on it as the classifier, and the test set was used for verification. The experimental results on a UCI imbalanced data set and measured circuit fault diagnosis data show that the proposed method is superior to other resampling algorithms.
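
The two resampling steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the definitions used here are assumptions chosen for illustration. The security level is taken as the fraction of minority samples among the k nearest neighbors, seed samples are drawn with probability proportional to that level, local density is taken as the inverse mean distance to the k nearest majority neighbors, and both classes are resampled to half the original data set size. The function name `neighbor_resample` and its parameters are hypothetical.

```python
import numpy as np

def neighbor_resample(X, y, k=5, minority=1, seed=0):
    """Illustrative neighbor-based resampling (assumed definitions, see text)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    # Assumption: each class ends at half the original size, so the total
    # data size stays (almost) unchanged.
    target = (len(min_idx) + len(maj_idx)) // 2

    # Pairwise Euclidean distances; a sample is never its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)

    # Step 1: security level of each minority sample, taken here as the
    # fraction of minority samples among its k nearest neighbors.
    nn_all = np.argsort(dist[min_idx], axis=1)[:, :k]
    safety = (y[nn_all] == minority).mean(axis=1)
    if safety.sum() > 0:
        p = safety / safety.sum()          # level 0 (likely noise) is never a seed
    else:
        p = np.full(len(min_idx), 1.0 / len(min_idx))

    # SMOTE-style interpolation between a seed and one of its minority neighbors.
    d_min = dist[np.ix_(min_idx, min_idx)]
    nn_min = np.argsort(d_min, axis=1)[:, :min(k, len(min_idx) - 1)]
    n_new = target - len(min_idx)
    seed_pos = rng.choice(len(min_idx), size=n_new, p=p)
    col = rng.integers(0, nn_min.shape[1], size=n_new)
    nb_pos = nn_min[seed_pos, col]
    gap = rng.random((n_new, 1))
    base = X[min_idx[seed_pos]]
    synth = base + gap * (X[min_idx[nb_pos]] - base)

    # Step 2: local density of each majority sample, taken here as the inverse
    # mean distance to its k nearest majority neighbors. Samples are removed
    # at random, with probability proportional to density.
    d_maj = np.sort(dist[np.ix_(maj_idx, maj_idx)], axis=1)
    d_maj = d_maj[:, :min(k, len(maj_idx) - 1)]
    density = 1.0 / (d_maj.mean(axis=1) + 1e-12)
    n_remove = len(maj_idx) - target
    drop = rng.choice(len(maj_idx), size=n_remove, replace=False,
                      p=density / density.sum())
    keep = np.delete(maj_idx, drop)

    X_res = np.vstack([X[keep], X[min_idx], synth])
    y_res = np.concatenate([y[keep], y[min_idx], np.full(n_new, minority)])
    return X_res, y_res
```

The sketch assumes a two-class problem in which the minority class has at least two samples and the majority class is the larger one. In an evaluation such as the one described, it would be applied only to the training portion of each cross-validation fold, so that the test fold keeps its original class distribution.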

Published in 工程科学学报

ISSN: 2095-9389 (Print)
Publisher: Science Press
Country of publisher: China
LCC subjects: Technology: Mining engineering. Metallurgy; Technology: Engineering (General). Civil engineering (General): Environmental engineering
Website: https://cje.ustb.edu.cn/indexen.htm
