Clustering algorithm for imbalanced data based on nearest neighbor

Sen WU; Yu-zhi WANG; Xiao-nan GAO

doi:10.13374/j.issn2095-9389.2019.10.09.003

工程科学学报 (Chinese Journal of Engineering), Sep 2020

Affiliations

Sen WU: School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China
Yu-zhi WANG: School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China
Xiao-nan GAO: School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China

DOI: https://doi.org/10.13374/j.issn2095-9389.2019.10.09.003
Journal volume & issue: Vol. 42, no. 9
pp. 1209 – 1219

Abstract

Clustering is an important task in data mining. Most clustering algorithms handle balanced datasets effectively, but they perform poorly on imbalanced datasets. For example, K–means, a classical partition clustering algorithm, tends to produce a “uniform effect” on imbalanced datasets: it often produces clusters of relatively uniform size, with the small clusters “swallowing” part of the data objects that belong to the large clusters. As a result, the number and density of the data objects in different clusters tend to be the same. To solve the “uniform effect” problem that the classical K–means algorithm exhibits when clustering imbalanced data, a clustering algorithm based on nearest neighbor (CABON) is proposed for imbalanced data. First, an initial clustering of the data objects is performed to obtain the undetermined-cluster set, defined as the set of data objects whose cluster membership must be checked further. Then, working from the edge to the center of this set, the nearest-neighbor method is used to reassign the data objects in the undetermined-cluster set to the clusters of their nearest neighbors. Meanwhile, the undetermined-cluster set is dynamically adjusted to obtain the final clustering result, which prevents the “uniform effect” from influencing the clustering result. The clustering results of the proposed algorithm are compared with those of K–means, the imbalanced K–means clustering method with multiple centers (MC_IK), and the coefficient of variation clustering for non-uniform data (CVCN) on synthetic and real datasets. The experimental results reveal that the CABON algorithm effectively reduces the “uniform effect” generated by the K–means algorithm on imbalanced data, and that its clustering results are superior to those of the K–means, MC_IK, and CVCN algorithms.
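
The “uniform effect” and the idea of nearest-neighbor reassignment can be illustrated with a small sketch on one-dimensional toy data. This is not the CABON procedure itself: the abstract does not state how the undetermined-cluster set is built or adjusted, so the choices below are assumptions made only for illustration. They are: treating every point of one suspect cluster as undetermined, processing the undetermined point closest to an already determined point first, and using a hypothetical `gap` threshold to decide when a point keeps its original label. The function names `kmeans_1d` and `reassign_by_nearest_neighbor` are likewise hypothetical.

```python
def kmeans_1d(xs, centers, iters=100):
    """Plain Lloyd iterations on 1-D data; returns (labels, centers)."""
    centers = list(centers)
    for _ in range(iters):
        # Assign every point to its closest center (ties go to the lower index).
        labels = [min(range(len(centers)), key=lambda k: abs(x - centers[k]))
                  for x in xs]
        new = []
        for k in range(len(centers)):
            members = [x for x, lab in zip(xs, labels) if lab == k]
            new.append(sum(members) / len(members) if members else centers[k])
        if new == centers:  # centers stopped moving
            break
        centers = new
    return labels, centers


def reassign_by_nearest_neighbor(xs, labels, suspect, gap):
    """Illustrative simplification, not the paper's algorithm.

    Assumption: all points of cluster `suspect` are undetermined.
    Assumption: a point adopts the label of its nearest determined
    neighbor only if that neighbor is closer than `gap`; otherwise it
    keeps its original label.
    """
    labels = list(labels)
    undetermined = {i for i, lab in enumerate(labels) if lab == suspect}
    determined = set(range(len(xs))) - undetermined
    while undetermined and determined:
        # Edge first: the undetermined point nearest to any determined point.
        d, i, j = min((abs(xs[i] - xs[j]), i, j)
                      for i in undetermined for j in determined)
        if d < gap:
            labels[i] = labels[j]
        # The undetermined set shrinks as each point is settled.
        undetermined.remove(i)
        determined.add(i)
    return labels


# Imbalanced toy data: 41 points in a large cluster, 5 in a small one.
large = [0.25 * i for i in range(41)]      # 0.0 ... 10.0
small = [13 + 0.25 * i for i in range(5)]  # 13.0 ... 14.0
xs = large + small

labels, centers = kmeans_1d(xs, [xs[0], xs[-1]])
print([labels.count(k) for k in (0, 1)])   # [26, 20]: sizes pulled toward uniform

fixed = reassign_by_nearest_neighbor(xs, labels, suspect=1, gap=1.0)
print([fixed.count(k) for k in (0, 1)])    # [41, 5]: true sizes recovered
```

On this toy data K–means splits the 46 points into clusters of 26 and 20, so the small cluster has absorbed 15 points of the large one. The reassignment pass returns those points to the large cluster because each is adjacent to an already settled point, while the first point beyond the real gap keeps its label and anchors the small cluster.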

Published in 工程科学学报 (Chinese Journal of Engineering)

ISSN: 2095-9389 (Print)
Publisher: Science Press
Country of publisher: China
LCC subjects: Technology: Mining engineering. Metallurgy; Technology: Engineering (General). Civil engineering (General): Environmental engineering
Website: https://cje.ustb.edu.cn/indexen.htm
