Journal of King Saud University: Computer and Information Sciences (Sep 2022)
An adaptive outlier removal aided k-means clustering algorithm
Abstract
K-means is one of ten popular clustering algorithms. However, k-means performs poorly when outliers are present, as they commonly are in real datasets. In addition, the choice of distance metric affects clustering accuracy. Improving the clustering accuracy of k-means therefore remains an active topic in the data clustering community, from both the outlier-removal and the distance-metric perspectives. Herein, a novel modification of the k-means algorithm is proposed, based on Tukey’s rule in conjunction with a new distance metric. The standard Tukey rule is modified to remove outliers adaptively by considering whether the data are skewed to the left, skewed to the right, or distributed evenly about the mean of the input data. In the proposed modification of k-means, outliers are eliminated before the centroids are calculated, to minimize their influence. Meanwhile, a new distance metric is proposed to assign each data point to the nearest cluster. In this research, the modified k-means significantly improves clustering accuracy and centroid convergence. Moreover, the proposed distance metric outperforms most distance metrics in the literature in overall performance. The presented work demonstrates the significance of the proposed technique, which improves the overall clustering accuracy up to 80.57% on nine standard multivariate datasets.
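For orientation, the general idea of filtering outliers before the centroid update can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' method: it uses the standard (non-adaptive) Tukey fences with the conventional multiplier 1.5, and plain Euclidean distance as a stand-in for the proposed distance metric, neither of which is specified in the abstract. The function names `tukey_mask` and `kmeans_with_tukey` are hypothetical.

```python
import numpy as np

def tukey_mask(points, k=1.5):
    """Boolean mask of rows that lie inside the standard Tukey fences
    [Q1 - k*IQR, Q3 + k*IQR] on every feature (assumption: k = 1.5)."""
    q1 = np.percentile(points, 25, axis=0)
    q3 = np.percentile(points, 75, axis=0)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.all((points >= lower) & (points <= upper), axis=1)

def kmeans_with_tukey(data, init_centroids, n_iter=20):
    """K-means in which each cluster's members are filtered with the
    standard Tukey rule before the centroid is recomputed."""
    data = np.asarray(data, dtype=float)
    centroids = np.asarray(init_centroids, dtype=float).copy()
    labels = None
    for _ in range(n_iter):
        # Assignment step: Euclidean distance is a stand-in here,
        # not the distance metric proposed in the paper.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: drop outliers per cluster, then average the rest.
        for j in range(len(centroids)):
            members = data[labels == j]
            if len(members) == 0:
                continue  # keep the previous centroid for an empty cluster
            kept = members[tukey_mask(members)]
            if len(kept) > 0:
                centroids[j] = kept.mean(axis=0)
    return centroids, labels
```

In this sketch an outlier is still assigned a cluster label; it is only excluded from the mean that defines the centroid, which is what limits its influence on the centroid position.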