An adaptive outlier removal aided k-means clustering algorithm

Nawaf H.M.M. Shrifan; Muhammad F. Akbar; Nor Ashidi Mat Isa

Journal of King Saud University: Computer and Information Sciences (Sep 2022)

Affiliations

Nawaf H.M.M. Shrifan: School of Electrical and Electronic Engineering, Engineering Campus, Universiti Sains Malaysia, 14300 Nibong Tebal, Pulau Pinang, Malaysia; Faculty of Oil and Minerals, University of Aden, Shabwah, Yemen
Muhammad F. Akbar: School of Electrical and Electronic Engineering, Engineering Campus, Universiti Sains Malaysia, 14300 Nibong Tebal, Pulau Pinang, Malaysia
Nor Ashidi Mat Isa: School of Electrical and Electronic Engineering, Engineering Campus, Universiti Sains Malaysia, 14300 Nibong Tebal, Pulau Pinang, Malaysia; Corresponding author.

Journal volume & issue: Vol. 34, no. 8
pp. 6365 – 6376

Abstract

K-means is one of ten popular clustering algorithms. However, k-means performs poorly when real datasets contain outliers. In addition, the choice of distance metric affects clustering accuracy. Improving the clustering accuracy of k-means, from both the outlier removal and the distance metric perspectives, remains an active topic among researchers in the data clustering community. Herein, a novel modification of the k-means algorithm is proposed based on Tukey’s rule in conjunction with a new distance metric. The standard Tukey rule is modified to remove outliers adaptively by considering whether the data is distributed to the left of, to the right of, or evenly around the input data's mean value. In the proposed modification of k-means, outliers are eliminated before the centroids are calculated, to minimize their influence. Meanwhile, a new distance metric is proposed to assign each data point to the nearest cluster. In this research, the modified k-means significantly improves clustering accuracy and centroid convergence. Moreover, the overall performance of the proposed distance metric exceeds that of most distance metrics in the literature. The presented work demonstrates the significance of the proposed technique, which improves the overall clustering accuracy up to 80.57% on nine standard multivariate datasets.
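The abstract does not give the adaptive rule or the new distance metric, so the sketch below only illustrates the general idea of removing outliers from each cluster before its centroid is recomputed. It assumes the standard Tukey fences (1.5 times the interquartile range, applied per feature) and Euclidean distance as stand-ins. The function names `tukey_fences`, `trimmed_centroid`, and `kmeans_with_outlier_removal` are hypothetical and are not taken from the paper.

```python
import math
import statistics


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) standard Tukey fences of a 1-D sample."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def trimmed_centroid(points, k=1.5):
    """Mean of the points that lie inside the Tukey fences on every feature."""
    if len(points) < 2:  # quantiles need at least two observations
        return tuple(points[0])
    dims = range(len(points[0]))
    fences = [tukey_fences([p[d] for p in points], k) for d in dims]
    kept = [
        p for p in points
        if all(fences[d][0] <= p[d] <= fences[d][1] for d in dims)
    ]
    kept = kept or points  # never average over an empty set
    return tuple(sum(p[d] for p in kept) / len(kept) for d in dims)


def kmeans_with_outlier_removal(points, centroids, iterations=10):
    """k-means in which each centroid is computed after outlier removal.

    Assumption: Euclidean distance replaces the paper's proposed metric.
    """
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        # Assignment step: every point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: drop outliers first, then average what remains.
        updated = [
            trimmed_centroid(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    data = [
        (0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (4, 4),  # (4, 4) is an outlier
        (10, 10), (11, 10), (10, 11), (11, 11), (10.5, 10.5),
    ]
    print(kmeans_with_outlier_removal(data, [(0, 0), (10, 10)]))
```

In this toy data the point (4, 4) is still assigned to the first cluster, but it falls outside the fences and is excluded when that cluster's centroid is averaged, so the centroid is not pulled toward it.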

Published in Journal of King Saud University: Computer and Information Sciences

ISSN: 1319-1578 (Print)
Publisher: Elsevier
Country of publisher: Saudi Arabia
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://www.journals.elsevier.com/journal-of-king-saud-university-computer-and-information-sciences/
