Determining the optimal number of clusters by Enhanced Gap Statistic in K-mean algorithm

Iliyas Karim Khan; Hanita Binti Daud; Nooraini Binti Zainuddin; Rajalingam Sokkalingam; Muhammad Farooq; Muzammil Elahi Baig; Gohar Ayub; Mudasar Zafar

Egyptian Informatics Journal (Sep 2024)

Affiliations

Iliyas Karim Khan: Fundamental and Applied Science Department, Universiti Teknologi PETRONAS, Perak 32610, Malaysia; Corresponding author.
Hanita Binti Daud: Fundamental and Applied Science Department, Universiti Teknologi PETRONAS, Perak 32610, Malaysia
Nooraini Binti Zainuddin: Fundamental and Applied Science Department, Universiti Teknologi PETRONAS, Perak 32610, Malaysia
Rajalingam Sokkalingam: Fundamental and Applied Science Department, Universiti Teknologi PETRONAS, Perak 32610, Malaysia
Muhammad Farooq: Department of Statistics, University of Peshawar Khyber Pakhtunkhwa, Pakistan
Muzammil Elahi Baig: Department of Statistics, University of Peshawar Khyber Pakhtunkhwa, Pakistan
Gohar Ayub: Department of Statistics, University of Peshawar Khyber Pakhtunkhwa, Pakistan
Mudasar Zafar: Fundamental and Applied Science Department, Universiti Teknologi PETRONAS, Perak 32610, Malaysia

Journal volume & issue: Vol. 27, p. 100504

Abstract

Unsupervised learning, particularly K-means clustering, seeks to partition data into clusters with distinct intra-class cohesion and inter-class disparity. However, the arbitrary selection of the number of clusters in K-means introduces challenges, leading to trial and error in determining the Optimal Number of Clusters (ONC). To address this, various methodologies have been devised, among which the Gap Statistic is prominent. The Gap Statistic's reliance on expected values for reference data selection poses limitations, especially in scenarios involving diverse scale, noise, and overlapping data. To tackle these challenges, this study introduces the Enhanced Gap Statistic (EGS), which standardizes reference data using an exponential distribution within the Gap Statistic framework and integrates an adjustment factor for a more dependable estimation of the ONC. Application of EGS to K-means clustering facilitates accurate ONC determination. For comparison purposes, EGS is benchmarked against the traditional Gap Statistic and other established methods used for ONC selection in K-means, evaluating accuracy and efficiency across datasets with varying characteristics. The results demonstrate the superior accuracy and efficiency of EGS, affirming its effectiveness in diverse data environments.
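
For orientation, the traditional Gap Statistic that EGS is benchmarked against can be sketched as follows. This is a minimal illustration of the standard procedure of Tibshirani et al., assuming uniform reference data drawn over the bounding box of the input; it is not the authors' EGS, whose exponential reference distribution and adjustment factor are not specified in the abstract and are not reproduced here. The function names `kmeans_wk` and `gap_statistic`, the k-means++ seeding, and the toy data are assumptions made for the example.

```python
import math
import random


def kmeans_wk(points, k, rng, n_init=5, n_iter=50):
    """Run Lloyd's K-means n_init times; return the smallest
    within-cluster sum of squared distances (W_k) found."""
    dim = len(points[0])
    best = float("inf")
    for _ in range(n_init):
        # k-means++ seeding (an assumption of this sketch): each new
        # centre is sampled proportionally to squared distance.
        centers = [rng.choice(points)]
        while len(centers) < k:
            d2 = [min(math.dist(p, c) for c in centers) ** 2 for p in points]
            centers.append(rng.choices(points, weights=d2, k=1)[0])
        for _ in range(n_iter):
            # Assignment step: each point joins its nearest centre.
            groups = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda c: math.dist(p, centers[c]))
                groups[j].append(p)
            # Update step: move each centre to the mean of its group.
            new_centers = [
                tuple(sum(p[d] for p in g) / len(g) for d in range(dim))
                if g else centers[i]
                for i, g in enumerate(groups)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        wk = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
        best = min(best, wk)
    return best


def gap_statistic(points, k_max=5, n_refs=10, seed=0):
    """Return (chosen k, list of Gap(k) for k = 1..k_max)."""
    rng = random.Random(seed)
    dim = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dim)]
    highs = [max(p[d] for p in points) for d in range(dim)]
    gaps, s = [], []
    for k in range(1, k_max + 1):
        log_wk = math.log(kmeans_wk(points, k, rng))
        ref_logs = []
        for _ in range(n_refs):
            # Reference data: uniform over the bounding box of the input.
            ref = [
                tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                for _ in points
            ]
            ref_logs.append(math.log(kmeans_wk(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((x - mean_ref) ** 2 for x in ref_logs) / n_refs)
        gaps.append(mean_ref - log_wk)            # Gap(k)
        s.append(sd * math.sqrt(1 + 1 / n_refs))  # s_k
    # Choose the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k, gaps
    return k_max, gaps


# Toy data (hypothetical): three well-separated 2-D Gaussian blobs.
demo_rng = random.Random(42)
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [
    (demo_rng.gauss(cx, 0.5), demo_rng.gauss(cy, 0.5))
    for cx, cy in centres
    for _ in range(30)
]
best_k, gaps = gap_statistic(data)
print(best_k, [round(g, 2) for g in gaps])
```

The selection rule picks the smallest k whose gap is within one adjusted standard deviation of the next gap. Because the reference data here is uniform over the bounding box, the estimate is sensitive to scale, noise, and overlap in the input, which is the kind of limitation the abstract says EGS is designed to address.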

Published in Egyptian Informatics Journal

ISSN: 1110-8665 (Print)
Publisher: Elsevier
Country of publisher: Netherlands
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.sciencedirect.com/journal/egyptian-informatics-journal
