Egyptian Informatics Journal (Sep 2024)
Determining the optimal number of clusters by Enhanced Gap Statistic in K-mean algorithm
Abstract
Unsupervised learning, particularly K-means clustering, seeks to partition data into clusters with strong intra-class cohesion and inter-class disparity. However, K-means requires the number of clusters to be fixed in advance, and an arbitrary choice leads to trial and error in determining the Optimal Number of Clusters (ONC). To address this, various methodologies have been devised, among which the Gap Statistic is prominent. The Gap Statistic's reliance on expected values for reference data selection poses limitations, especially in scenarios involving diverse scales, noise, and overlapping data. To tackle these challenges, this study introduces the Enhanced Gap Statistic (EGS), which standardizes reference data using an exponential distribution within the Gap Statistic framework and integrates an adjustment factor for a more dependable estimation of the ONC. Applying EGS to K-means clustering facilitates accurate ONC determination. For comparison, EGS is benchmarked against the traditional Gap Statistic and other established methods for ONC selection in K-means, evaluating accuracy and efficiency across datasets with varying characteristics. The results demonstrate EGS's superior accuracy and efficiency, affirming its effectiveness in diverse data environments.
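For orientation, the traditional Gap Statistic that serves as the baseline here can be sketched as follows. This is a minimal illustration of the original formulation by Tibshirani, Walther, and Hastie (2001), with reference data drawn uniformly over the bounding box of the observations; it is not the EGS method, because the abstract does not specify the exponential reference standardization or the adjustment factor, so neither is reproduced. The function names (`kmeans_wss`, `gap_statistic`), the small K-means++ style routine, and all default parameter values are assumptions made for this sketch.

```python
import numpy as np

def kmeans_wss(X, k, n_init=5, n_iter=100, rng=None):
    """Best within-cluster sum of squares over n_init runs of Lloyd's algorithm.

    Illustrative helper (not from the paper); uses K-means++ style seeding.
    """
    rng = np.random.default_rng(rng)
    best = np.inf
    for _ in range(n_init):
        # K-means++ style seeding: later centers are sampled proportionally
        # to the squared distance from the nearest center chosen so far.
        centers = [X[rng.integers(len(X))]]
        for _ in range(1, k):
            c = np.array(centers)
            d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2).min(axis=1)
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        centers = np.array(centers)
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Keep the old center if a cluster ends up empty.
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best

def gap_statistic(X, k_max=6, n_refs=10, seed=0):
    """Return (estimated k, Gap(k) for k = 1..k_max) using uniform references."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, s = [], []
    for k in range(1, k_max + 1):
        log_w = np.log(kmeans_wss(X, k, rng=rng))
        # Reference data: uniform over the bounding box of X (original method).
        ref = np.array([np.log(kmeans_wss(rng.uniform(lo, hi, size=X.shape), k, rng=rng))
                        for _ in range(n_refs)])
        gaps.append(ref.mean() - log_w)
        s.append(ref.std() * np.sqrt(1 + 1 / n_refs))
    gaps, s = np.array(gaps), np.array(s)
    # Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}; fall back to k_max.
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k, gaps
    return k_max, gaps
```

Calling `k, gaps = gap_statistic(X)` on an `(n_samples, n_features)` array returns the estimated number of clusters together with the gap curve. Because the uniform reference depends on the range of each feature, the estimate can shift when features differ in scale or the data are noisy or overlapping, which is the limitation the abstract says EGS is designed to address.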