Egyptian Informatics Journal (Dec 2022)
WeDIV – An improved k-means clustering algorithm with a weighted distance and a novel internal validation index
Abstract
Designing an appropriate similarity metric (distance) and estimating the optimal number of clusters are two important issues in cluster analysis. This study proposed an improved k-means clustering algorithm involving a Weighted Distance and a novel Internal Validation index (WeDIV). The weighted distance, EP_dis, was designed by weighting the relative contributions of the Euclidean and Pearson distances. This strategy can simultaneously capture information reflecting the global spatial correlation and the local variation trend in high-dimensional space. The new internal validation index, RCH, inspired by the Calinski-Harabasz (CH) index and the analysis of variance, was developed to automatically estimate the optimal number of clusters. EP_dis was proven mathematically reliable and was validated on two simulated datasets. Four simulated datasets with different properties were used to validate the effectiveness of RCH. Furthermore, we compared the clustering performance of WeDIV with 12 prevailing clustering algorithms on 16 UCI datasets. The results demonstrated that WeDIV outperforms the others whether or not the number of clusters is specified.
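The idea of weighting a Euclidean distance against a Pearson distance can be illustrated with a minimal sketch. The abstract does not give the formula for EP_dis, so the convex combination below, the weight `w`, the `scale` parameter, and all function names are assumptions made for illustration only and should not be read as the authors' definition.

```python
import math


def euclidean(x, y):
    """Standard Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """Pearson distance, taken here as 1 - r (ranges from 0 to 2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        # Correlation is undefined for a constant vector.
        # Assumption: treat the pair as uncorrelated (r = 0).
        return 1.0
    return 1.0 - cov / (sd_x * sd_y)


def weighted_ep_distance(x, y, w=0.5, scale=1.0):
    """Hypothetical weighted Euclidean-Pearson distance.

    Assumption: a convex combination with weight w in [0, 1]. The
    Euclidean term is divided by `scale` (for example, the largest
    pairwise Euclidean distance in the dataset) so that both terms
    are on comparable ranges. The paper's EP_dis may differ.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * euclidean(x, y) / scale + (1.0 - w) * pearson_distance(x, y)
```

With `w=1.0` the sketch reduces to a scaled Euclidean distance, which reflects spatial proximity; with `w=0.0` it reduces to the Pearson distance, which reflects whether two profiles rise and fall together regardless of magnitude. Intermediate weights trade one off against the other.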