IEEE Access (Jan 2024)
STICA: Spanning-Tree-Based Identification of Clusters With Agglomeration for Numerical Data
Abstract
Clustering is an unsupervised method used to group data points based on their similarity. It has many applications in data mining and preprocessing. Algorithms used for arbitrarily shaped clustering have several real-world applications. However, most of these algorithms require input parameters, such as the number of clusters, neighborhood radius, density threshold, and number of grids. However, it is difficult to estimate these parameters. A pragmatic approach is to develop a simple but parameterless clustering algorithm. To achieve this goal, we propose a parameterless algorithm for arbitrarily shaped clustering called spanning-tree-based identification of clusters (STIC) for numerical data. We consider data points as nodes of spanning trees (STs). We determined the STs’ edges based on the Euclidean distances between the node pairs and an internally estimated edge weight threshold. We generated clusters using breadth-first search (BFS) based on the notion of a spanning-tree-based representation of a cluster. Data points that were not visited by the BFS process were outliers. This algorithm is effective for datasets with inter-cluster separation. The STIC algorithm divides heterogeneous clusters into sub-clusters. In this case, we performed agglomerative merging of sub-clusters and outliers based on the single-linkage distances (SLDs) of the clusters and an internally estimated SLD threshold. This extended algorithm is called the STIC with agglomeration (STICA). Our experimental results with artificial datasets demonstrate that the STICA algorithm outperforms several widely used distance-based algorithms for arbitrarily shaped clusters. However, for real-world datasets, the STICA algorithm behaves similarly to other distance-based algorithms.
Keywords