STICA: Spanning-Tree-Based Identification of Clusters With Agglomeration for Numerical Data

Mozammel H. A. Khan

doi:10.1109/ACCESS.2024.3480460

IEEE Access (Jan 2024)

STICA: Spanning-Tree-Based Identification of Clusters With Agglomeration for Numerical Data

Mozammel H. A. Khan

Affiliations

Mozammel H. A. Khan: ORCiD; Department of Computer Science and Engineering, East West University, Aftabnagar, Dhaka, Bangladesh

DOI: https://doi.org/10.1109/ACCESS.2024.3480460
Journal volume & issue: Vol. 12
pp. 150430 – 150450

Abstract

Read online

Clustering is an unsupervised method used to group data points based on their similarity. It has many applications in data mining and preprocessing. Algorithms used for arbitrarily shaped clustering have several real-world applications. However, most of these algorithms require input parameters, such as the number of clusters, neighborhood radius, density threshold, and number of grids. However, it is difficult to estimate these parameters. A pragmatic approach is to develop a simple but parameterless clustering algorithm. To achieve this goal, we propose a parameterless algorithm for arbitrarily shaped clustering called spanning-tree-based identification of clusters (STIC) for numerical data. We consider data points as nodes of spanning trees (STs). We determined the STs’ edges based on the Euclidean distances between the node pairs and an internally estimated edge weight threshold. We generated clusters using breadth-first search (BFS) based on the notion of a spanning-tree-based representation of a cluster. Data points that were not visited by the BFS process were outliers. This algorithm is effective for datasets with inter-cluster separation. The STIC algorithm divides heterogeneous clusters into sub-clusters. In this case, we performed agglomerative merging of sub-clusters and outliers based on the single-linkage distances (SLDs) of the clusters and an internally estimated SLD threshold. This extended algorithm is called the STIC with agglomeration (STICA). Our experimental results with artificial datasets demonstrate that the STICA algorithm outperforms several widely used distance-based algorithms for arbitrarily shaped clusters. However, for real-world datasets, the STICA algorithm behaves similarly to other distance-based algorithms.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639

About the journal

Abstract

Keywords