IEEE Access (Jan 2021)

Automatic Data Clustering Framework Using Nature-Inspired Binary Optimization Algorithms

  • Behnaz Merikhi,
  • M. R. Soleymani

DOI
https://doi.org/10.1109/ACCESS.2021.3091397
Journal volume & issue
Vol. 9
pp. 93703 – 93722

Abstract

Cluster analysis using metaheuristic algorithms has become increasingly popular in recent years due to the great success of these algorithms in finding high-quality clusters in complex real-world problems. This paper proposes a novel framework for automatic data clustering that uses nature-inspired binary optimization algorithms and is capable of generating clusters with approximately the same maximum distortion. The inherent problem with clustering using such algorithms is the huge search space. Therefore, we also propose a binary encoding scheme for the particle representation to alleviate this problem. The proposed clustering solution requires no prior knowledge of the number of clusters and proceeds by re-clustering, merging, and modifying the small clusters to compensate for the distortion gap between groups of different sizes. The proposed framework’s performance has been evaluated over a wide range of synthetic, real-life, and higher-dimensional datasets, first by considering four different binary optimization algorithms for the optimizer module. It has then been compared, in terms of the separation and compactness of the clusters as assessed by internal validity measures, to multiple classical and new clustering solutions and to two other automatic clustering techniques that operate in a continuous search space. The experimental results show that the proposed solution is highly efficient in creating well-separated and compact clusters with approximately the same distortion in most datasets. Moreover, the application of the proposed framework to a correlated binary dataset is reported as a case study. The presence of correlation in a dataset results from the similarity between data points in the same category, as in repeated measurements in remote sensing, crowdsourced multi-view video uploading, and augmented reality. The advantages of the proposed framework are its simplicity, its customizability, its flexibility in adding extra conditions to the solution, and its dynamic number of clusters.
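
As a rough illustration of how a binary particle can encode a clustering without fixing the number of clusters in advance, the following minimal sketch assumes that each bit marks whether the corresponding data point acts as a cluster center. This encoding, the helper names (`decode`, `assign`, `max_distortions`, `fitness`), and the `penalty` term are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
import math

def decode(mask, points):
    """Return the points whose bit is set; these act as cluster centers (assumed encoding)."""
    return [p for bit, p in zip(mask, points) if bit]

def assign(points, centers):
    """Label each point with the index of its nearest center (Euclidean distance)."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]

def max_distortions(points, centers, labels):
    """Per-cluster maximum distance between a member and its center."""
    worst = [0.0] * len(centers)
    for p, j in zip(points, labels):
        worst[j] = max(worst[j], math.dist(p, centers[j]))
    return worst

def fitness(mask, points, penalty=0.1):
    """Hypothetical objective: largest cluster distortion plus a per-cluster penalty."""
    centers = decode(mask, points)
    if not centers:
        return math.inf  # a particle with no active center is infeasible
    labels = assign(points, centers)
    return max(max_distortions(points, centers, labels)) + penalty * len(centers)

# Toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
mask = [1, 0, 1, 0]  # bits 0 and 2 are set, so the particle encodes two clusters

centers = decode(mask, points)
labels = assign(points, centers)
distortions = max_distortions(points, centers, labels)
```

In this sketch the number of clusters is simply the number of set bits, so a binary optimizer that flips bits also changes the cluster count. The per-cluster values in `distortions` are the quantity that would need to be roughly equal across clusters to match the "approximately the same maximum distortion" goal described above.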

Keywords