Journal of King Saud University: Computer and Information Sciences (Jan 2023)
Big data analysis using a parallel ensemble clustering architecture and an unsupervised feature selection approach
Abstract
Ensemble clustering is a challenging research direction in data mining in which the results of several individual clustering methods are combined to produce higher-quality final clusters. This study introduces a parallel hierarchical clustering approach based on the divide-and-conquer strategy, with the aim of making ensemble clustering faster and more efficient. We propose a cluster consensus selection approach that selects a subset of primary clusters that merit participation in the final consensus. Considering the sample–cluster and cluster–cluster similarity on the selected primary clusters, we form the final clusters using the clusters clustering technique as the consensus function. In addition, the proposed scheme is equipped with an unsupervised feature selection approach that removes features that do not contribute significantly to clustering. Extensive evaluations were performed on datasets of different dimensions from the University of California Irvine (UCI) machine learning repository. The simulation results demonstrate the efficiency of the proposed scheme, which improves average performance by between 6% and 24% compared to state-of-the-art clustering methods.
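As a rough illustration of how a consensus function can use cluster–cluster and sample–cluster similarity, the following minimal Python sketch groups primary clusters into meta-clusters and then assigns each sample to the meta-cluster that contains it most often. It is not the authors' algorithm: the Jaccard measure, the greedy threshold-based grouping, the `threshold` value, and the function names `jaccard`, `group_clusters`, and `consensus` are all assumptions made for this example, and the parallel divide-and-conquer stage, cluster selection, and feature selection are omitted.

```python
from itertools import combinations


def jaccard(a, b):
    """Cluster-cluster similarity (assumed here): Jaccard overlap of two sample-id sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_clusters(clusters, threshold=0.5):
    """Greedily merge primary clusters whose similarity reaches the threshold.

    Uses a small union-find structure; returns lists of cluster indices,
    one list per meta-cluster.
    """
    parent = list(range(len(clusters)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(clusters)), 2):
        if jaccard(clusters[i], clusters[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(clusters)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def consensus(clusters, n_samples, threshold=0.5):
    """Assign each sample to the meta-cluster in which it appears most often."""
    groups = group_clusters(clusters, threshold)
    labels = []
    for s in range(n_samples):
        # Sample-cluster similarity (assumed here): fraction of the
        # meta-cluster's primary clusters that contain sample s.
        scores = [sum(s in clusters[c] for c in g) / len(g) for g in groups]
        labels.append(max(range(len(groups)), key=scores.__getitem__))
    return labels


# Primary clusters from three hypothetical base partitions of six samples.
primary = [
    {0, 1, 2}, {3, 4, 5},   # partition 1
    {0, 1}, {2, 3, 4, 5},   # partition 2
    {0, 1, 2}, {3, 4, 5},   # partition 3
]
labels = consensus(primary, n_samples=6)
print(labels)
```

In this toy input, two of the three base partitions place sample 2 with samples 0 and 1, so the majority view wins for that sample in the consensus labels.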