Distributed fuzzy clustering algorithm for mixed-mode data in Apache SPARK

Abdul Wahab Akram; Zareen Alamgir

doi:10.1186/s40537-022-00671-7

Journal of Big Data (Dec 2022)

Distributed fuzzy clustering algorithm for mixed-mode data in Apache SPARK

Abdul Wahab Akram,
Zareen Alamgir

Affiliations

Abdul Wahab Akram: National University of Computer and Emerging Sciences
Zareen Alamgir: National University of Computer and Emerging Sciences

DOI: https://doi.org/10.1186/s40537-022-00671-7
Journal volume & issue: Vol. 9, no. 1
pp. 1 – 18

Abstract

Read online

Abstract Fuzzy clustering is an invaluable data mining technique that allows each data point to belong to more than one cluster with some degree of membership. It is widely employed in exploratory data mining to discover overlapping communities in social networks, find structure in spectral data, and capture user interests in recommendation systems. Nowadays, the variety and volume of data are increasing at a tremendous rate. Data is power; the massive data, along with an effective technique, can unravel valuable information. The existing fuzzy clustering algorithms do not perform well on massive heterogeneous datasets. Processing an enormous amount of data is beyond the capacity of a single processor. The need of the hour is to develop fuzzy clustering techniques that can work on a distributed framework for Big Data processing and can handle heterogeneous data. In this research, we evaluate the performance of the recently proposed algorithm for the Fuzzy clustering of mixed-mode data FCMD-MD (D’Urso and Massari in Inf Sci 505:513–534, 2019) with different real-world datasets. We develop a distributed FCMD-MD, a fuzzy clustering algorithm for mixed-mode data in Apache SPARK. The experimental results show that the algorithm is scalable, performs well in a distributed environment, and clusters enormous heterogeneous data with high accuracy. We also compared the performance of distributed FCMD-MD and the distributed k-medoid algorithm.

Published in Journal of Big Data

ISSN: 2196-1115 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journalofbigdata.springeropen.com

About the journal