Journal of Big Data (Dec 2022)

Distributed fuzzy clustering algorithm for mixed-mode data in Apache SPARK

  • Abdul Wahab Akram,
  • Zareen Alamgir

DOI
https://doi.org/10.1186/s40537-022-00671-7
Journal volume & issue
Vol. 9, no. 1
pp. 1 – 18

Abstract

Read online

Abstract Fuzzy clustering is an invaluable data mining technique that allows each data point to belong to more than one cluster with some degree of membership. It is widely employed in exploratory data mining to discover overlapping communities in social networks, find structure in spectral data, and capture user interests in recommendation systems. Nowadays, the variety and volume of data are increasing at a tremendous rate. Data is power; the massive data, along with an effective technique, can unravel valuable information. The existing fuzzy clustering algorithms do not perform well on massive heterogeneous datasets. Processing an enormous amount of data is beyond the capacity of a single processor. The need of the hour is to develop fuzzy clustering techniques that can work on a distributed framework for Big Data processing and can handle heterogeneous data. In this research, we evaluate the performance of the recently proposed algorithm for the Fuzzy clustering of mixed-mode data FCMD-MD (D’Urso and Massari in Inf Sci 505:513–534, 2019) with different real-world datasets. We develop a distributed FCMD-MD, a fuzzy clustering algorithm for mixed-mode data in Apache SPARK. The experimental results show that the algorithm is scalable, performs well in a distributed environment, and clusters enormous heterogeneous data with high accuracy. We also compared the performance of distributed FCMD-MD and the distributed k-medoid algorithm.