Journal of Big Data (Sep 2017)

Clustering categorical data based on the relational analysis approach and MapReduce

  • Yasmine Lamari,
  • Said Chah Slaoui

DOI
https://doi.org/10.1186/s40537-017-0090-7
Journal volume & issue
Vol. 4, no. 1
pp. 1 – 16

Abstract

Traditional clustering methods cannot cope with the exploding volume of data that the world currently faces. In response, research on parallel clustering methods has intensified. Although a variety of parallel programming models exist, the MapReduce paradigm is considered the most prominent model for large-scale data processing problems, clustering among them. This paper introduces a new parallel design, built on the MapReduce programming model, of a recently proposed heuristic for hard clustering. The heuristic clusters large categorical data sets by efficiently partitioning them according to the relational analysis approach. The proposed design, called PMR-Transitive, is a single-scan, parameter-free heuristic that determines the number of clusters automatically. Experimental results on real-life and synthetic data sets demonstrate that PMR-Transitive produces good quality results.

Keywords