IEEE Access (Jan 2022)

Clustering Big Data Based on Distributed Fuzzy K-Medoids: An Application to Geospatial Informatics

  • Magda M. Madbouly,
  • Saad M. Darwish,
  • Noha A. Bagi,
  • Mohamed A. Osman

DOI
https://doi.org/10.1109/ACCESS.2022.3149548
Journal volume & issue
Vol. 10
pp. 20926 – 20936

Abstract

The advent of big data tied to spatial position, known as geospatial big data, offers new opportunities to understand the urban environment. Existing database processing methods cannot deliver reliable results quickly in this setting, because approximation “measures” must be defined and query execution time keeps growing. Clustering is an effective way to address this. However, scaling and accelerating clustering algorithms while maintaining high clustering efficiency remains a significant challenge. This paper’s primary contribution is a modified hierarchical distributed k-medoids clustering method tailored to spatial query analysis on big data. To improve the efficiency of the k-medoids algorithm and obtain more precise clusters, the proposed model uses the fuzzy k-medoids method to handle outliers in the spatial dataset and to deal with data uncertainty. The method does not need the correct number of clusters to be specified in advance. The proposed model has two phases. The first creates local clusters, each from a portion of the entire dataset, and makes extensive use of the parallelism provided by the Apache Spark framework. The second aggregates the local clusters to produce compact and reliable final clusters. The proposed model greatly reduces the amount of information exchanged during aggregation and automatically produces an appropriate number of clusters based on the characteristics of the dataset. The results show that the proposed model outperforms traditional k-medoids in the accuracy of the obtained cluster centers in big data applications.
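
The two-phase idea in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. It assumes Euclidean distance, a fuzzifier m = 2, farthest-first initialization, and a simple distance threshold (the hypothetical `radius` parameter) for merging local medoids. All function names are illustrative, and the partitions are processed in a sequential loop rather than in parallel on Apache Spark.

```python
import math


def memberships(p, medoids, m=2.0):
    """Fuzzy membership of point p in each cluster (assumed FCMdd-style rule)."""
    d = [math.dist(p, c) for c in medoids]
    if 0.0 in d:
        # p coincides with a medoid: give it full membership there.
        hits = [1.0 if x == 0.0 else 0.0 for x in d]
        return [h / sum(hits) for h in hits]
    w = [x ** (-1.0 / (m - 1.0)) for x in d]
    return [x / sum(w) for x in w]


def fuzzy_k_medoids(points, k, m=2.0, max_iter=50):
    """Return k medoids (actual data points) for one partition."""
    # Farthest-first initialization (an assumption, chosen to be deterministic).
    medoids = [points[0]]
    while len(medoids) < k:
        medoids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in medoids))
        )
    for _ in range(max_iter):
        u = [memberships(p, medoids, m) for p in points]
        new_medoids = []
        for j in range(k):
            # Pick the data point minimizing the membership-weighted distance sum.
            best = min(
                points,
                key=lambda c: sum(
                    u[i][j] ** m * math.dist(p, c) for i, p in enumerate(points)
                ),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids


def merge_medoids(local_medoids, radius):
    """Phase 2 (assumed rule): greedily group medoids lying within `radius`."""
    groups = []
    for c in local_medoids:
        for g in groups:
            if math.dist(c, g[0]) <= radius:
                g.append(c)
                break
        else:
            groups.append([c])
    # Represent each group by the member with the smallest total distance to the rest.
    return [min(g, key=lambda c: sum(math.dist(c, o) for o in g)) for g in groups]


def two_phase_clustering(points, n_parts, k_local, radius):
    """Phase 1: cluster each partition locally. Phase 2: merge the local medoids."""
    parts = [points[i::n_parts] for i in range(n_parts)]
    local = [c for part in parts for c in fuzzy_k_medoids(part, k_local)]
    return merge_medoids(local, radius)
```

In this sketch only the local medoids are passed to the merge step, not the raw points, which mirrors the reduced exchange during aggregation described above. The number of final clusters is determined by the merge step rather than fixed in advance.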

Keywords