A Distributed Execution Pipeline for Clustering Trajectories Based on a Fuzzy Similarity Relation

Soufiane Maguerra; Azedine Boulmakoul; Lamia Karim; Hassan Badir

doi:10.3390/a12020029

Algorithms (Jan 2019)

Affiliations

Soufiane Maguerra: LIM/IOS, FSTM, Hassan II University of Casablanca, Mohammedia 20000, Morocco
Azedine Boulmakoul: LIM/IOS, FSTM, Hassan II University of Casablanca, Mohammedia 20000, Morocco
Lamia Karim: National School of Applied Sciences Berrechid, Hassan 1st University, Berrechid 26002, Morocco
Hassan Badir: National School of Applied Sciences Tangier, Abdelmalek Essaâdi University, Tétouan 93000, Morocco

DOI: https://doi.org/10.3390/a12020029
Journal volume & issue: Vol. 12, no. 2, p. 29

Abstract

The proliferation of indoor and outdoor tracking devices has led to a vast amount of spatial data. Each object can be described by several trajectories that, once analysed, can yield significant knowledge. In particular, pattern analysis by clustering generic trajectories can give insight into objects sharing the same patterns. Still, sequential clustering approaches fail to handle large volumes of data, hence the need for distributed systems that can infer knowledge within a short time interval. In this paper, we detail an efficient, scalable and distributed execution pipeline for clustering raw trajectories. The clustering is achieved via a fuzzy similarity relation obtained by the transitive closure of a proximity relation. Moreover, the pipeline is integrated in Spark, implemented in Scala, and leverages the Core and GraphX libraries, making use of Resilient Distributed Datasets (RDDs) and graph processing. Furthermore, a new, simple but very efficient partitioning logic has been deployed in Spark and integrated into the execution process. The objective behind this logic is to distribute the load equally among all executors by considering the complexity of the data. In particular, resolving the load balancing issue has significantly reduced the conventional execution time. The performance of the whole distributed process has been evaluated on the Geolife project’s GPS trajectory dataset.
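
The abstract's central idea, turning a proximity relation into a fuzzy similarity relation by transitive closure and then reading clusters off it, can be illustrated on a single machine. The sketch below is not the paper's Spark/Scala pipeline: it assumes the common max-min form of transitivity, the function names (`max_min_compose`, `transitive_closure`, `alpha_cut_clusters`) are hypothetical, and the 4x4 proximity matrix is invented purely for illustration.

```python
def max_min_compose(a, b):
    """Max-min composition of two square fuzzy relations (lists of lists)."""
    n = len(a)
    return [
        [max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def transitive_closure(proximity):
    """Max-min transitive closure of a reflexive, symmetric fuzzy relation.

    Repeatedly takes the element-wise max of R and R o R until nothing changes.
    The result is a fuzzy similarity (equivalence) relation under the assumed
    max-min transitivity.
    """
    closure = [row[:] for row in proximity]
    n = len(closure)
    while True:
        composed = max_min_compose(closure, closure)
        merged = [
            [max(closure[i][j], composed[i][j]) for j in range(n)]
            for i in range(n)
        ]
        if merged == closure:
            return closure
        closure = merged


def alpha_cut_clusters(similarity, alpha):
    """Group indices whose similarity is at least alpha.

    Each alpha-cut of a similarity relation is a crisp equivalence relation,
    so the groups are disjoint.
    """
    clusters, assigned = [], set()
    for i in range(len(similarity)):
        if i in assigned:
            continue
        members = [j for j in range(len(similarity)) if similarity[i][j] >= alpha]
        assigned.update(members)
        clusters.append(members)
    return clusters


# Invented pairwise proximities between four trajectories (illustration only).
proximity = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.4, 0.1],
    [0.2, 0.4, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]

similarity = transitive_closure(proximity)
clusters_at_half = alpha_cut_clusters(similarity, 0.5)
```

With these invented values, trajectories 0 and 1 fall in one cluster and 2 and 3 in another at a threshold of 0.5; lowering the threshold to 0.4 merges all four. Varying the threshold in this way yields a hierarchy of nested partitions, which is the usual motivation for clustering through a similarity relation.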

Published in Algorithms

ISSN: 1999-4893 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.mdpi.com/journal/algorithms
