Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark

Panagiotis Moutafis; George Mavrommatis; Michael Vassilakopoulos; Antonio Corral

doi:10.3390/ijgi10110763

ISPRS International Journal of Geo-Information (Nov 2021)

Affiliations

Panagiotis Moutafis: Data Structuring & Engineering Lab, Department of Electrical and Computer Engineering, University of Thessaly, 38221 Volos, Greece
George Mavrommatis: Data Structuring & Engineering Lab, Department of Electrical and Computer Engineering, University of Thessaly, 38221 Volos, Greece
Michael Vassilakopoulos: Data Structuring & Engineering Lab, Department of Electrical and Computer Engineering, University of Thessaly, 38221 Volos, Greece
Antonio Corral: Department of Informatics, University of Almeria, 04120 Almeria, Spain

DOI: https://doi.org/10.3390/ijgi10110763
Journal volume & issue: Vol. 10, no. 11, p. 763

Abstract

Designing and implementing new distributed algorithms for spatial query processing is a current challenge in distributed computing systems. Apache Spark is a memory-based framework suitable for real-time and batch processing. Spark-based systems allow users to work on distributed in-memory data without worrying about the data distribution mechanism or fault tolerance. Given two datasets of points (called Query and Training), the group K nearest-neighbor (GKNN) query retrieves the K points of Training with the smallest sum of distances to all points of Query. This spatial query has been actively studied in centralized environments, where several performance-improving techniques and pruning heuristics have been proposed, and our team recently proposed a distributed algorithm for it in Apache Hadoop. Since Apache Hadoop generally exhibits lower performance than Spark, in this paper we present the first distributed GKNN query algorithm in Apache Spark and compare it against the one in Apache Hadoop. The algorithm uses programming features and facilities that are specific to Apache Spark and incorporates performance-improving techniques that are applicable in Spark. The results of an extensive set of experiments with real-world spatial datasets are presented, demonstrating that our Apache Spark GKNN solution, with its improvements, is efficient and clearly outperforms processing this query in Apache Hadoop.
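
To make the GKNN definition above concrete, the following is a minimal single-machine sketch in Python. It is a brute-force illustration of the query semantics only, not the distributed Spark algorithm presented in the paper. It also assumes Euclidean distance and 2D points, which the abstract does not specify, and the function name `gknn_brute_force` and the sample data are hypothetical.

```python
import heapq
import math


def gknn_brute_force(query, training, k):
    """Return the k Training points with the smallest sum of distances
    to all Query points, as (sum_of_distances, point) pairs.

    Assumption: Euclidean distance; the abstract does not name a metric.
    """
    def sum_dist(p):
        # Sum of distances from candidate p to every Query point.
        return sum(math.dist(p, q) for q in query)

    # Score every Training point and keep the k smallest scores.
    scored = ((sum_dist(p), p) for p in training)
    return heapq.nsmallest(k, scored)


# Hypothetical sample data, for illustration only.
query = [(0.0, 0.0), (2.0, 0.0)]
training = [(1.0, 0.0), (1.0, 5.0), (10.0, 10.0), (4.0, 0.0)]

result = gknn_brute_force(query, training, 2)
for total, point in result:
    print(point, total)
```

This sketch evaluates every Training point against every Query point, so its cost grows with the product of the two dataset sizes. The pruning heuristics and distributed processing mentioned in the abstract are aimed at avoiding this exhaustive work.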

Published in ISPRS International Journal of Geo-Information

ISSN: 2220-9964 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Geography. Anthropology. Recreation: Geography (General)
Website: http://www.mdpi.com/journal/ijgi
