International Journal of Digital Earth (Mar 2020)

A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data

  • Fei Hu,
  • Chaowei Yang,
  • Yongyao Jiang,
  • Yun Li,
  • Weiwei Song,
  • Daniel Q. Duffy,
  • John L. Schnase,
  • Tsengdar Lee

DOI
https://doi.org/10.1080/17538947.2018.1523957
Journal volume & issue
Vol. 13, no. 3
pp. 410–428

Abstract


Earth observations and model simulations are generating big multidimensional array-based raster data. However, it is difficult to query these big raster data efficiently because of inconsistencies among the geospatial raster data model, the distributed physical data storage model, and the data pipeline in distributed computing frameworks. To efficiently process big geospatial data, this paper proposes a three-layer hierarchical indexing strategy that optimizes Apache Spark with the Hadoop Distributed File System (HDFS) in the following ways: (1) improve I/O efficiency by adopting a chunked data structure; (2) maintain workload balance and high data locality by building a global index (a k-d tree); (3) enable Spark and HDFS to natively support geospatial raster data formats (e.g., HDF4, NetCDF4, GeoTIFF) by building a local index (a hash table); (4) index in-memory data to further accelerate geospatial data queries; (5) repartition data to tune query parallelism while keeping high data locality. These strategies are implemented as customized RDDs and evaluated by comparing their performance with that of Spark SQL and SciSpark. The proposed indexing strategy can be applied to other distributed frameworks or cloud-based computing systems to natively support big geospatial data queries with high efficiency.
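To make the two-layer lookup concrete, the sketch below illustrates the general idea of a global spatial index over chunk metadata combined with a local index to physical storage. It is not the authors' implementation: the `Chunk` fields, centroid-based k-d tree, file paths, and byte offsets are all hypothetical stand-ins for the paper's chunking and indexing structures, and plain Python is used in place of Spark RDDs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical chunk metadata: each raster chunk has a centroid in
# (lat, lon) space plus a pointer to its physical location in HDFS.
@dataclass
class Chunk:
    chunk_id: int
    centroid: Tuple[float, float]   # (lat, lon) of the chunk's center
    file_path: str                  # HDFS file holding the chunk
    byte_offset: int                # start of the chunk within the file

@dataclass
class KDNode:
    chunk: Chunk
    axis: int
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None

def build_kdtree(chunks: List[Chunk], depth: int = 0) -> Optional[KDNode]:
    """Global index: a k-d tree over 2-D chunk centroids."""
    if not chunks:
        return None
    axis = depth % 2
    chunks = sorted(chunks, key=lambda c: c.centroid[axis])
    mid = len(chunks) // 2
    return KDNode(
        chunk=chunks[mid],
        axis=axis,
        left=build_kdtree(chunks[:mid], depth + 1),
        right=build_kdtree(chunks[mid + 1:], depth + 1),
    )

def range_query(node: Optional[KDNode], lo, hi, out: List[Chunk]) -> None:
    """Collect chunks whose centroids fall inside the [lo, hi] box."""
    if node is None:
        return
    c = node.chunk.centroid
    if all(lo[d] <= c[d] <= hi[d] for d in range(2)):
        out.append(node.chunk)
    axis = node.axis
    if lo[axis] <= c[axis]:
        range_query(node.left, lo, hi, out)
    if c[axis] <= hi[axis]:
        range_query(node.right, lo, hi, out)

# Toy 2x2 grid of chunks over a 20x20-degree region (invented data).
chunks = [
    Chunk(0, (5.0, 5.0),   "/hdfs/t2m_2020.nc", 0),
    Chunk(1, (5.0, 15.0),  "/hdfs/t2m_2020.nc", 4096),
    Chunk(2, (15.0, 5.0),  "/hdfs/t2m_2020.nc", 8192),
    Chunk(3, (15.0, 15.0), "/hdfs/t2m_2020.nc", 12288),
]

tree = build_kdtree(chunks)

# Local index: a hash table from chunk id to its physical location,
# so matched chunks can be read directly instead of scanning the file.
local_index = {c.chunk_id: (c.file_path, c.byte_offset) for c in chunks}

# A spatial query touches the global index first, then resolves the
# matched chunks to file locations through the local index.
hits: List[Chunk] = []
range_query(tree, (0.0, 0.0), (10.0, 10.0), hits)  # SW quadrant only
locations = [local_index[c.chunk_id] for c in hits]
```

In this toy run, the range query over the south-west quadrant matches only chunk 0, and the local index resolves it to its file and byte offset, avoiding a full scan; the paper's version applies the same pattern inside Spark tasks against chunked HDF4/NetCDF4/GeoTIFF files.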

Keywords