A Scalable Computing Resources System for Remote Sensing Big Data Processing Using GeoPySpark Based on Spark on K8s

Jifu Guo; Chunlin Huang; Jinliang Hou

doi:10.3390/rs14030521

Remote Sensing (Jan 2022)

Affiliations

Jifu Guo: Key Laboratory of Remote Sensing of Gansu Province, Heihe Remote Sensing Experimental Research Station, Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences, Lanzhou 730000, China
Chunlin Huang: Key Laboratory of Remote Sensing of Gansu Province, Heihe Remote Sensing Experimental Research Station, Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences, Lanzhou 730000, China
Jinliang Hou: Key Laboratory of Remote Sensing of Gansu Province, Heihe Remote Sensing Experimental Research Station, Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences, Lanzhou 730000, China

DOI: https://doi.org/10.3390/rs14030521
Journal volume & issue: Vol. 14, no. 3, p. 521

Abstract

As Earth observation (EO) enters the era of big data, the storage, analysis, and visualization of massive volumes of remote sensing (RS) data pose a significant challenge. In this paper, we propose a novel scalable computing resources system that achieves high-speed processing of RS big data in a parallel distributed architecture. To reduce data movement among computing nodes, the Hadoop Distributed File System (HDFS) is deployed on the same Kubernetes (K8s) nodes that are used for computing. For RS data analysis, we use a tile-oriented programming model instead of the traditional strip-oriented or pixel-oriented approach to better implement parallel computing in a Spark on K8s cluster. A large RS raster layer can be abstracted as a set of tiles of any user-defined size, so that a whole computing task can be divided into multiple distributed parallel tasks. The computing resources requested by users are assigned immediately in the Spark on K8s cluster simply by configuring and initializing SparkContext through a web-based Jupyter notebook console. Users can easily query, write, or visualize data within a box of any size through the catalog module in GeoPySpark. In summary, the proposed system provides distributed, scalable resources that integrate big data storage, parallel computing, and real-time visualization.
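
The tile-oriented idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the GeoPySpark or Spark APIs: a local thread pool stands in for the Spark executors, the raster is a plain nested list, and the names `tile_bounds`, `tile_sum`, and `raster_mean` are hypothetical. It shows only how one raster-wide computation (here, a mean) can be split into independent per-tile tasks whose partial results are then combined.

```python
from concurrent.futures import ThreadPoolExecutor


def tile_bounds(n_rows, n_cols, tile_size):
    """Yield (row_start, row_end, col_start, col_end) windows covering the raster.

    Edge tiles are clipped, so tile_size need not divide the raster dimensions.
    """
    for r in range(0, n_rows, tile_size):
        for c in range(0, n_cols, tile_size):
            yield r, min(r + tile_size, n_rows), c, min(c + tile_size, n_cols)


def tile_sum(raster, bounds):
    """Per-tile task: sum the cell values inside one window."""
    r0, r1, c0, c1 = bounds
    return sum(sum(row[c0:c1]) for row in raster[r0:r1])


def raster_mean(raster, tile_size):
    """Compute the raster mean by combining independent per-tile partial sums."""
    n_rows, n_cols = len(raster), len(raster[0])
    windows = list(tile_bounds(n_rows, n_cols, tile_size))
    # Assumption: a thread pool stands in for distributed workers.
    with ThreadPoolExecutor() as pool:
        partial_sums = list(pool.map(lambda b: tile_sum(raster, b), windows))
    return sum(partial_sums) / (n_rows * n_cols)


if __name__ == "__main__":
    # A 4 x 5 toy raster holding the values 0..19.
    demo = [[r * 5 + c for c in range(5)] for r in range(4)]
    print(len(list(tile_bounds(4, 5, 2))), "tiles")
    print("mean =", raster_mean(demo, 2))
```

Because each tile is processed without reference to its neighbours, the number of parallel tasks depends only on the raster size and the chosen tile size. That independence is the property that lets such a workload be spread across cluster nodes.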

Published in Remote Sensing

ISSN: 2072-4292 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science
Website: http://www.mdpi.com/journal/remotesensing/
