IEEE Access (Jan 2023)

Hierarchical Semi-Sparse Cubes—Parallel Framework for Storing Multi-Modal Big Data in HDF5

  • Jiri Nadvornik,
  • Petr Skoda,
  • Pavel Tvrdik

DOI
https://doi.org/10.1109/ACCESS.2023.3323897
Journal volume & issue
Vol. 11
pp. 119876 – 119897

Abstract

Since Moore's law also applies to data detectors, the volume of data collected in astronomy doubles approximately every year. A prime example is the upcoming Square Kilometer Array (SKA) instrument, which will produce approximately 8.5 exabytes over its first 15 years of service, starting in 2027. Storage capacities for these data have grown as well, and the primary analytical tools have kept up. However, the tools for combining big data from several such instruments still lag behind. The ability to easily combine big data is crucial for inferring new knowledge about the universe from correlations, that is, for finding interesting information not only in these huge datasets individually but also in their combinations. In this article, we present a revised version of the Hierarchical Semi-Sparse Cube (HiSS-Cube) framework, which aims to provide highly parallel processing of combined multi-modal, multi-dimensional big data. The main contributions of this study are as follows: 1) highly parallel construction of a database, built on top of the HDF5 framework, that supports parallel queries; 2) the design of a database index on top of HDF5 that can be easily constructed in parallel; and 3) support for efficient multi-modal big data combinations. We tested scalability and efficiency on big astronomical spectroscopic and photometric data obtained from the Sloan Digital Sky Survey. The performance of HiSS-Cube is bounded by the I/O bandwidth and the I/O operations per second of the underlying parallel file system, and it scales linearly with the number of I/O nodes.

Keywords