Современные информационные технологии и IT-образование (Modern Information Technologies and IT-Education), Dec 2021
Evaluation of the Temporal Efficiency of Big Data Storage Formats in the Dynamics of Data Growth
Abstract
When building a data lake on a platform such as Apache Hadoop, the choice of data storage format is an important decision. It should rest on several criteria, one of which is the time required to run different queries against the stored data. However, the volume of data in any data processing system grows continuously, so the efficiency of each format must be studied as the amount of stored data increases. This article proposes a methodology for assessing the efficiency of data storage formats in data lakes built on the Apache Hadoop platform as the data volume grows. The proposed experiment consists of a series of queries of varying complexity against data stored in the JSON, Apache Avro, ORC, and Apache Parquet formats. The queries are run with the Apache Spark framework.
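The experiment outlined in the abstract could be organized roughly as in the Python sketch below: one query is timed for every combination of storage format and data volume. This is only an illustration under stated assumptions, not the authors' implementation. The names `time_call`, `benchmark`, and `spark_query_runner`, the directory layout `<base_path>/<volume>/<fmt>`, the use of the median over a few repeats, and the `count()` query are all hypothetical; the abstract specifies only the formats and the use of Apache Spark.

```python
import time
from statistics import median
from typing import Callable, Dict, Hashable, Iterable, Tuple

# The four formats named in the abstract, written as Spark data source names.
FORMATS = ("json", "avro", "orc", "parquet")


def time_call(action: Callable[[], object], repeats: int = 3) -> float:
    """Return the median wall-clock time (seconds) of `action` over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return median(samples)


def benchmark(
    run_query: Callable[[str, Hashable], object],
    formats: Iterable[str],
    volumes: Iterable[Hashable],
    repeats: int = 3,
) -> Dict[Tuple[str, Hashable], float]:
    """Time run_query(fmt, volume) for every (format, data volume) pair."""
    volumes = list(volumes)  # allow the volumes to be iterated once per format
    results = {}
    for fmt in formats:
        for volume in volumes:
            results[(fmt, volume)] = time_call(
                lambda: run_query(fmt, volume), repeats
            )
    return results


def spark_query_runner(base_path: str) -> Callable[[str, Hashable], None]:
    """Build a run_query callable backed by Apache Spark (assumed data layout)."""
    # Imported lazily so the timing harness itself does not require Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-benchmark").getOrCreate()

    def run_query(fmt: str, volume: Hashable) -> None:
        # Assumption: each data set lives under <base_path>/<volume>/<fmt>.
        # Reading "avro" requires the external spark-avro package.
        df = spark.read.format(fmt).load(f"{base_path}/{volume}/{fmt}")
        # count() is a stand-in for one of the queries of varying complexity;
        # as a Spark action it forces the read to actually execute.
        df.count()

    return run_query
```

With a Spark installation available, a call such as `benchmark(spark_query_runner("/data/lake"), FORMATS, ["1gb", "10gb"])` (path and volume labels are placeholders) would return a dictionary of median timings keyed by format and volume, which can then be compared across volumes to see how each format behaves as the data grows.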
Keywords