Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems

Eduarda Costa; Carlos Costa; Maribel Yasmina Santos

doi:10.1186/s40537-019-0196-1

Journal of Big Data (May 2019)

Affiliations

Eduarda Costa: ALGORITMI Research Centre, University of Minho
Carlos Costa: ALGORITMI Research Centre, University of Minho
Maribel Yasmina Santos: ALGORITMI Research Centre, University of Minho

DOI: https://doi.org/10.1186/s40537-019-0196-1
Journal volume & issue: Vol. 6, no. 1, pp. 1–38

Abstract

Hive has long been one of the industry-leading systems for Data Warehousing in Big Data contexts. It organizes data into databases, tables, partitions and buckets, stored on top of an unstructured distributed file system such as HDFS. Several studies have examined how to optimize the performance of storage systems for Big Data Warehousing. However, few of them explore the impact of data organization strategies on query performance when Hive is the storage technology used to implement Big Data Warehousing systems. Therefore, this paper evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and verifying their effect on query performance. The results demonstrate the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of using adequate partitioning strategies. Aligning the partitions with the attributes that are frequently used in the conditions/filters of the queries can significantly reduce the response time of the system. In the more intensive workload benchmarked in this paper, overall decreases of about 40% in processing time were observed. Bucketing strategies did not show the same gains: they show potential benefits only in very specific scenarios, which suggests a more restricted use of this functionality, namely bucketing two tables by their join attribute.
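The two data organization strategies named in the abstract can be illustrated with a small conceptual sketch. The Python below is not Hive code and does not reproduce the paper's benchmark. It only simulates the underlying ideas: a partition groups rows by the value of a filter attribute, so a query filtering on that attribute reads a single group, and bucketing assigns rows to a fixed number of buckets from the join attribute, so bucket i of one table only needs to be joined with bucket i of the other. The table contents, column names and the helper functions `partition_by`, `bucket_by` and `bucketed_join` are hypothetical, and the modulo on an integer key is a stand-in for Hive's own hash function.

```python
from collections import defaultdict


def partition_by(rows, key):
    """Group rows by the value of `key`, one group per distinct value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def bucket_by(rows, key, num_buckets):
    """Assign each row to a bucket from its integer key (stand-in for a hash)."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row[key] % num_buckets].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets, key):
    """Join bucket i of the left table only with bucket i of the right table."""
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        index = defaultdict(list)
        for r in right:
            index[r[key]].append(r)
        for l in left:
            for r in index[l[key]]:
                joined.append({**l, **r})
    return joined


# Hypothetical toy tables, not data from the paper.
sales = [
    {"order_id": 1, "customer_id": 10, "year": 2017, "amount": 50.0},
    {"order_id": 2, "customer_id": 11, "year": 2018, "amount": 20.0},
    {"order_id": 3, "customer_id": 10, "year": 2018, "amount": 75.0},
    {"order_id": 4, "customer_id": 12, "year": 2018, "amount": 5.0},
]
customers = [
    {"customer_id": 10, "name": "A"},
    {"customer_id": 11, "name": "B"},
    {"customer_id": 12, "name": "C"},
]

# Partitioning: a filter on `year` touches only the matching partition.
sales_by_year = partition_by(sales, "year")
scanned = sales_by_year.get(2018, [])
total_2018 = sum(row["amount"] for row in scanned)

# Bucketing: both tables are bucketed by the join attribute with the same
# bucket count, so equal keys always land in the same bucket index.
joined = bucketed_join(
    bucket_by(sales, "customer_id", 2),
    bucket_by(customers, "customer_id", 2),
    "customer_id",
)
```

In this toy run the filter on `year` reads three of the four sales rows instead of the whole table, and the join never compares rows from different bucket indexes. Both tables must use the same bucket count and the same join attribute for the bucket-wise join to be valid, which is the restricted scenario the abstract refers to.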

Published in Journal of Big Data

ISSN: 2196-1115 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journalofbigdata.springeropen.com
