Applied Sciences (Sep 2018)

Integrated High-Performance Platform for Fast Query Response in Big Data with Hive, Impala, and SparkSQL: A Performance Evaluation

  • Bao Rong Chang,
  • Hsiu-Fen Tsai,
  • Yun-Da Lee

DOI
https://doi.org/10.3390/app8091514
Journal volume & issue
Vol. 8, no. 9
p. 1514

Abstract

Read online

This paper first integrates big data tools—Hive, Impala, and SparkSQL—which support SQL-like queries for rapid data retrieval in big data. The three introduced tools are not only suitable for operating in business intelligence to serve high-performance data retrieval, but they are also an open-source software solution with low cost for small-to-medium enterprise use. In practice, the proposed approach provides an in-memory cache and an in-disk cache to achieve a very fast response to a query if a cache hit occurs. Moreover, this paper develops so-called platform selection that is able to select the appropriate tool dealing with input query with effectiveness and efficiency. As a result, the speed of job execution of proposed approach using platform selection is 2.63 times faster than Hive in the Case 1 experiment, and 4.57 times faster in the Case 2 experiment.

Keywords