Journal of Algorithms & Computational Technology (Jun 2010)

Implementing and Optimizing Multiple Group by Query in a MapReduce Approach

  • Jie Pan,
  • Frédéric Magoulès,
  • Yann Le Biannic

DOI
https://doi.org/10.1260/1748-3018.4.2.183
Journal volume & issue
Vol. 4

Abstract

The MapReduce model is a new parallel programming model initially developed for large-scale web content processing. Data analysis faces the problem of how to perform computations over extremely large datasets. The arrival of MapReduce makes it possible to use commodity hardware for massively parallel data analysis applications. Translating relational algebra operators into MapReduce programs, and optimizing the resulting programs, is still an open and active research field. In this paper, we focus on a particular type of data analysis query, namely the multiple group by query. We first study the communication cost of the MapReduce model, then give an initial implementation of the multiple group by query. We then propose an optimized version that addresses the communication cost issues. The optimized version shows better speed-up and better scalability than the initial version.
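
As a rough illustration of the kind of query discussed, the sketch below simulates a multiple group by query (one aggregate computed separately for each of several grouping columns) in a MapReduce style, in plain Python. It is not the implementation from the paper. The function names, the sample records, and the column names `color`, `size` and `sales` are all hypothetical. The local pre-aggregation step is the generic combiner technique, used here only to show one way the number of key-value pairs sent from mappers to reducers might be reduced; the paper's own optimization may differ.

```python
from collections import defaultdict


def map_record(record, dims, measure):
    """Emit one ((dimension, value), measure) pair per group-by dimension."""
    for dim in dims:
        yield (dim, record[dim]), record[measure]


def combine(pairs):
    """Locally pre-aggregate one mapper's output before it is shuffled."""
    partial = defaultdict(float)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def reduce_all(pairs):
    """Sum the values for each (dimension, value) key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def multiple_group_by(splits, dims, measure, use_combiner):
    """Run the simulated job; return the totals and the number of shuffled pairs."""
    shuffled = []
    for split in splits:  # each split stands in for one mapper's input
        emitted = [kv for rec in split for kv in map_record(rec, dims, measure)]
        shuffled.extend(combine(emitted) if use_combiner else emitted)
    return reduce_all(shuffled), len(shuffled)


# Hypothetical sample data: two input splits of a small sales table.
SPLITS = [
    [{"color": "red", "size": "S", "sales": 10},
     {"color": "red", "size": "M", "sales": 5}],
    [{"color": "blue", "size": "S", "sales": 7},
     {"color": "red", "size": "S", "sales": 3}],
]

if __name__ == "__main__":
    for flag in (False, True):
        totals, n_pairs = multiple_group_by(SPLITS, ["color", "size"], "sales", flag)
        print(f"combiner={flag}: {n_pairs} pairs shuffled, totals={totals}")
```

Both runs produce the same totals; only the count of pairs crossing the simulated shuffle changes, which is the quantity a communication cost analysis would look at.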