Journal of Algorithms & Computational Technology (Jun 2012)
MapReduce-Based Parallel Algorithms for Multidimensional Data Analysis
Abstract
MapReduce offers excellent scalability and fault tolerance, and it fits well with cheap commodity hardware. Using MapReduce to answer analytical queries over data is therefore an attractive topic. In this work, we study the processing of Multiple Group-by queries. Our processing of this query is based on the MapReduce model, a new parallel computing model that comes from Cloud Computing. A pre-processing phase adapts the data to MapReduce's data access pattern and improves data accessibility. We give different MapReduce job definitions in order to process data sets partitioned with different partitioning methods. We evaluate our query processing on a cluster of Grid'5000. We also address performance issues, since they are very important to the software industry when it integrates a new technology. We analyze the measured results and identify several factors that affect the response time. At the end of this work, we propose a new data structure that allows more flexible job scheduling.
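To make the idea of a Multiple Group-by query concrete, the following is a minimal single-process sketch in Python of how such a query could be phrased as a map step and a reduce step. It is an illustration under stated assumptions, not the job definitions evaluated in this work: the function names (`map_record`, `reduce_pairs`, `multiple_group_by`), the sample records, and the choice of a sum aggregate are all hypothetical.

```python
from collections import defaultdict


def map_record(record, group_by_columns, measure):
    """Map step: emit one ((column, value), measure) pair per Group-by column."""
    for column in group_by_columns:
        yield (column, record[column]), record[measure]


def reduce_pairs(pairs):
    """Reduce step: sum the measure values that share the same key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def multiple_group_by(records, group_by_columns, measure):
    """Run the map step over every record, then reduce all emitted pairs."""
    pairs = (
        pair
        for record in records
        for pair in map_record(record, group_by_columns, measure)
    )
    return reduce_pairs(pairs)


# Hypothetical sample data: two Group-by columns and one measure.
records = [
    {"region": "north", "product": "a", "sales": 10},
    {"region": "north", "product": "b", "sales": 5},
    {"region": "south", "product": "a", "sales": 7},
]

# A single pass over the data answers both "GROUP BY region"
# and "GROUP BY product".
result = multiple_group_by(records, ["region", "product"], "sales")
```

Tagging each key with its column name keeps the aggregates of the different Group-by clauses separate, so one scan of the data can serve several groupings at once. In a real MapReduce deployment the pairs would be shuffled across the cluster by key between the two steps.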