Xibei Gongye Daxue Xuebao (Jun 2018)

Data Balance Algorithm Based on Histogram in MapReduce

  • ,
  • ,

DOI
https://doi.org/10.1051/jnwpu/20183630480
Journal volume & issue
Vol. 36, no. 3
pp. 480 – 486

Abstract

Read online

MapReduce model is a typical distributed computing model, which is widely used in large-scale data processing, and its performance depends largely on the data distribution status. As the data content is often unbalanced, coupled with the storage of randomness, so MapReduce model prone to data skew problem in the calculation process. In order to solve this problem, this paper establishes a data histogram for the data block and the whole file through the improved parallel histogram parallelization algorithm based on MapReduce. According to the data block distribution, we can judge the data skew degree of each storage nodes and define the file equilibrium deviation value as the measure of data skew, and then the data balance algorithm is used to reduce the file equilibrium deviation value. The improved MapReduce-based data histogram parallel construction algorithm can adapt to various types of data application scenarios. In the process of building the histogram, the Map side only needs to transmit histogram statistics to the Reduce side without transmitting the contents of the file. The data transfer can be almost negligible. The data balance algorithm based on histogram employs greedy strategy, which can obtain a better approximate solution of the optimal solution of equilibrium distribution. After several experiments, compared with the random block distribution algorithm, the improved algorithm reduce about 40% of the file balance deviation value and achieves a better data balance performance.

Keywords