Journal of King Saud University: Computer and Information Sciences (Mar 2024)
IDaPS — Improved data-locality aware data placement strategy based on Markov clustering to enhance MapReduce performance on Hadoop
Abstract
Executing MapReduce applications on a Hadoop cluster poses significant challenges when data locality is not considered, i.e., when tasks are not assigned to the compute nodes where their input data sets are located. Ignoring locality causes high data-transfer overheads: input data must be moved across the network, which adds latency and significantly increases execution time. To address this issue, an Improved DAta Placement Strategy (IDaPS) based on the intra-dependency among the data is proposed. IDaPS reorganizes the default data layouts in HDFS to ensure a higher degree of parallelism. The efficiency of IDaPS is demonstrated on Hadoop clusters of 10 and 15 nodes by executing Hadoop benchmark performance tests, viz. WordCount and Grep on the Project Gutenberg book dataset (50 GB) and Least Square Linear Regression (LSLR) on a weather dataset (10.67 GB). The results were compared with state-of-the-art algorithms, viz. Hadoop Default Data Placement (HDDP), Load-Balancer, and RENDA from the literature. The results demonstrate that IDaPS reduces execution time by 28.2% and 38.4% on the 10-node and 15-node clusters for WordCount, and by 35% and 38.1% on the 10-node and 15-node clusters for Grep. Similarly, for LSLR, it reduces execution time by 32.7%.
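As an illustration of the clustering idea named in the title, the following is a minimal sketch of generic Markov clustering (MCL) applied to a data-dependency matrix. It is not the IDaPS algorithm itself: the abstract does not specify how dependencies are measured or how clusters are mapped to nodes, so the function names, the `weights` matrix (assumed here to hold symmetric co-access weights between data blocks), the inflation parameter, and the convergence threshold are all assumptions made for illustration.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col_sums[j] if col_sums[j] else 0.0
             for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain square-matrix product, kept dependency-free."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def markov_cluster(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Group items (hypothetically, data blocks) with generic MCL.

    weights[i][j] is an assumed dependency weight between items i and j.
    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    n = len(weights)
    # Add self-loops, as is customary in MCL, then make columns stochastic.
    m = [[float(weights[i][j]) + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    m = normalize_columns(m)

    for _ in range(max_iter):
        expanded = matmul(m, m)  # expansion: two-step random walk
        inflated = normalize_columns(  # inflation: strengthen strong flows
            [[x ** inflation for x in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break

    # Each row with non-negligible mass lists the members of one cluster.
    clusters = []
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical example: blocks 0-2 are mutually dependent, as are blocks 3-5.
example_weights = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
example_clusters = markov_cluster(example_weights)  # [[0, 1, 2], [3, 4, 5]]
```

In a placement strategy of this kind, each resulting cluster would presumably be a candidate for co-location on one node or spread according to a locality policy; how IDaPS uses its clusters is described in the full paper, not in this sketch.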