Sensors (May 2021)

Experimental Analysis in Hadoop MapReduce: A Closer Look at Fault Detection and Recovery Techniques

  • Muntadher Saadoon,
  • Siti Hafizah Ab Hamid,
  • Hazrina Sofian,
  • Hamza Altarturi,
  • Nur Nasuha,
  • Zati Hakim Azizul,
  • Asmiza Abdul Sani,
  • Adeleh Asemi

DOI
https://doi.org/10.3390/s21113799
Journal volume & issue
Vol. 21, no. 11
p. 3799

Abstract


Hadoop MapReduce reactively detects and recovers from faults after they occur, relying on static heartbeat detection and re-execution from scratch. However, these techniques incur excessive response time penalties and inefficient resource consumption during detection and recovery. Existing fault-tolerance solutions attempt to mitigate these limitations without considering critical conditions such as fail-slow faults, the impact of faults at various infrastructure levels, and the relationship between the detection and recovery stages. This paper analyses the response time under two main conditions, fail-stop and fail-slow, when faults manifest at the node, service, and task levels at runtime. In addition, we focus on the relationship between the time taken to detect faults and the time taken to recover from them. The experimental analysis is conducted on a real Hadoop cluster comprising the MapReduce, YARN and HDFS frameworks. Our analysis shows that the recovery of a single fault leads to an average response time penalty of 67.6%. Even when the detection and recovery times are well-tuned, data locality and resource availability must also be considered to obtain the optimum tolerance time and the lowest penalties.

Keywords