Predictive modelling of MapReduce job performance in cloud environments using machine learning techniques

Mohammed Bergui; Soufiane Hourri; Said Najah; Nikola S. Nikolov

doi:10.1186/s40537-024-00964-z

Journal of Big Data (Jul 2024)

Predictive modelling of MapReduce job performance in cloud environments using machine learning techniques

Mohammed Bergui,
Soufiane Hourri,
Said Najah,
Nikola S. Nikolov

Affiliations

Mohammed Bergui: Laboratory of Intelligent Systems and Applications, Department of Computer Science, Faculty of Sciences and Technologies, University of Sidi Mohammed Ben Abdellah
Soufiane Hourri: Laboratory of Intelligent Systems and Applications, Department of Computer Science, Faculty of Sciences and Technologies, University of Sidi Mohammed Ben Abdellah
Said Najah: Laboratory of Intelligent Systems and Applications, Department of Computer Science, Faculty of Sciences and Technologies, University of Sidi Mohammed Ben Abdellah
Nikola S. Nikolov: Department of Computer Science and Information Systems, University of Limerick

DOI: https://doi.org/10.1186/s40537-024-00964-z
Journal volume & issue: Vol. 11, no. 1
pp. 1 – 20

Abstract

Read online

Abstract Within the Hadoop ecosystem, MapReduce stands as a cornerstone for managing, processing, and mining large-scale datasets. Yet, the absence of efficient solutions for precise estimation of job execution times poses a persistent challenge, impacting task allocation and distribution within Hadoop clusters. In this study, we present a comprehensive machine learning approach for predicting the execution time of MapReduce jobs, encompassing data collection, preprocessing, feature engineering, and model evaluation. Leveraging a rich dataset derived from comprehensive Hadoop MapReduce job traces, we explore the intricate relationship between cluster parameters and job performance. Through a comparative analysis of machine learning models, including linear regression, decision tree, random forest, and gradient-boosted regression trees, we identify the random forest model as the most effective, demonstrating superior predictive accuracy and robustness. Our findings underscore the critical role of features such as data size and resource allocation in determining job performance. With this work, we aim to enhance resource management efficiency and enable more effective utilisation of cloud-based Hadoop clusters for large-scale data processing tasks.

Published in Journal of Big Data

ISSN: 2196-1115 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journalofbigdata.springeropen.com

About the journal

Abstract

Keywords