Machine learning regression to boost scheduling performance in hyper-scale cloud-computing data centres

Damián Fernández-Cerero; José A. Troyano; Agnieszka Jakóbik; Alejandro Fernández-Montes

Journal of King Saud University: Computer and Information Sciences (Jun 2022)

Affiliations

Damián Fernández-Cerero: Department of Computer Languages and Systems, University of Seville, Avda. Reina Mercedes s/n., 41012 Seville, Spain; Corresponding author.
José A. Troyano: Department of Computer Languages and Systems, University of Seville, Avda. Reina Mercedes s/n., 41012 Seville, Spain
Agnieszka Jakóbik: Department of Computer Science, Cracow University of Technology, Cracow, Poland
Alejandro Fernández-Montes: Department of Computer Languages and Systems, University of Seville, Avda. Reina Mercedes s/n., 41012 Seville, Spain

Journal volume & issue: Vol. 34, no. 6, pp. 3191–3203

Abstract

Data centres are growing in size and complexity because of the increasing number of heterogeneous workloads and patterns they must serve. This mix of workloads with different purposes makes it difficult to optimise resource management systems according to temporal or application-level patterns. Data-centre operators have developed multiple resource-management models to improve scheduling performance in controlled scenarios. However, because workloads evolve constantly, relying on a single resource-management model is sub-optimal in some scenarios.

In this work, we propose: (a) a machine learning regression model based on gradient boosting to predict the time a resource manager needs to schedule incoming jobs for a given period; and (b) a resource management model, Boost, that uses this regression model to predict the scheduling time of a catalogue of resource managers so that the most performant one can be used for a given time span.

The benefits of the proposed resource-management model are analysed by comparing its scheduling performance KPIs with those of the two most popular resource-management models: two-level, used by Apache Mesos, and shared-state, employed by Google Borg. These gains are empirically evaluated by simulating a hyper-scale data centre that executes a realistic, synthetically generated workload following real-world trace patterns.
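
The idea summarised in the abstract (one gradient-boosting regressor per resource manager, then choosing the manager with the lowest predicted scheduling time) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the single input feature (jobs arriving in a period), the toy training numbers, the stump-based learner, and the names `fit_gbr`, `predict` and `pick_manager`.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Fit a one-split regression stump minimising squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds (both sides non-empty)
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = mean(left), mean(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # only one distinct x value: fall back to a constant
        m = mean(residuals)
        return xs[0], m, m
    _, t, lm, rm = best
    return t, lm, rm

def fit_gbr(xs, ys, n_rounds=200, lr=0.3):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = mean(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + lr * (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return base, lr, stumps

def predict(model, x):
    """Predicted scheduling time for feature value x."""
    base, lr, stumps = model
    return base + sum(lr * (lm if x <= t else rm) for t, lm, rm in stumps)

def pick_manager(models, x):
    """Choose the resource manager with the lowest predicted scheduling time."""
    return min(models, key=lambda name: predict(models[name], x))

# Toy, made-up training data (NOT from the paper): jobs per period versus
# observed scheduling time, one series per hypothetical resource manager.
jobs = [10, 20, 30, 40, 50, 60, 70, 80]
observed = {
    "two-level": [1.0, 1.5, 2.0, 3.0, 5.0, 8.0, 12.0, 17.0],
    "shared-state": [3.0, 3.1, 3.3, 3.6, 4.0, 4.5, 5.0, 5.5],
}
models = {name: fit_gbr(jobs, times) for name, times in observed.items()}

for load in (10, 80):
    print(load, pick_manager(models, load))
```

In this toy setting the selector switches managers as the load grows, which is the behaviour the abstract attributes to Boost. A real deployment would presumably use richer workload features and a production gradient-boosting library; neither is specified in this record.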

Published in Journal of King Saud University: Computer and Information Sciences

ISSN: 1319-1578 (Print)
Publisher: Elsevier
Country of publisher: Saudi Arabia
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://www.journals.elsevier.com/journal-of-king-saud-university-computer-and-information-sciences/
