Journal of Cloud Computing: Advances, Systems and Applications (Jun 2023)
Fast DRL-based scheduler configuration tuning for reducing tail latency in edge-cloud jobs
Abstract
Edge-cloud applications have become increasingly prevalent in recent years and pose the challenge of using both resource-constrained edge devices and elastic cloud resources under dynamic workloads. Efficient resource allocation for edge-cloud jobs via cluster schedulers (e.g. the Kubernetes/Volcano scheduler) is essential to guarantee their performance, e.g. tail latency, and such allocation is sensitive to scheduler configurations such as the applied scheduling algorithm and the task restart/discard policy. Deep reinforcement learning (DRL) is increasingly applied to optimize scheduling decisions. However, DRL faces the conundrum of achieving high rewards at a dauntingly long training time (e.g. hours or days), making it difficult to tune scheduler configurations online in accordance with dynamically changing edge-cloud workloads and resources. To address this issue, this paper proposes EdgeTuner, a fast scheduler configuration tuning approach that efficiently leverages DRL to reduce the tail latency of edge-cloud jobs. The enabling feature of EdgeTuner is to effectively simulate the execution of edge-cloud jobs under different scheduler configurations and thus quickly estimate these configurations’ influence on job performance. The simulation results allow EdgeTuner to train a DRL agent in a timely manner so that it can properly tune scheduler configurations in dynamic edge-cloud environments. We implement EdgeTuner in both the Kubernetes and Volcano schedulers and extensively evaluate it on real workloads driven by Alibaba production traces. Our results show that EdgeTuner outperforms prevailing scheduling algorithms by achieving much lower tail latency while accelerating DRL training speed by an average of 151.63x.
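The core idea the abstract describes, replacing slow live-cluster rollouts with a fast simulator that scores each scheduler configuration so a DRL agent can be trained quickly, can be sketched in a few lines. The Python below is purely illustrative and not EdgeTuner's implementation: the configuration space, the simulate_tail_latency() model, and the one-step tabular agent are hypothetical stand-ins for the paper's trace-driven simulator and DRL agent.

```python
# Minimal sketch of simulation-driven configuration tuning (illustrative only).
# Assumptions: a toy latency model stands in for the trace-driven simulator,
# and a one-step tabular agent (contextual-bandit style) stands in for DRL.
import random

# Hypothetical scheduler configuration space (the agent's action space).
ALGORITHMS = ["fifo", "bin-pack", "spread"]   # candidate scheduling algorithms
POLICIES = ["restart", "discard"]             # task restart/discard policies
ACTIONS = [(a, p) for a in ALGORITHMS for p in POLICIES]

def simulate_tail_latency(workload_level, algo, policy):
    """Toy simulator: return a synthetic tail latency (ms) for one workload
    level under one configuration. A real simulator would instead replay
    production job traces against the configured cluster scheduler."""
    base = 100 + 40 * workload_level
    algo_factor = {"fifo": 1.3, "bin-pack": 1.0, "spread": 1.1}[algo]
    policy_factor = 0.9 if policy == "restart" else 1.05
    return base * algo_factor * policy_factor * random.uniform(0.9, 1.1)

def train_agent(episodes=2000, epsilon=0.1, alpha=0.2, num_states=3):
    """Learn Q-values over (workload state, configuration) pairs.
    Reward is negative simulated tail latency, so the agent converges to the
    configuration that minimizes tail latency for each workload state."""
    q = {(s, a): 0.0 for s in range(num_states) for a in range(len(ACTIONS))}
    for _ in range(episodes):
        state = random.randrange(num_states)       # observed workload level
        if random.random() < epsilon:              # explore a random config
            action = random.randrange(len(ACTIONS))
        else:                                      # exploit the best so far
            action = max(range(len(ACTIONS)), key=lambda a: q[(state, a)])
        algo, policy = ACTIONS[action]
        reward = -simulate_tail_latency(state, algo, policy)
        q[(state, action)] += alpha * (reward - q[(state, action)])
    return q

q = train_agent()
for state in range(3):
    best = max(range(len(ACTIONS)), key=lambda a: q[(state, a)])
    print(f"workload level {state}: tuned configuration = {ACTIONS[best]}")
```

Because every reward comes from the simulator rather than from executing jobs on a live cluster, thousands of training episodes complete in seconds; this is the mechanism behind the training-speed gains the abstract reports, though the paper's agent and simulator are far more detailed than this sketch.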
Keywords