Applied Sciences (Dec 2024)

Optimization of High-Performance Computing Job Scheduling Based on Offline Reinforcement Learning

  • Shihao Li
  • Wei Dai
  • Yongyan Chen
  • Bo Liang

DOI: https://doi.org/10.3390/app142311220
Journal volume & issue: Vol. 14, no. 23, p. 11220

Abstract

In large-scale, distributed high-performance computing systems, the complexity of job scheduling has grown alongside computational resources and job diversity. While heuristic scheduling strategies with various optimization objectives have shown promising results, their effectiveness in real-world applications is often limited by the dynamic nature of workloads and system configurations. Deep reinforcement learning (DRL) methods offer the potential to address these scheduling challenges; however, their trial-and-error learning can lead to suboptimal performance or wasted resources in the early stages of deployment. To mitigate these risks, this paper introduces an offline reinforcement learning-based job scheduling method. By training on historical data, the method avoids deploying immature policies in live environments. We constructed an offline dataset by combining expert scheduling trajectories with early-stage trial data from online reinforcement learning, which enables the development of more robust scheduling policies. Experimental results demonstrate that, compared to heuristic and online DRL algorithms, the proposed approach achieves more efficient scheduling performance across various workloads and optimization goals, showcasing its practicality and broad applicability.
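The dataset-construction step described in the abstract lends itself to a brief illustration. Below is a minimal Python sketch of mixing expert scheduling trajectories with early-stage online-RL trial data into a fixed offline dataset. The Transition layout, the mix_offline_dataset helper, and the mixing ratio are illustrative assumptions, not details taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# One transition in the usual offline-RL form:
# (state, action, reward, next_state, done). The concrete state/action
# encoding for HPC job scheduling (queue snapshot, node availability,
# chosen job index, ...) is an assumption; the abstract does not specify it.
Transition = Tuple[List[float], int, float, List[float], bool]

@dataclass
class OfflineDataset:
    """A fixed buffer of logged transitions; offline RL trains only on
    this data, without interacting with the live scheduler."""
    transitions: List[Transition]

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform mini-batch sampling from the static dataset.
        return random.sample(self.transitions, batch_size)

def mix_offline_dataset(expert: List[Transition],
                        exploratory: List[Transition],
                        expert_ratio: float = 0.8) -> OfflineDataset:
    """Combine expert scheduling trajectories with early-stage trial
    data from online RL. The target `expert_ratio` (default 0.8) is a
    hypothetical knob, not a value reported in the paper."""
    # Take enough exploratory transitions so that expert data makes up
    # roughly `expert_ratio` of the mixed dataset.
    n_trial = min(len(exploratory),
                  int(len(expert) * (1 - expert_ratio) / expert_ratio))
    mixed = expert + random.sample(exploratory, n_trial)
    random.shuffle(mixed)
    return OfflineDataset(mixed)

# Toy usage: two expert transitions and two exploratory ones.
expert_traj = [([0.1, 0.9], 0, 1.0, [0.2, 0.8], False),
               ([0.2, 0.8], 1, 0.5, [0.3, 0.7], True)]
trial_traj = [([0.5, 0.5], 1, -0.1, [0.6, 0.4], False),
              ([0.6, 0.4], 0, 0.2, [0.7, 0.3], True)]
dataset = mix_offline_dataset(expert_traj, trial_traj, expert_ratio=0.5)
batch = dataset.sample(batch_size=2)
```

An offline RL algorithm (e.g., a conservative Q-learning variant) would then draw mini-batches from this fixed dataset during training; keeping expert trajectories dominant biases the learned policy toward behavior known to be safe in production.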

Keywords