Journal of Big Data (May 2023)
Design and implementation of I/O performance prediction scheme on HPC systems through large-scale log analysis
Abstract
Abstract Large-scale high performance computing (HPC) systems typically consist of many thousands of CPUs and storage units used by hundreds to thousands of users simultaneously. Applications from large numbers of users have diverse characteristics, such as varying computation, communication, memory, and I/O intensity. A good understanding of the performance characteristics of each user application is important for job scheduling and resource provisioning. Among these performance characteristics, I/O performance is becoming increasingly important as data sizes rapidly increase and large-scale applications, such as simulation and model training, are widely adopted. However, predicting I/O performance is difficult because I/O systems are shared among all users and involve many layers of software and hardware stack, including the application, network interconnect, operating system, file system, and storage devices. Furthermore, updates to these layers and changes in system management policy can significantly alter the I/O behavior of applications and the entire system. To improve the prediction of the I/O performance on HPC systems, we propose integrating information from several different system logs and developing a regression-based approach to predict the I/O performance. Our proposed scheme can dynamically select the most relevant features from the log entries using various feature selection algorithms and scoring functions, and can automatically select the regression algorithm with the best accuracy for the prediction task. The evaluation results show that our proposed scheme can predict the write performance with up to 90% prediction accuracy and the read performance with up to 99% prediction accuracy using the real logs from the Cori supercomputer system at NERSC.
Keywords