Applied Sciences (Jul 2021)

Improvements to Supercomputing Service Availability Based on Data Analysis

  • Jae-Kook Lee
  • Min-Woo Kwon
  • Do-Sik An
  • Junweon Yoon
  • Taeyoung Hong
  • Joon Woo
  • Sung-Jun Kim
  • Guohua Li

DOI
https://doi.org/10.3390/app11136166
Journal volume & issue
Vol. 11, no. 13
p. 6166

Abstract

As demand for high-performance computing (HPC) resources grows in computational science, service availability has become an unavoidable concern for large cluster systems such as supercomputers. In particular, the factor that most affects the availability of supercomputing services is the job scheduler used to allocate resources. An analysis of the job data that users submitted through the job scheduler showed that 25.6% of jobs failed because of program errors, scheduler errors, or I/O errors. Based on this analysis, we propose a K-hook scheduling method to increase the success rate of job submissions and improve the availability of supercomputing services. Applying this method improved the job-submission success rate by 15% without negatively affecting users’ waiting time. We also achieved a mean time between interrupts (MTBI) of 24.3 days and maintained average system availability at 97%. Because the method was verified on the Nurion supercomputer in a real service environment, it is expected to yield significant service improvements.

Keywords