IEEE Access (Jan 2020)

Dynamic Traffic Control of Staging Traffic on the Interconnect of the HPC Cluster System

  • Arata Endo,
  • Hiroki Ohtsuji,
  • Erika Hayashi,
  • Eiji Yoshida,
  • Chunghan Lee,
  • Susumu Date,
  • Shinji Shimojo

DOI
https://doi.org/10.1109/ACCESS.2020.3035158
Journal volume & issue
Vol. 8
pp. 198518 – 198531

Abstract

High-performance computing (HPC) cluster systems sometimes adopt a two-layered file system composed of local and global file systems to achieve both capacity and performance in storage. In such a cluster system, the input data of an application must be staged in from the global storage to the local storage, and the output data must be staged out from the local storage to the global storage. This staging operation must be performed quickly and efficiently to achieve higher job throughput, because an inefficient staging operation delays the execution of waiting job requests. In particular, in a cluster system whose oversubscribed interconnect is shared by the storage and the computing nodes, inter-node communication traffic and staging traffic collide, which may degrade job throughput. In this research, we focus on this collision between inter-node communication and staging traffic, targeting cluster systems with an oversubscribed interconnect that carries both types of traffic. Specifically, we investigate whether dynamically controlling the traffic flow derived from the staging operation improves job throughput. For this investigation, we present a traffic collision avoidance method that dynamically configures a set of data paths for each type of traffic only while the staging operation is being conducted. The evaluation in this article shows that the proposed method avoids traffic collisions and accelerates the staging operation by 22.0% on our cluster system. The evaluation also indicates that the overhead the proposed method imposes on the application is negligible. Furthermore, the proposed method reduces the job execution time by 8.7%.
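
The idea of giving each traffic type its own set of data paths only while staging is in progress can be illustrated with a toy model. The sketch below is not the authors' implementation: the class `PathAllocator`, the uplink names, and the `staging_share` parameter are hypothetical, and the abstract does not say how paths are actually selected or configured on the interconnect.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PathAllocator:
    """Toy model (hypothetical): separate uplinks per traffic type during staging."""

    uplinks: list[str]           # assumed names of the shared interconnect uplinks
    staging_share: int = 1       # assumed number of uplinks reserved for staging
    staging_active: bool = False

    def __post_init__(self) -> None:
        # Both traffic types need at least one path while staging is active.
        if not 0 < self.staging_share < len(self.uplinks):
            raise ValueError("staging_share must leave paths for both traffic types")

    def begin_staging(self) -> None:
        self.staging_active = True

    def end_staging(self) -> None:
        self.staging_active = False

    def paths_for(self, traffic_type: str) -> list[str]:
        """Return the uplinks that the given traffic type may currently use."""
        if traffic_type not in ("staging", "inter_node"):
            raise ValueError(f"unknown traffic type: {traffic_type}")
        if not self.staging_active:
            # No staging in progress: no separation, every uplink is shared.
            return list(self.uplinks)
        if traffic_type == "staging":
            return self.uplinks[: self.staging_share]
        return self.uplinks[self.staging_share :]


alloc = PathAllocator(["up0", "up1", "up2", "up3"], staging_share=1)
shared_paths = alloc.paths_for("inter_node")     # all uplinks, staging is idle
alloc.begin_staging()
staging_paths = alloc.paths_for("staging")       # reserved uplinks only
compute_paths = alloc.paths_for("inter_node")    # remaining uplinks only
alloc.end_staging()
restored_paths = alloc.paths_for("inter_node")   # back to all uplinks
```

In this model the two path sets are disjoint while staging is active, so the two traffic types cannot share an uplink, and the split is removed as soon as staging ends so that inter-node communication regains all paths.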

Keywords