IEEE Access (Jan 2022)

Scale-Train: A Scalable DNN Training Framework for a Heterogeneous GPU Cloud

  • Kyeonglok Kim,
  • Hyeonsu Lee,
  • Seungmin Oh,
  • Euiseong Seo

DOI: https://doi.org/10.1109/ACCESS.2022.3184692
Journal volume & issue: Vol. 10, pp. 68468–68481

Abstract


To cope with the growing scale of deep neural network (DNN) models and training data, the use of cloud computing for distributed DNN training is becoming increasingly popular. The amount of resources available in a cloud changes continuously according to users' demands. Although distributed DNN training runs for a long time, ranging from several hours to several days, existing frameworks either do not support dynamic scaling or incur high scale-in/out overhead. It is therefore difficult to achieve higher performance by adding graphics processing unit (GPU) nodes to a running training cluster, even when surplus GPU resources become available. In addition, the inability to dynamically reconfigure the training cluster prevents reshaping the cluster topology when it was created sub-optimally. This paper proposes a dynamic scaling technique that allows new workers to be added and removed without suspending the ongoing training job. We also propose a heterogeneity-aware, straggler-proof technique so that, even when the performance of the GPUs in the cloud is uneven, adding surplus resources still yields a performance benefit. The proposed scheme improved throughput by up to a factor of 17.52 compared with the existing checkpoint-based scheme while scaling an existing cluster out from five workers to ten. Furthermore, it continued training at 95.52% of the maximum performance during scaling, whereas training was stopped for 841 seconds in Elastic Horovod, which also supports dynamic scaling. Finally, even when GPUs of different performance levels were mixed, the error between the determined batch size and the optimal batch size was 3.37% on average.
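The heterogeneity-aware idea described above amounts to giving each worker a local batch size that matches its measured speed so that no slow GPU becomes a straggler. The sketch below is a minimal illustration of that proportional-split idea under our own assumptions; it is not Scale-Train's actual algorithm, and the function and worker names are hypothetical.

```python
# Minimal sketch (assumption): split a fixed global batch across heterogeneous
# GPU workers in proportion to each worker's measured training throughput,
# so that all workers finish an iteration at roughly the same time.
# Names and logic are illustrative only, not the paper's implementation.

from typing import Dict


def assign_batch_sizes(global_batch: int,
                       throughput: Dict[str, float]) -> Dict[str, int]:
    """Return a per-worker batch size proportional to measured throughput."""
    total = sum(throughput.values())
    sizes = {w: int(round(global_batch * t / total))
             for w, t in throughput.items()}
    # Correct rounding drift so the local sizes still sum to the global batch.
    drift = global_batch - sum(sizes.values())
    fastest = max(throughput, key=throughput.get)
    sizes[fastest] += drift
    return sizes


if __name__ == "__main__":
    # Hypothetical throughputs (samples/sec) for a mixed-GPU cluster.
    tp = {"v100-0": 900.0, "v100-1": 905.0, "t4-0": 320.0, "t4-1": 315.0}
    print(assign_batch_sizes(global_batch=1024, throughput=tp))
```

In such a scheme, the per-worker sizes would be recomputed whenever workers are added or removed, which is consistent with the paper's goal of scaling the cluster without suspending the ongoing training job.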

Keywords