Collective Communication Performance Evaluation for Distributed Deep Learning Training

Sookwang Lee; Jaehwan Lee

doi:10.3390/app14125100

Applied Sciences (Jun 2024)

Affiliations

Sookwang Lee: Supercomputing Technology Research Center, Electronics and Telecommunications Research Institute, Daejeon 34054, Republic of Korea
Jaehwan Lee: Department of Computer Engineering, Korea Aerospace University, Goyang 10540, Republic of Korea

DOI: https://doi.org/10.3390/app14125100
Journal volume & issue: Vol. 14, no. 12, p. 5100

Abstract

In distributed deep learning, improper use of a collective communication library can degrade training performance by increasing communication time. Representative collective communication libraries such as MPI, GLOO, and NCCL perform differently depending on the server environment and communication architecture. In this study, we investigate three aspects of collective communication library performance for distributed deep learning in an intra-node environment. First, we compare and analyze library performance within common distributed deep learning architectures, namely parameter server and ring all-reduce. Second, we evaluate the libraries in different environments, including various container platforms and bare-metal setups, considering the scalability and flexibility advantages offered by cloud virtualization. Finally, to ensure practicality, we assess the libraries' performance both in a Linux shell and within the PyTorch framework. In the cross-Docker virtualization environment, NCCL shows up to 213% higher latency than in single Docker, while GLOO exhibits 36% lower latency in single Docker than in cross Docker. In all-reduce operations, NCCL achieves up to 345% lower execution time than the other libraries (MPI and GLOO). These findings will inform the selection of an appropriate collective communication library when designing effective distributed deep learning environments.
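For readers unfamiliar with the ring all-reduce pattern named in the abstract, the following is a minimal single-process sketch in pure Python. It is an illustration only, not the authors' benchmark code and not how MPI, GLOO, or NCCL implement the operation. The function name `ring_all_reduce`, the list-based buffers, and the message counter are all assumptions made for this example.

```python
from typing import List, Tuple


def ring_all_reduce(buffers: List[List[float]]) -> Tuple[List[List[float]], int]:
    """Simulate a sum all-reduce over a logical ring of workers.

    Hypothetical helper for illustration: `buffers[i]` is worker i's local
    vector. Returns the per-worker results and the number of chunk messages
    exchanged. Everything runs in one process; no real communication occurs.
    """
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers):
        raise ValueError("all workers must hold vectors of the same length")

    # Split each vector into n contiguous, nearly equal chunks.
    bounds = [(k * length) // n for k in range(n + 1)]
    data = [list(b) for b in buffers]  # copy so inputs stay untouched
    messages = 0

    def chunk(k: int) -> slice:
        return slice(bounds[k], bounds[k + 1])

    # Phase 1 (reduce-scatter): after n-1 steps, worker i holds the fully
    # summed chunk (i + 1) % n.
    for step in range(n - 1):
        # Gather all outgoing payloads first so sends are "simultaneous".
        outgoing = [((rank - step) % n, None) for rank in range(n)]
        outgoing = [(k, data[rank][chunk(k)]) for rank, (k, _) in enumerate(outgoing)]
        for rank, (k, payload) in enumerate(outgoing):
            dst = (rank + 1) % n
            sl = chunk(k)
            data[dst][sl] = [a + b for a, b in zip(data[dst][sl], payload)]
            messages += 1

    # Phase 2 (all-gather): circulate the completed chunks around the ring.
    for step in range(n - 1):
        outgoing = [((rank + 1 - step) % n, None) for rank in range(n)]
        outgoing = [(k, data[rank][chunk(k)]) for rank, (k, _) in enumerate(outgoing)]
        for rank, (k, payload) in enumerate(outgoing):
            dst = (rank + 1) % n
            data[dst][chunk(k)] = payload
            messages += 1

    return data, messages


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    reduced, sent = ring_all_reduce(grads)
    print(reduced, sent)
```

In this sketch each worker sends one chunk per step to its ring neighbour, giving 2(n-1) steps in total. A parameter-server layout would instead route every worker's full vector through a central node. Real libraries add transport selection, pipelining, and GPU-aware paths that this simulation does not model.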

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
