Journal of Cloud Computing: Advances, Systems and Applications (Feb 2023)

A bidirectional DNN partition mechanism for efficient pipeline parallel training in cloud

  • Lingyun Cui,
  • Zhihao Qu,
  • Guomin Zhang,
  • Bin Tang,
  • Baoliu Ye

DOI
https://doi.org/10.1186/s13677-022-00382-7
Journal volume & issue
Vol. 12, no. 1
pp. 1 – 12

Abstract

Recently, deep neural networks (DNNs) have shown great promise in many fields, while their parameter sizes are rapidly expanding. To break through the computation and memory limitations of a single machine, pipeline model parallelism has been proposed for large-scale DNN training, fully utilizing the computation and storage power of a distributed cluster. Cloud data centers can also provide sufficient computing, storage, and bandwidth resources. However, most existing approaches apply layer-wise partitioning, which struggles to produce an even model partition because of the large discrepancy in computational overhead between DNN layers, resulting in degraded efficiency. To tackle this issue, we propose "Bi-Partition", a novel partitioning method based on bidirectional partitioning of forward propagation (FP) and backward propagation (BP), which improves the efficiency of the pipeline model parallelism system. By deliberately designing distinct cut positions for the FP and BP of DNN training, workers in the pipeline receive nearly equal computational loads, and the balanced pipeline fully utilizes the computing resources. Experiments on various DNN models and datasets validate the efficiency of our mechanism, e.g., training up to 1.9× faster than the state-of-the-art method PipeDream.
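
To make the intuition behind bidirectional partitioning concrete, the toy sketch below is a minimal illustration, not the paper's actual algorithm: the per-layer FP/BP costs, the brute-force search, and all function names are assumptions chosen for the example. It compares a single layer-wise cut shared by FP and BP against separate FP and BP cut positions on a hypothetical 4-layer model with 2 workers, showing how decoupling the cuts can even out per-worker load.

```python
# Illustrative sketch only: balancing per-worker load by choosing *separate*
# cut points for forward (FP) and backward (BP) passes, versus one shared
# layer-wise cut. Per-layer costs are made-up example numbers.
from itertools import combinations

fp_cost = [5, 1, 1, 1]   # hypothetical FP time per layer (ms)
bp_cost = [2, 2, 2, 10]  # hypothetical BP time per layer (ms)
n_layers = len(fp_cost)
n_workers = 2

def stage_loads(costs, cuts):
    """Sum per-layer costs within each contiguous segment defined by `cuts`."""
    bounds = [0, *cuts, len(costs)]
    return [sum(costs[a:b]) for a, b in zip(bounds, bounds[1:])]

def imbalance(loads):
    """Max minus min stage load; 0 means a perfectly even pipeline."""
    return max(loads) - min(loads)

def total_loads(fp_cuts, bp_cuts):
    """Per-worker load when FP and BP segments may be cut at different layers."""
    return [f + b for f, b in zip(stage_loads(fp_cost, fp_cuts),
                                  stage_loads(bp_cost, bp_cuts))]

# Layer-wise partition: one cut position shared by FP and BP.
best_shared = min(combinations(range(1, n_layers), n_workers - 1),
                  key=lambda cuts: imbalance(total_loads(cuts, cuts)))

# Bidirectional partition: FP and BP may be cut at different layers.
best_fp, best_bp = min(
    ((fc, bc)
     for fc in combinations(range(1, n_layers), n_workers - 1)
     for bc in combinations(range(1, n_layers), n_workers - 1)),
    key=lambda p: imbalance(total_loads(p[0], p[1])))

print("layer-wise    cut", best_shared,
      "per-worker load", total_loads(best_shared, best_shared))
print("bidirectional FP cut", best_fp, "BP cut", best_bp,
      "per-worker load", total_loads(best_fp, best_bp))
```

With these example costs, the best shared cut leaves workers at 13 ms and 11 ms, while cutting FP after layer 2 and BP after layer 3 balances both workers at 12 ms, which is the effect the Bi-Partition mechanism exploits (the paper's actual partitioning algorithm and constraints differ from this brute-force toy).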

Keywords