Journal of Cloud Computing: Advances, Systems and Applications (Feb 2023)

A bidirectional DNN partition mechanism for efficient pipeline parallel training in cloud

  • Lingyun Cui,
  • Zhihao Qu,
  • Guomin Zhang,
  • Bin Tang,
  • Baoliu Ye

DOI
https://doi.org/10.1186/s13677-022-00382-7
Journal volume & issue
Vol. 12, no. 1
pp. 1 – 12

Abstract

Recently, deep neural networks (DNNs) have shown great promise in many fields, while their parameter sizes are rapidly expanding. To break through the computation and memory limitations of a single machine, pipeline model parallelism has been proposed for large-scale DNN training, fully utilizing the computation and storage power of a distributed cluster. Cloud data centers can also provide sufficient computing, storage, and bandwidth resources. However, most existing approaches apply layer-wise partitioning, which struggles to produce an even model partition because of the large discrepancy in computational overhead between DNN layers, resulting in degraded efficiency. To tackle this issue, we propose "Bi-Partition", a novel partitioning method based on bidirectional partitioning of forward propagation (FP) and backward propagation (BP), which improves the efficiency of the pipeline model parallelism system. By deliberately designing distinct cut positions for the FP and BP of DNN training, workers in the pipeline receive nearly equal computational loads, and the balanced pipeline fully utilizes the computing resources. Experiments on various DNN models and datasets validate the efficiency of our mechanism, e.g., training up to 1.9× faster than the state-of-the-art method PipeDream.
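
To make the intuition behind bidirectional partitioning concrete, the toy sketch below is a minimal illustration, not the paper's actual algorithm: the per-layer FP/BP costs, the brute-force search, and all function names are assumptions chosen for the example. It compares a single layer-wise cut shared by FP and BP against separate FP and BP cut positions on a hypothetical 4-layer model with 2 workers, showing how decoupling the cuts can even out per-worker load.

```python
# Illustrative sketch only: balancing per-worker load by choosing *separate*
# cut points for forward (FP) and backward (BP) passes, versus one shared
# layer-wise cut. Per-layer costs are made-up example numbers.
from itertools import combinations

fp_cost = [5, 1, 1, 1]   # hypothetical FP time per layer (ms)
bp_cost = [2, 2, 2, 10]  # hypothetical BP time per layer (ms)
n_layers = len(fp_cost)
n_workers = 2

def stage_loads(costs, cuts):
    """Sum per-layer costs within each contiguous segment defined by `cuts`."""
    bounds = [0, *cuts, len(costs)]
    return [sum(costs[a:b]) for a, b in zip(bounds, bounds[1:])]

def imbalance(loads):
    """Max minus min stage load; 0 means a perfectly even pipeline."""
    return max(loads) - min(loads)

def total_loads(fp_cuts, bp_cuts):
    """Per-worker load when FP and BP segments may be cut at different layers."""
    return [f + b for f, b in zip(stage_loads(fp_cost, fp_cuts),
                                  stage_loads(bp_cost, bp_cuts))]

# Layer-wise partition: one cut position shared by FP and BP.
best_shared = min(combinations(range(1, n_layers), n_workers - 1),
                  key=lambda cuts: imbalance(total_loads(cuts, cuts)))

# Bidirectional partition: FP and BP may be cut at different layers.
best_fp, best_bp = min(
    ((fc, bc)
     for fc in combinations(range(1, n_layers), n_workers - 1)
     for bc in combinations(range(1, n_layers), n_workers - 1)),
    key=lambda p: imbalance(total_loads(p[0], p[1])))

print("layer-wise    cut", best_shared,
      "per-worker load", total_loads(best_shared, best_shared))
print("bidirectional FP cut", best_fp, "BP cut", best_bp,
      "per-worker load", total_loads(best_fp, best_bp))
```

With these example costs, the best shared cut leaves workers at 13 ms and 11 ms, while cutting FP after layer 2 and BP after layer 3 balances both workers at 12 ms, which is the effect the Bi-Partition mechanism exploits (the paper's actual partitioning algorithm and constraints differ from this brute-force toy).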

Keywords