IEEE Access (Jan 2024)
Efficient Training of Large-Scale Neural Networks Using Linear Pipeline Broadcast
Abstract
Recently, deep learning models have been adopted across numerous domains and tasks, and the number of layers and parameters required to achieve the target performance has grown correspondingly. Consequently, the memory required for model training has increased, driving the adoption and exploration of distributed training. Even in distributed training, model parallelism techniques generally demand a large amount of memory. Among them, layer pipelining, which divides the model into layers and assigns the resulting stages to devices, has attracted considerable interest. Activation recomputation is a popular method for exploiting pipeline parallelism while minimizing memory consumption; however, it can reduce training throughput because of its redundant operations. Therefore, this study introduces a forward propagation technique that employs a linear pipeline broadcast method, which decreases memory consumption by partially integrating activation recomputation into PipeDream-Flush while mitigating the resulting reduction in training throughput. The proposed broadcast-based forward propagation offsets the overhead of activation recomputation by optimizing network communication between pipeline stages and reducing bubbles in the warm-up phase of the pipeline. Experimental results demonstrate that the proposed technique reduces memory consumption by approximately 36.0% at peak training throughput for GPT2 compared with PipeDream-Flush, without a significant decrease in training throughput. Relative to PipeDream-Flush, the proposed method also achieves peak training throughputs 14.6% and 12.6% higher for the ResNet152 and VGG19 models, respectively, while consuming 30.1% and 12.0% less memory.
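For illustration only, the following is a minimal sketch, not the authors' implementation, of the linear (chain-structured) pipeline broadcast idea: a tensor is split into chunks that flow along consecutive ranks, so each stage can forward one chunk downstream while receiving the next. The function name linear_pipeline_broadcast, the chunk count, and the use of torch.distributed point-to-point primitives are assumptions made for this sketch.

```python
# Minimal sketch (assumed PyTorch-based, not the paper's code) of a linear
# pipeline broadcast: rank 0 originates the tensor, and every rank forwards
# chunks to its successor so communication overlaps along the chain.
import torch
import torch.distributed as dist


def linear_pipeline_broadcast(tensor: torch.Tensor, num_chunks: int = 4) -> torch.Tensor:
    """Propagate `tensor` from rank 0 along ranks 0 -> 1 -> ... -> N-1 in chunks.

    Non-source ranks are expected to pass a pre-allocated buffer of the same shape.
    """
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    chunks = list(torch.chunk(tensor, num_chunks))  # views into `tensor`
    send_reqs = []

    for chunk in chunks:
        if rank > 0:
            # Receive this chunk from the predecessor stage (blocking).
            dist.recv(chunk, src=rank - 1)
        if rank < world_size - 1:
            # Forward the chunk asynchronously so the next receive can overlap.
            send_reqs.append(dist.isend(chunk, dst=rank + 1))

    for req in send_reqs:
        req.wait()
    return tensor
```

In this sketch, the process group is assumed to be initialized beforehand (e.g., via dist.init_process_group under torchrun); in an actual pipeline-parallel setup, the same chaining would be applied to the activations exchanged between adjacent stages during forward propagation.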
Keywords