IEEE Access (Jan 2018)
Efficient Scheduling in Training Deep Convolutional Networks at Large Scale
Abstract
The deep convolutional network is one of the most successful machine learning models of recent years. However, training large deep networks is a time-consuming process. Because these networks contain a large number of parameters, the efficiency of data-parallel training methods is usually limited by network communication speed. In this paper, we introduce two new algorithms to speed up the training of large deep networks on multiple machines: (1) a new scheduling algorithm that reduces communication delay in gradient transmission and (2) a new collective algorithm, based on a reverse-reduce tree, that reduces link contention. We implement our algorithms on top of the well-known library Caffe and obtain near-linear scaling on commodity Ethernet networks.
Keywords