IEEE Access (Jan 2018)

TensorLightning: A Traffic-Efficient Distributed Deep Learning on Commodity Spark Clusters

  • Seil Lee,
  • Hanjoo Kim,
  • Jaehong Park,
  • Jaehee Jang,
  • Chang-Sung Jeong,
  • Sungroh Yoon

DOI
https://doi.org/10.1109/ACCESS.2018.2842103
Journal volume & issue
Vol. 6
pp. 27671 – 27680

Abstract


With the recent success of deep learning, the amount of data and computation continues to grow daily. Hence, distributed deep learning systems that share the training workload have been researched extensively. Although scale-out distributed environments built from commodity servers are widely used, they are limited not only by synchronous operation and communication traffic, but also because combining deep neural network (DNN) training with existing clusters often demands additional hardware or migration between different cluster frameworks and libraries, which is highly inefficient. Therefore, we propose TensorLightning, which integrates the widely used data pipeline of Apache Spark with the powerful deep learning libraries Caffe and TensorFlow. TensorLightning employs a new parameter aggregation algorithm and parallel asynchronous parameter management schemes to relieve communication discrepancies and overhead. We redesign the elastic averaging stochastic gradient descent algorithm to exchange parameters in pruned, sparse form. Our approach provides fast and flexible DNN training with high accessibility. We evaluated the proposed framework with convolutional and recurrent neural network models; it reduces network traffic by 67% while converging faster.
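
To make the abstract's core idea concrete, the sketch below illustrates one way an elastic averaging SGD (EASGD) step can be combined with pruned, sparse parameter exchange, which is the general technique the abstract describes. This is a minimal illustration, not the paper's implementation; the function names and parameters (`easgd_worker_step`, `elastic_rho`, `prune_ratio`, `sparse_delta`) are hypothetical placeholders assumed for this example.

```python
import numpy as np

def easgd_worker_step(local_w, center_w, grad, lr=0.01,
                      elastic_rho=0.1, prune_ratio=0.99):
    """One illustrative worker step: local EASGD update plus a pruned,
    sparse deviation to send to the central parameter manager."""
    # Standard EASGD elastic force pulling the worker toward the center.
    elastic_force = elastic_rho * (local_w - center_w)
    local_w = local_w - lr * (grad + elastic_force)

    # Prune the exchanged update: keep only the largest-magnitude
    # deviations and ship them in sparse (index, value) form,
    # which is what cuts communication traffic.
    delta = local_w - center_w
    k = max(1, int((1.0 - prune_ratio) * delta.size))
    top_idx = np.argpartition(np.abs(delta), -k)[-k:]
    sparse_delta = (top_idx, delta[top_idx])
    return local_w, sparse_delta

def easgd_center_step(center_w, sparse_delta, lr=0.01, elastic_rho=0.1):
    """The center variable moves toward the worker using only the
    sparse deviations it received."""
    idx, vals = sparse_delta
    center_w[idx] += lr * elastic_rho * vals
    return center_w
```

In an asynchronous setting of this kind, each worker would call `easgd_worker_step` on its own schedule and the parameter manager would apply `easgd_center_step` as sparse updates arrive, so workers never block on one another.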

Keywords