Distributed deep learning training using silicon photonic switched architectures

Ziyi Zhu; Min Yee Teh; Zhenguo Wu; Madeleine Strom Glick; Shijia Yan; Maarten Hattink; Keren Bergman

doi:10.1063/5.0070711

APL Photonics (Mar 2022)

Affiliations

All authors: Department of Electrical Engineering, Columbia University, New York, New York 10027, USA

DOI: https://doi.org/10.1063/5.0070711
Journal volume & issue: Vol. 7, no. 3
pp. 030901 – 030901-11

Abstract

The scaling trends of deep learning models and distributed training workloads are challenging network capacities in today’s datacenters and high-performance computing (HPC) systems. We propose a system architecture that uses silicon photonic (SiP) switch-enabled server regrouping with bandwidth steering to address these challenges and accelerate distributed deep learning training. The architecture also uses a highly integrated, operating system-based SiP switch control scheme to reduce implementation complexity. To demonstrate feasibility, we built an experimental testbed with a SiP switch-enabled reconfigurable fat tree topology and evaluated the network performance of distributed ring all-reduce and parameter server workloads. The experimental results show up to a 3.6× improvement over the static non-reconfigurable fat tree. Our large-scale simulation results show that server regrouping can deliver up to a 2.3× improvement in flow throughput for a 2× tapered fat tree, and a further 11% improvement when higher-layer bandwidth steering is employed. Together, these results show the potential of integrating SiP switches into datacenters and HPC systems to accelerate distributed deep learning training.

Published in APL Photonics

ISSN: 2378-0967 (Online)
Publisher: AIP Publishing LLC
Country of publisher: United States
LCC subjects: Technology: Engineering (General). Civil engineering (General): Applied optics. Photonics
Website: https://aplphotonics.aip.org
