APL Photonics (Mar 2022)

Distributed deep learning training using silicon photonic switched architectures

  • Ziyi Zhu,
  • Min Yee Teh,
  • Zhenguo Wu,
  • Madeleine Strom Glick,
  • Shijia Yan,
  • Maarten Hattink,
  • Keren Bergman

DOI: https://doi.org/10.1063/5.0070711
Journal volume & issue: Vol. 7, no. 3, pp. 030901 – 030901-11

Abstract

The scaling trends of deep learning models and distributed training workloads are challenging network capacities in today’s datacenters and high-performance computing (HPC) systems. We propose a system architecture that leverages silicon photonic (SiP) switch-enabled server regrouping using bandwidth steering to tackle these challenges and accelerate distributed deep learning training. In addition, the proposed architecture uses a highly integrated, operating-system-based SiP switch control scheme to reduce implementation complexity. To demonstrate feasibility, we built an experimental testbed with a SiP switch-enabled reconfigurable fat tree topology and evaluated the network performance of distributed ring all-reduce and parameter server workloads. The experimental results show improvements of up to 3.6× over the static, non-reconfigurable fat tree. Our large-scale simulation results show that server regrouping can deliver up to 2.3× flow throughput improvement for a 2× tapered fat tree, and a further 11% improvement when higher-layer bandwidth steering is employed. Together, these results show the potential of integrating SiP switches into datacenters and HPC systems to accelerate distributed deep learning training.
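
The ring all-reduce collective evaluated on the testbed is the gradient-synchronization pattern whose neighbour-to-neighbour traffic the reconfigurable fat tree serves. As a point of reference only (this is not code from the paper, and the worker counts, chunk sizes, and function names are illustrative assumptions), the following minimal Python sketch shows the algorithm's two phases: a scatter-reduce pass in which each worker accumulates one chunk of the summed gradient, followed by an all-gather pass that circulates the reduced chunks around the ring.

# Illustrative simulation of ring all-reduce among n workers (not from the paper).
from typing import Callable, List


def ring_all_reduce(worker_grads: List[List[float]]) -> List[List[float]]:
    """Return, for every worker, the element-wise sum of all workers' gradients,
    computed by exchanging chunks only between ring neighbours."""
    n = len(worker_grads)
    dim = len(worker_grads[0])
    assert dim % n == 0, "for simplicity the gradient length must divide by n"
    chunk = dim // n
    bufs = [list(g) for g in worker_grads]  # work on copies of the inputs

    def exchange(chunk_of: Callable[[int], int], accumulate: bool) -> None:
        # Snapshot outgoing payloads so every send in a step uses pre-step data.
        outgoing = []
        for i in range(n):
            c = chunk_of(i)
            outgoing.append((c, bufs[i][c * chunk:(c + 1) * chunk]))
        for i, (c, payload) in enumerate(outgoing):
            dst = (i + 1) % n  # each worker sends to its clockwise neighbour
            for k, v in enumerate(payload):
                if accumulate:
                    bufs[dst][c * chunk + k] += v
                else:
                    bufs[dst][c * chunk + k] = v

    # Phase 1: scatter-reduce. After n-1 steps, worker i owns the fully
    # reduced chunk (i + 1) % n.
    for s in range(n - 1):
        exchange(lambda i: (i - s) % n, accumulate=True)

    # Phase 2: all-gather. The reduced chunks circulate until every worker
    # holds the complete summed gradient.
    for s in range(n - 1):
        exchange(lambda i: (i + 1 - s) % n, accumulate=False)

    return bufs


if __name__ == "__main__":
    grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
    print(ring_all_reduce(grads))  # both workers end with [11.0, 22.0, 33.0, 44.0]

Because every step moves data only between adjacent workers in the ring, the pattern generates the persistent, predictable neighbour flows that bandwidth steering with SiP switches is well suited to serve.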