IEEE Open Journal of the Communications Society (Jan 2024)
Efficient Algorithm for All-Gather Operation in Optical Interconnect Systems
Abstract
In parallel and distributed computation, the All-gather operation, in which each node in a distributed system gathers data from all other nodes, is pivotal. It underpins many high-performance computing (HPC) applications, notably distributed deep learning (DL), where it enables model and hybrid parallelism. Although optical interconnection networks promise unmatched bandwidth and reliability for data transfers between distributed nodes, most current All-gather algorithms remain optimized for electrical interconnects and therefore perform suboptimally on optical ones. This paper proposes “OpTree”, a scheme designed specifically for the All-gather operation in optical interconnect systems. OpTree constructs an optimal $m$-ary tree that minimizes communication time by determining the optimal number of communication stages. The number of communication steps required by OpTree is compared with that of existing All-gather algorithms, and the theoretical analysis shows that OpTree substantially reduces the number of communication steps in optical interconnects. The constraints that OpTree imposes on optical communication are also discussed. Simulation results show that: 1) OpTree is effective in generating an optimal $m$-ary tree that minimizes communication time. 2) For a 1024-node optical ring system, OpTree reduces communication time by 72.97%, 93.15%, and 86.32% compared with the WRHT, Ring, and Neighbor Exchange (NE) schemes, respectively, across different message sizes. 3) Across varying node counts, the reductions are 42.27%, 92.74%, and 85.49% against the same schemes. 4) Communication time decreases further as the number of wavelengths increases.
Keywords