Guangtongxin yanjiu (Oct 2024)
Application of Reconfigurable OCS Technology for Pre-training Large Language Models
Abstract
【Objective】Compared with Electronic Packet Switching (EPS), Optical Circuit Switching (OCS) offers advantages in latency, power consumption, cost, and stability. This study explores feasible applications of OCS in networks for large-model pre-training by analyzing the parallel partitioning strategies, collective communication requirements, traffic patterns, and current network architectures of such training tasks, so that the benefits of OCS can be fully exploited.【Methods】We propose a redundancy protection mechanism for network devices that uses multiple small-port-count OCS devices, enabling rapid switchover without interrupting the training task when a Top-of-Rack (ToR) switch fails. We further propose that OCS serve data-parallel traffic exclusively, so that the OCS needs to be configured only once, at the start of the task.【Results】We present several feasible opto-electronic networking architectures and their specific configurations under different AllReduce algorithms, including joint optimization of the collective communication algorithm and the architecture design to achieve optimal bandwidth.【Conclusion】When the traffic models of training tasks are properly taken into account, OCS can be integrated seamlessly into existing EPS network architectures and can improve large-model pre-training in terms of cost, power consumption, latency, and stability.
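To illustrate why a circuit switch serving only data-parallel traffic can be configured once at the start of a task, the following is a minimal sketch, assuming ring AllReduce is one of the algorithms considered. The abstract does not name the algorithms or give any port layout, so the function names and the port numbering here are hypothetical. The sketch derives the static set of cross-connects a ring needs and checks that it is a valid circuit map (each port used at most once as a source and once as a destination).

```python
def ring_cross_connects(num_groups):
    """Return (src, dst) circuits forming a one-directional ring.

    Hypothetical model: data-parallel group i is attached to OCS port i,
    and ring AllReduce only ever sends from group i to group (i + 1) % n.
    """
    if num_groups < 2:
        raise ValueError("a ring needs at least two groups")
    return [(i, (i + 1) % num_groups) for i in range(num_groups)]


def is_valid_circuit_map(circuits):
    """Check the circuit-switch constraint: no port appears twice as a
    source, and no port appears twice as a destination."""
    sources = [src for src, _ in circuits]
    destinations = [dst for _, dst in circuits]
    return (len(set(sources)) == len(sources)
            and len(set(destinations)) == len(destinations))


if __name__ == "__main__":
    circuits = ring_cross_connects(4)
    print(circuits)
    print(is_valid_circuit_map(circuits))
```

Because the neighbor set of a ring does not change during training, this mapping stays fixed for the whole task under the stated assumptions. Other AllReduce algorithms would need a different, and possibly time-varying, set of circuits.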