Dianxin kexue (Apr 2025)

Novel HIC-OTN for interconnection of cross-intelligent computing clusters

  • ZHANG Dechao,
  • SUN Jiang,
  • CAO Shan,
  • ZUO Mingqing,
  • WANG Dong,
  • LI Han

Journal volume & issue
Vol. 41
pp. 53 – 60

Abstract

With the rapid development of the global AI industry, the computing power demands of large-scale models continue to grow, prompting major technology companies worldwide to build ultra-large-scale clusters of more than 10 000, or even 100 000, GPUs. Because cluster construction is limited by natural resource supply, construction investment, and other constraints, interconnecting multiple clusters over a high-speed all-optical network is an important potential solution for efficient collaborative training across clusters. To meet the ultra-large bandwidth, ultra-low latency, and ultra-high reliability requirements of intelligent computing interconnection, a hitless intelligent computing optical transport network (HIC-OTN) and its key technologies are proposed. Based on HIC-OTN, the first field trial of cross-cluster pipeline parallelism (PP) training over 104 km was demonstrated, verifying the feasibility of 100 km-class cross-cluster PP training. Over an 800 Gbit/s HIC-OTN interconnection, highly efficient collaborative training was achieved in two scenarios (52 km and 104 km inter-cluster distances), delivering over 98% of the single-node training efficiency. Moreover, hitless and imperceptible optical network protection switching was demonstrated, with zero impact on training performance.
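
For a sense of scale behind the latency requirement, the fiber propagation delay over the two trial distances can be estimated with a short calculation. This is a rough sketch and not part of the reported work: the group index of about 1.468 is an assumed typical value for standard single-mode fiber, the function name is illustrative, and equipment and serialization delays are ignored.

```python
# Back-of-the-envelope fiber propagation delay for the 52 km and 104 km spans.
# Assumption (not from the paper): group index ~1.468, typical of standard
# single-mode fiber; OTN equipment and serialization delays are not included.

C_VACUUM_KM_PER_S = 299_792.458   # speed of light in vacuum, km/s
FIBER_GROUP_INDEX = 1.468         # assumed typical value


def one_way_delay_us(distance_km: float,
                     group_index: float = FIBER_GROUP_INDEX) -> float:
    """Return the one-way propagation delay in microseconds."""
    return distance_km * group_index / C_VACUUM_KM_PER_S * 1e6


if __name__ == "__main__":
    for km in (52, 104):
        delay = one_way_delay_us(km)
        print(f"{km} km: one-way {delay:.0f} us, round trip {2 * delay:.0f} us")
```

Under these assumptions, light takes roughly 5 µs per kilometre of fiber, so the 104 km span contributes about 0.5 ms one way before any equipment delay is added.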

Keywords