大数据 (Jan 2024)
Four issues to consider in building a computer system supporting large model training
Abstract
Three types of computer systems currently support large model training. Among them, systems built on domestic AI chips suffer from an immature software ecosystem. Changing this requires developing ten key software components, such as AI compilers and parallel acceleration. Systems based on supercomputers, in turn, need careful software-hardware co-design to serve large model training well. This article proposes a balanced design along four dimensions for large model training infrastructure, aimed at ensuring system performance, reliability, and scalability.