Jisuanji kexue (Computer Science), May 2022

Deep Neural Network Operator Acceleration Library Optimization Based on Domestic Many-core Processor

  • GAO Jie, LIU Sha, HUANG Ze-qiang, ZHENG Tian-yu, LIU Xin, QI Feng-bin

DOI
https://doi.org/10.11896/jsjkx.210500226
Journal volume & issue
Vol. 49, no. 5
pp. 355 – 362

Abstract

Operator acceleration libraries tailored to different hardware devices have become an indispensable part of deep learning frameworks, as they dramatically improve the performance of large-scale training and inference tasks. Current mainstream operator libraries are developed for GPU architectures and are not compatible with other heterogeneous designs. The SWDNN operator library, developed for the SW26010 processor, can neither fully exploit the performance of the upgraded SW26010 pro processor nor meet the demands of current large neural network models such as GPT-3 for large memory capacity and high memory access bandwidth. Based on the architectural characteristics of the SW26010 pro processor and the training requirements of large neural network models, a three-level parallel scheme and a neural network operator task scheduling scheme based on multiple core groups are proposed, which satisfy the memory requirements of large-model training and improve overall computing performance and parallel efficiency. A memory access optimization method with three asynchronous streams and overlapped computation and memory access is also proposed, which significantly alleviates the memory access bottleneck of neural network operators. Using these methods, the SWTensor many-core operator acceleration library is constructed for the SW26010 pro processor. Experimental results on the natural language processing model GPT-2 show that computation-intensive and memory-access-intensive operators in the SWTensor library reach up to 90.4% of the theoretical peak single-precision floating-point performance and 88.7% of the theoretical peak memory access bandwidth, respectively.
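
As a rough illustration of the computation/memory-access overlap idea summarized above, the sketch below shows a generic tile-by-tile loop in C in which the load of the next tile and the write-back of an earlier result proceed while the current tile is computed. The names dma_get_async, dma_put_async and dma_wait are hypothetical placeholders (implemented here as synchronous stand-ins so the sketch is self-contained); this is not the SWTensor or SW26010 pro API, only a minimal sketch of the overlap pattern under those assumptions.

/*
 * Minimal sketch of overlapping computation with asynchronous memory access.
 * The dma_* functions are hypothetical placeholders, implemented here as
 * plain synchronous memcpy so the example compiles and runs; on real hardware
 * they would map to the platform's asynchronous DMA primitives.
 */
#include <stddef.h>
#include <string.h>

#define TILE 512   /* elements per tile held in fast local memory */

static void dma_get_async(float *dst, const float *src, size_t n, int tag)
{ (void)tag; memcpy(dst, src, n * sizeof(float)); }

static void dma_put_async(float *dst, const float *src, size_t n, int tag)
{ (void)tag; memcpy(dst, src, n * sizeof(float)); }

static void dma_wait(int tag) { (void)tag; /* would block on the tagged DMA */ }

/* Example kernel: y[i] = 2 * x[i], processed one tile at a time. */
static void compute_tile(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = 2.0f * src[i];
}

void scale_overlapped(float *y, const float *x, size_t n_tiles)
{
    /* Two local buffers per direction, so loading tile t+1 and storing the
     * result of tile t-2 can proceed while tile t is being computed.      */
    float in_buf[2][TILE], out_buf[2][TILE];

    if (n_tiles == 0)
        return;

    dma_get_async(in_buf[0], x, TILE, 0);               /* prefetch tile 0 */

    for (size_t t = 0; t < n_tiles; ++t) {
        int cur = t & 1, nxt = (t + 1) & 1;

        if (t + 1 < n_tiles)                            /* start next load */
            dma_get_async(in_buf[nxt], x + (t + 1) * TILE, TILE, nxt);

        dma_wait(cur);                                  /* tile t has arrived */
        if (t >= 2)
            dma_wait(2 + cur);      /* store of tile t-2 freed out_buf[cur] */

        compute_tile(out_buf[cur], in_buf[cur], TILE);  /* compute tile t */
        dma_put_async(y + t * TILE, out_buf[cur], TILE, 2 + cur);
    }

    /* Drain the stores still in flight from the last two iterations. */
    dma_wait(2 + (int)((n_tiles - 1) & 1));
    if (n_tiles > 1)
        dma_wait(2 + (int)((n_tiles - 2) & 1));
}

In a full operator library, one would presumably run such a loop on each compute core, with tiles partitioned across the cores of a core group by the task scheduling scheme; the sketch only shows the single-core overlap pattern.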

Keywords