Jisuanji kexue (Feb 2023)

Tensor Instruction Generation Optimization Fusing with Loop Partitioning

  • LIANG Jiali, HUA Baojian, SU Shaobo

DOI: https://doi.org/10.11896/jsjkx.220300147
Journal volume & issue: Vol. 50, no. 2, pp. 374-383

Abstract

A tensor compiler compiles an operator's tensor algorithm and schedule into code for the target hardware. To accelerate tensor operations, special-purpose processors for deep learning adopt dedicated architectures with specialized instructions, supporting multi-core parallelism, multi-level special-purpose memory hierarchies, and tensor computation. On top of the hardware sits a tensor instruction set closely tied to the hardware's characteristics. In such a complex architecture, the use of tensor instructions is subject to many constraints and limitations, raising the following problems and challenges. First, the conditional branches introduced by loop tiling for computing-task division or data chunking increase the difficulty of pattern matching. Second, tensor instructions carry hardware constraints such as alignment and data layout. To address these problems and research challenges, a tensor instruction generation optimization algorithm based on loop partitioning is proposed. By partitioning the loop iteration interval, the algorithm eliminates the conditional branches introduced by task division or data segmentation. Instruction and hardware constraints are resolved by zero filling, substituting equivalent instructions, and adding extra computation, and tensor instructions are then generated by pattern matching. This work studies and extends the open-source deep learning compiler TVM version 0.7 and implements a compiler prototype system supporting tensor instruction generation for the DianNao-architecture machine learning accelerator. To evaluate the effectiveness of the algorithm, operator performance and development efficiency are tested for element-wise binary tensor operators, in-place unary tensor operators, and convolution operators on the DianNao-architecture machine learning accelerator hardware platform. Experimental results show that the average speedup across the three types of operators is 125.00%, the maximum speedup is 194.00%, and development efficiency improves by up to 7 times.
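
To illustrate the loop-partitioning idea described above, the following minimal Python sketch (not the paper's TVM implementation; TILE, vec_add, and the element-wise add operator are illustrative assumptions) shows how splitting a tiled loop's iteration interval removes the boundary branch, so the main region can be pattern-matched to a fixed-width tensor instruction and only the tail needs zero padding or a scalar fallback.

```python
import numpy as np

TILE = 8  # assumed width of the fixed-size tensor instruction

def vec_add(dst, a, b, base):
    """Stand-in for a fixed-width tensor instruction:
    dst[base:base+TILE] = a[base:base+TILE] + b[base:base+TILE]."""
    dst[base:base + TILE] = a[base:base + TILE] + b[base:base + TILE]

def add_tiled_with_branch(dst, a, b, n):
    # Before partitioning: tiling guards every iteration with a boundary
    # check, so the loop body cannot be pattern-matched to vec_add.
    for i in range(0, n, TILE):
        for j in range(TILE):
            if i + j < n:          # conditional branch introduced by tiling
                dst[i + j] = a[i + j] + b[i + j]

def add_partitioned(dst, a, b, n):
    # After partitioning: the interval [0, n) is split so the main region
    # [0, n - n % TILE) is branch-free and maps directly to the tensor
    # instruction; only the short tail is handled separately.
    main = n - n % TILE
    for i in range(0, main, TILE):
        vec_add(dst, a, b, i)      # pattern-matched tensor instruction
    for j in range(main, n):       # tail: scalar fallback (or a zero-padded tile)
        dst[j] = a[j] + b[j]

# Usage check: both versions compute the same result for a non-aligned size.
n = 21
a, b = np.arange(n, dtype=np.float32), np.ones(n, dtype=np.float32)
out1, out2 = np.empty(n, np.float32), np.empty(n, np.float32)
add_tiled_with_branch(out1, a, b, n)
add_partitioned(out2, a, b, n)
assert np.array_equal(out1, out2)
```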

Keywords