IEEE Access (Jan 2024)
How Much Can We Gain From Tensor Kernel Fusion on GPUs?
Abstract
Kernel fusion is a crucial optimization technique for GPU applications, particularly deep neural networks, in which multiple consecutive kernels are combined into a single larger kernel. It aims to improve performance by reducing slow off-chip memory accesses: intermediate results between successive kernels are kept in faster on-chip memory, such as shared memory, instead. This strategy has the potential not only to boost performance but also to reduce energy consumption. GPU kernels typically fall into two categories: tensor operations and element operations. In deep learning, fusing a tensor operation kernel with the element operation kernel that follows it, such as combining convolution with ReLU, is common practice. Fusing two tensor kernels into a single GPU kernel has shown benefits in certain applications, but it is not a straightforward task, and its advantages and limitations remain unclear. This prompts several questions: 1) What advantages does tensor kernel fusion offer on GPGPUs? 2) What limitations does it have, and why is it not widely adopted? 3) In what practical scenarios is tensor kernel fusion beneficial? To address these questions, we conducted both analytical and experimental studies on NVIDIA Tensor Core GPUs, using the CUTLASS kernel library with extensions. Our experiments showed that, for tall and narrow matrix multiplications, a 1D tiling strategy outperforms the commonly used 2D tiling strategy. Comparing tensor kernel fusion against a 1D tiling baseline, we demonstrated significant performance gains from fusion for tall and narrow matrix multiplications. However, we observed that these benefits diminish as the matrices increase in width.
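To make the idea of fusing two tensor kernels concrete, the following is a minimal functional sketch in plain Python, not the CUTLASS-based implementation studied in the paper. It assumes the fused operation is a back-to-back matrix multiplication E = (A × B) × C and that A is tiled along its rows only (1D tiling). The function names `matmul`, `unfused_b2b_gemm`, and `fused_b2b_gemm`, the `tile_rows` parameter, and the example matrices are all hypothetical and chosen for illustration.

```python
def matmul(x, y):
    """Plain triple-loop matrix product of two lists of lists."""
    inner, cols = len(y), len(y[0])
    return [[sum(row[k] * y[k][j] for k in range(inner)) for j in range(cols)]
            for row in x]


def unfused_b2b_gemm(a, b, c):
    """Two separate 'kernels': the full intermediate D = A x B is materialized."""
    d = matmul(a, b)  # stands in for a round trip through off-chip memory
    return matmul(d, c)


def fused_b2b_gemm(a, b, c, tile_rows=2):
    """One 'kernel' with 1D (row-only) tiling of A (illustrative assumption).

    Each row tile yields a complete tile_rows x N slice of D, which the
    second product consumes immediately, so D is never stored in full.
    """
    out = []
    for start in range(0, len(a), tile_rows):
        a_tile = a[start:start + tile_rows]
        d_tile = matmul(a_tile, b)  # stands in for on-chip storage
        out.extend(matmul(d_tile, c))
    return out


a = [[i + j for j in range(3)] for i in range(8)]  # 8 x 3, tall and narrow
b = [[1, 2], [0, 1], [2, 0]]                       # 3 x 2
c = [[1, 0, 1], [2, 1, 0]]                         # 2 x 3
result = fused_b2b_gemm(a, b, c)
```

In this sketch, each tile of D must span the full width of D before the second product can start, because that product reduces over D's columns. The per-tile intermediate therefore grows with the matrix width, whereas a 2D tiling would split D's columns across tiles and require combining partial results.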
Keywords