How Much Can We Gain From Tensor Kernel Fusion on GPUs?

Wei Sun; Ang Li; Sander Stuijk; Henk Corporaal

doi:10.1109/ACCESS.2024.3411473

IEEE Access (Jan 2024)

Affiliations

Wei Sun: Electronic System Group, Eindhoven University of Technology, Eindhoven, The Netherlands
Ang Li: Physical and Computational Sciences Directorate, Pacific Northwest National Laboratory, Richland, WA, USA
Sander Stuijk: Electronic System Group, Eindhoven University of Technology, Eindhoven, The Netherlands
Henk Corporaal: Electronic System Group, Eindhoven University of Technology, Eindhoven, The Netherlands

DOI: https://doi.org/10.1109/ACCESS.2024.3411473
Journal volume & issue: Vol. 12
pp. 126135 – 126144

Abstract

Kernel fusion is a crucial optimization technique for GPU applications, particularly deep neural networks. It combines multiple consecutive kernels into a single larger kernel, which enhances performance by reducing slow off-chip memory accesses: intermediate results between successive kernels are kept in faster on-chip memory, such as shared memory. This strategy has the potential not only to boost performance but also to reduce energy consumption. GPU kernels typically fall into two categories: tensor operations and element operations. In deep learning, fusing a tensor operation kernel with the element operation kernel that follows it, such as combining convolution with ReLU, is common practice. Combining two tensor kernels in a single GPU kernel has shown benefits in certain applications, but it is not a straightforward task, and its advantages and limitations remain unclear. This prompts several questions: 1) What advantages does tensor kernel fusion offer on GPGPUs? 2) What limitations does it have, and why is it not widely adopted? 3) In what practical scenarios is tensor kernel fusion beneficial? To address these questions, we conducted both analytical and experimental studies on NVIDIA Tensor Core GPUs, using the CUTLASS kernel library with extensions. Our experiments showed that, for tall and narrow matrix multiplications, a 1D tiling strategy outperforms the commonly used 2D tiling strategy. By comparing tensor kernel fusion against a 1D tiling baseline, we demonstrated significant performance gains from fusion for tall and narrow matrix multiplications. However, we observed that these benefits diminish as the matrices become wider.
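
As a rough illustration of why keeping the intermediate result on-chip reduces off-chip accesses, the sketch below counts idealized off-chip traffic for two chained matrix multiplications, D = (A x B) x C, with and without fusion. This is not the analytical model from the paper. The function names, the 2-byte (FP16) element size, and the assumption that each operand crosses the off-chip boundary exactly once per kernel are all illustrative simplifications.

```python
def gemm_chain_traffic(m, k, n, p, bytes_per_elem=2):
    """Idealized off-chip traffic in bytes for D = (A @ B) @ C.

    A is m x k, B is k x n, C is n x p. Assumption: every operand is
    read from or written to off-chip memory exactly once per kernel
    (perfect on-chip reuse), which real kernels only approximate.
    """
    inputs = m * k + k * n + n * p   # A, B and C are each read once
    output = m * p                   # D is written once
    intermediate = m * n             # T = A @ B
    # Unfused: kernel 1 writes T off-chip, kernel 2 reads it back.
    unfused = (inputs + output + 2 * intermediate) * bytes_per_elem
    # Fused: T stays in on-chip memory and never goes off-chip.
    fused = (inputs + output) * bytes_per_elem
    return unfused, fused


def arithmetic_intensity(m, k, n, p, traffic_bytes):
    """FLOPs per off-chip byte for the two chained multiplications."""
    flops = 2 * m * k * n + 2 * m * n * p
    return flops / traffic_bytes


if __name__ == "__main__":
    rows = 1 << 20  # hypothetical "tall" dimension
    for width in (16, 64, 256):  # hypothetical "narrow" dimensions
        unfused, fused = gemm_chain_traffic(rows, width, width, width)
        saved = 1 - fused / unfused
        intensity = arithmetic_intensity(rows, width, width, width, unfused)
        print(f"width={width}: traffic saved={saved:.1%}, "
              f"unfused intensity={intensity:.1f} FLOP/byte")
```

Under these assumptions, fusion removes the write and the read-back of the intermediate matrix. The arithmetic intensity of the unfused chain grows with the width, so off-chip traffic is less likely to be the bottleneck for wider matrices. Real kernels are also constrained by on-chip memory capacity and tiling, which this count ignores.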

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
