Applied Sciences (Dec 2023)

Benchmarking GPU Tensor Cores on General Matrix Multiplication Kernels through CUTLASS

  • Xuanteng Huang,
  • Xianwei Zhang,
  • Panfei Yang,
  • Nong Xiao

DOI
https://doi.org/10.3390/app132413022
Journal volume & issue
Vol. 13, no. 24
p. 13022

Abstract


GPUs have been broadly used to accelerate big data analytics, scientific computing and machine intelligence. In particular, matrix multiplication and convolution are two principal operations that account for a large proportion of the computation in modern data analysis and deep neural networks. These performance-critical operations are often offloaded to the GPU to achieve substantial reductions in end-to-end latency. In addition, the diverse workload characteristics and complicated processing phases of big data demand a customizable yet performant operator library. To this end, GPU vendors, including NVIDIA and AMD, have released templated, composable GPU operator libraries that perform specific computations on certain low-precision data types. We formalize a set of benchmarks via CUTLASS, NVIDIA's templated library that provides high-performance, hierarchically designed kernels. The benchmarking results show that, with the necessary fine-tuning, dedicated hardware units such as tensor cores can dramatically boost the performance of specific operations, such as GEMM, offloaded to modern GPUs.
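To illustrate the kind of kernel the abstract refers to, the sketch below shows a minimal tensor-core GEMM instantiated through CUTLASS's device-level template (CUTLASS 2.x style `cutlass::gemm::device::Gemm`). The problem size, scalars, and device pointers are placeholders for illustration only; a real benchmark would allocate and initialize the operands and select tile shapes per architecture.

```cpp
// Minimal sketch: half-precision GEMM on tensor cores via CUTLASS (2.x device API).
// Assumptions: CUTLASS headers are on the include path, the target GPU is SM80-class,
// and d_A/d_B/d_C are device buffers allocated and filled elsewhere.
#include <cutlass/gemm/device/gemm.h>

int run_gemm(cutlass::half_t* d_A, cutlass::half_t* d_B, cutlass::half_t* d_C,
             int M, int N, int K) {
  // C = alpha * A * B + beta * C, all operands column-major,
  // FP16 inputs/outputs with FP32 accumulation on tensor cores.
  using Gemm = cutlass::gemm::device::Gemm<
      cutlass::half_t, cutlass::layout::ColumnMajor,   // A
      cutlass::half_t, cutlass::layout::ColumnMajor,   // B
      cutlass::half_t, cutlass::layout::ColumnMajor,   // C
      float,                                           // accumulator type
      cutlass::arch::OpClassTensorOp,                  // use tensor cores
      cutlass::arch::Sm80>;                            // target architecture

  float alpha = 1.0f, beta = 0.0f;

  Gemm gemm_op;
  cutlass::Status status = gemm_op({
      {M, N, K},     // problem size
      {d_A, M},      // A and its leading dimension
      {d_B, K},      // B and its leading dimension
      {d_C, M},      // C (source operand)
      {d_C, M},      // D (destination, written in place here)
      {alpha, beta}  // epilogue scalars
  });

  return status == cutlass::Status::kSuccess ? 0 : 1;
}
```

Swapping the element types, layouts, or operator class in the template parameters yields the different kernel variants that such a benchmark suite would sweep over.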

Keywords