IEEE Access (Jan 2024)

Exploring Attention Sparsity to Accelerate Transformer Training on GPUs

  • Bokyeong Yoon,
  • Ah-Hyun Lee,
  • Jinsung Kim,
  • Gordon Euhyun Moon

DOI
https://doi.org/10.1109/ACCESS.2024.3425638
Journal volume & issue
Vol. 12
pp. 131373 – 131384

Abstract


The computational complexity of training a Transformer model increases quadratically with the length of the input sequence. Therefore, to accelerate the training of a large-scale Transformer on long sequences, it is crucial to reduce the number of operations in the multi-head attention computation, which dominates the overall Transformer training process. Previous approaches have sought to sparsify the multi-head attention before training by statically selecting the critical elements of the attention score matrix. However, since the critical elements of the attention score matrix can vary across model tasks and datasets, selecting the critical elements dynamically is essential for achieving better model quality. In this paper, we propose a new sparsity-aware Transformer that captures the task- and input-dependent sparsity pattern of the attention score matrix during a small number of steps of standard Transformer training. The identified sparsity pattern is then transferred from the standard training phase and used for sparse training, guided by the degree of skewness and the distance values of the attention score matrices. Experimental results demonstrate that our approach significantly reduces the number of operations in the multi-head attention, achieving up to 2.84× training speedup, 6.87× memory reduction, and better accuracy compared to state-of-the-art sparse Transformer models.
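
A minimal sketch of the two-phase idea outlined in the abstract is given below: a few dense warm-up steps expose which attention-score entries carry most of the mass, a skewness check decides whether a head is worth sparsifying, and the captured pattern is then reused as a fixed mask for the remaining (sparse) training steps. This is plain PyTorch and not the authors' implementation; the helper names, the keep_ratio parameter, and the skewness threshold are illustrative assumptions, and the mask is applied densely here rather than through a true sparse kernel.

    # Illustrative sketch only (assumptions, not the paper's code):
    # 1) run a few dense warm-up steps,
    # 2) keep a head sparse only if its score rows are sufficiently skewed,
    # 3) freeze the captured mask and reuse it for subsequent sparse steps.
    import torch
    import torch.nn.functional as F

    def attention_scores(q, k):
        """Scaled dot-product attention scores: (batch, heads, seq, seq)."""
        d = q.size(-1)
        return (q @ k.transpose(-2, -1)) / d ** 0.5

    def skewness(x, dim=-1):
        """Sample skewness of the score distribution along `dim`."""
        mean = x.mean(dim=dim, keepdim=True)
        std = x.std(dim=dim, keepdim=True) + 1e-6
        return (((x - mean) / std) ** 3).mean(dim=dim)

    def should_sparsify(scores, threshold=1.0):
        """Heuristic gate (assumed threshold): sparsify only if score rows
        concentrate their mass on a few keys, i.e. are strongly skewed."""
        return skewness(scores).mean() > threshold

    def capture_mask(scores, keep_ratio=0.1):
        """Keep only the top-`keep_ratio` fraction of scores per query row."""
        k = max(1, int(scores.size(-1) * keep_ratio))
        topk = scores.topk(k, dim=-1).indices
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask.scatter_(-1, topk, True)
        return mask

    def sparse_attention(q, k, v, mask):
        """Dense emulation of masked attention; a real sparse kernel would
        skip the masked-out computation entirely."""
        scores = attention_scores(q, k)
        scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

In this sketch, the mask captured during warm-up would be frozen per head and passed to sparse_attention for all later steps; the actual selection criteria (skewness and distance values) and the GPU kernels are described in the paper itself.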

Keywords