IEEE Access (Jan 2024)

Exploring Attention Sparsity to Accelerate Transformer Training on GPUs

  • Bokyeong Yoon,
  • Ah-Hyun Lee,
  • Jinsung Kim,
  • Gordon Euhyun Moon

DOI
https://doi.org/10.1109/ACCESS.2024.3425638
Journal volume & issue
Vol. 12
pp. 131373 – 131384

Abstract


The computational complexity of training a Transformer model increases quadratically with the length of the input sequence. Therefore, to accelerate the training of a large-scale Transformer on long sequences, it is crucial to reduce the number of operations in the multi-head attention computation, which dominates the overall Transformer training process. Previous approaches have sought to sparsify the multi-head attention before training by statically selecting the critical elements of the attention score matrix. However, since the critical elements of the attention score matrix can vary across model tasks and datasets, selecting the critical elements dynamically is essential for achieving better model quality. In this paper, we propose a new sparsity-aware Transformer that captures the task- and input-dependent sparsity pattern of the attention score matrix during a small number of steps of standard Transformer training. The identified sparsity pattern is then transferred from the standard training phase and used for sparse training, guided by the degree of skewness and the distance values of the attention score matrices. Experimental results demonstrate that our approach significantly reduces the number of operations in the multi-head attention, achieving up to 2.84× training speedup, 6.87× memory reduction, and better accuracy compared to state-of-the-art sparse Transformer models.
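
A minimal sketch of the two-phase idea outlined in the abstract is given below: a few dense warm-up steps expose which attention-score entries carry most of the mass, a skewness check decides whether a head is worth sparsifying, and the captured pattern is then reused as a fixed mask for the remaining (sparse) training steps. This is plain PyTorch and not the authors' implementation; the helper names, the keep_ratio parameter, and the skewness threshold are illustrative assumptions, and the mask is applied densely here rather than through a true sparse kernel.

    # Illustrative sketch only (assumptions, not the paper's code):
    # 1) run a few dense warm-up steps,
    # 2) keep a head sparse only if its score rows are sufficiently skewed,
    # 3) freeze the captured mask and reuse it for subsequent sparse steps.
    import torch
    import torch.nn.functional as F

    def attention_scores(q, k):
        """Scaled dot-product attention scores: (batch, heads, seq, seq)."""
        d = q.size(-1)
        return (q @ k.transpose(-2, -1)) / d ** 0.5

    def skewness(x, dim=-1):
        """Sample skewness of the score distribution along `dim`."""
        mean = x.mean(dim=dim, keepdim=True)
        std = x.std(dim=dim, keepdim=True) + 1e-6
        return (((x - mean) / std) ** 3).mean(dim=dim)

    def should_sparsify(scores, threshold=1.0):
        """Heuristic gate (assumed threshold): sparsify only if score rows
        concentrate their mass on a few keys, i.e. are strongly skewed."""
        return skewness(scores).mean() > threshold

    def capture_mask(scores, keep_ratio=0.1):
        """Keep only the top-`keep_ratio` fraction of scores per query row."""
        k = max(1, int(scores.size(-1) * keep_ratio))
        topk = scores.topk(k, dim=-1).indices
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask.scatter_(-1, topk, True)
        return mask

    def sparse_attention(q, k, v, mask):
        """Dense emulation of masked attention; a real sparse kernel would
        skip the masked-out computation entirely."""
        scores = attention_scores(q, k)
        scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

In this sketch, the mask captured during warm-up would be frozen per head and passed to sparse_attention for all later steps; the actual selection criteria (skewness and distance values) and the GPU kernels are described in the paper itself.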

Keywords