ETRI Journal (Oct 2024)
PF-GEMV: Utilization maximizing architecture in fast matrix-vector multiplication for GPT-2 inference
Abstract
Owing to the widespread adoption of transformer-based artificial neural networks, artificial intelligence (AI) processors are now required to perform matrix-vector multiplication in addition to conventional matrix-matrix multiplication. However, current AI processor architectures are optimized for general matrix-matrix multiplications (GEMMs), which causes significant throughput degradation when processing general matrix-vector multiplications (GEMVs). In this study, we propose a port-folding GEMV (PF-GEMV) scheme that employs multiformat and low-precision techniques while reusing an outer-product-based processor optimized for conventional GEMM operations. This approach achieves 93.7% utilization in GEMV operations with an 8-bit format on an 8 × 8 processor, yielding a 7.5× increase in throughput over the original scheme. Furthermore, when applied to the matrix operations of the GPT-2 large model, a 7× speedup is achieved in single-batch inference.
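The headline figures are mutually consistent under one assumption not stated in the abstract: that the baseline outer-product scheme keeps only one of the 8 × 8 array's eight input ports busy during GEMV, for roughly 1/8 ≈ 12.5% utilization. A back-of-the-envelope check under that assumption:

\[
\text{speedup} \;\approx\; \frac{\text{PF-GEMV utilization}}{\text{baseline utilization}} \;=\; \frac{0.937}{1/8} \;\approx\; 7.5\times
\]

which matches the reported 7.5× throughput gain.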
Keywords