Jisuanji kexue yu tansuo (Apr 2024)

TEB: Efficient SpMV Storage Format for Matrix Decomposition and Reconstruction on GPU

  • WANG Yuhua, ZHANG Yuqi, HE Junfei, XU Yuezhu, CUI Huanyu

DOI
https://doi.org/10.3778/j.issn.1673-9418.2304039
Journal volume & issue
Vol. 18, no. 4
pp. 1094 – 1108

Abstract

Read online

Sparse matrix-vector multiplication (SpMV) is a crucial computing process in the field of science and engineering. CSR (compressed sparse row) format is one of the most commonly used storage formats for sparse matrix. In the process of implementing parallel SpMV on the graphics processing unit (GPU), it only stores non-zero elements of sparse matrix, avoiding computational redundancy caused by zero element filling, and saving storage space. But there is a problem of load imbalance, which wastes computing resources. To address the aforementioned issues, storage formats with good performance in recent years have been studied, and a row by row decomposition and reorganization storage format—TEB (threshold-exchangeorder block) format has been proposed. The format first uses a heuristic threshold selection algorithm to determine the appropriate segmentation threshold, and combines the row merging algorithm based on reordering to reconstruct and decompose the sparse matrix, so that the number of non-zero elements between blocks is as close as possible. Furthermore, combined with CUDA (computer unified device architecture) thread technology, a parallel SpMV algorithm between sub blocks based on TEB storage format is proposed, which can reasonably allocate computing resources and solve the problem of load imbalance, thus improving the parallel computing efficiency of SpMV. In order to verify the effectiveness of the TEB storage format, experiments are conducted on the NVIDIA Tesla V100 platform. The results show that compared to PBC (partition-block-CSR), AMF-CSR (adaptive multi-row folding of CSR), CSR-Scalar (compressed sparse row-scalar), and CSR5 (compressed sparse row 5) storage formats, TEB can improve SpMV time performance by an average of 3.23×, 5.83×, 2.33×, and 2.21×. In terms of floating-point computing performance, the average improvement can be 3.36×, 5.95×, 2.29×, and 2.13×

Keywords