Load-Balanced Parallel Implementation on GPUs for Multi-Scalar Multiplication Algorithm

Yutian Chen; Cong Peng; Yu Dai; Min Luo; Debiao He

doi:10.46586/tches.v2024.i2.522-544

Transactions on Cryptographic Hardware and Embedded Systems (Mar 2024)

Load-Balanced Parallel Implementation on GPUs for Multi-Scalar Multiplication Algorithm

Yutian Chen,
Cong Peng,
Yu Dai,
Min Luo,
Debiao He

Affiliations

Yutian Chen: School of Cyber Science and Engineering, Wuhan University, Wuhan, China
Cong Peng: School of Cyber Science and Engineering, Wuhan University, Wuhan, China
Yu Dai: School of Cyber Science and Engineering, Wuhan University, Wuhan, China
Min Luo: School of Cyber Science and Engineering, Wuhan University, Wuhan, China
Debiao He: School of Cyber Science and Engineering, Wuhan University, Wuhan, China

DOI: https://doi.org/10.46586/tches.v2024.i2.522-544
Journal volume & issue: Vol. 2024, no. 2

Abstract

Read online

Multi-scalar multiplication (MSM) is an important building block in most of elliptic-curve-based zero-knowledge proof systems, such as Groth16 and PLONK. Recently, Lu et al. proposed cuZK, a new parallel MSM algorithm on GPUs. In this paper, we revisit this scheme and present a new GPU-based implementation to further improve the performance of MSM algorithm. First, we propose a novel method for mapping scalars into Pippenger’s bucket indices, largely reducing the number of buckets compared to the original Pippenger algorithm. Second, in the case that memory is sufficient, we develop a new efficient algorithm based on homogeneous coordinates in the bucket accumulation phase. Moreover, our accumulation phase is load-balanced, which means the parallel speedup ratio is almost linear growth as the number of device threads increases. Finally, we also propose a parallel layered reduction algorithm for the bucket aggregation phase, whose time complexity remains at the logarithmic level of the number of buckets. The implementation results over the BLS12-381 curve on the V100 graphics card show that our proposed algorithm achieves up to 1.998x, 1.821x and 1.818x speedup compared to cuZK at scales of 221, 222, and 223, respectively.

Published in Transactions on Cryptographic Hardware and Embedded Systems

ISSN: 2569-2925 (Online)
Publisher: Ruhr-Universität Bochum
Country of publisher: Germany
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: https://tches.iacr.org

About the journal

Abstract

Keywords