NTT Multiplication for NTT-unfriendly Rings

Chi-Ming Marvin Chung; Vincent Hwang; Matthias J. Kannwischer; Gregor Seiler; Cheng-Jhih Shih; Bo-Yin Yang

doi:10.46586/tches.v2021.i2.159-188

Transactions on Cryptographic Hardware and Embedded Systems (Feb 2021)

Affiliations

Chi-Ming Marvin Chung: Academia Sinica, Taipei, Taiwan; National Taiwan University, Taipei, Taiwan
Vincent Hwang: Academia Sinica, Taipei, Taiwan; National Taiwan University, Taipei, Taiwan
Matthias J. Kannwischer: Max Planck Institute for Security and Privacy, Bochum, Germany
Gregor Seiler: IBM Research – Zurich, Rüschlikon, Switzerland; ETH Zurich, Zurich, Switzerland
Cheng-Jhih Shih: Academia Sinica, Taipei, Taiwan; National Taiwan University, Taipei, Taiwan
Bo-Yin Yang: Academia Sinica, Taipei, Taiwan

DOI: https://doi.org/10.46586/tches.v2021.i2.159-188
Journal volume & issue: Vol. 2021, no. 2

Abstract

In this paper, we show how multiplication in the polynomial rings used by the NIST PQC finalists Saber and NTRU can be implemented efficiently using the number-theoretic transform (NTT). We obtain superior performance compared to the previous state-of-the-art implementations using Toom–Cook multiplication on both of NIST’s primary software optimization targets, AVX2 and Cortex-M4. Interestingly, these two platforms require different approaches: on the Cortex-M4, we use 32-bit NTT-based polynomial multiplication, while on Intel we use two 16-bit NTT-based polynomial multiplications and combine the products using the Chinese Remainder Theorem (CRT). For Saber, the performance gain is particularly pronounced. On Cortex-M4, the Saber NTT-based matrix-vector multiplication is 61% faster than the Toom–Cook multiplication, resulting in 22% fewer cycles for Saber encapsulation. For NTRU, the speed-up is less impressive, but NTT-based multiplication still performs better than Toom–Cook for all parameter sets on Cortex-M4. The NTT-based polynomial multiplication for NTRU-HRSS is 10% faster than Toom–Cook, which results in a 6% cost reduction for encapsulation. On AVX2, we obtain speed-ups for three out of four NTRU parameter sets. As a further illustration, we also include AVX2 and Cortex-M4 code for LAC, the Chinese Association for Cryptologic Research competition award winner (also a NIST round 2 candidate); this code outperforms existing implementations.
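
The two-modulus approach mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the primes 7681 and 12289, the secret bound |b_j| <= 5, and all function names are assumptions chosen for illustration, and the NTT is a plain recursive radix-2 transform with no platform-specific optimization. The idea is to treat the product in Z_q[x]/(x^n + 1), with q = 2^13 a power of two and therefore NTT-unfriendly, as a product over the integers, compute it modulo two NTT-friendly primes, recombine the residues with the CRT, and only then reduce modulo q. This is valid as long as every integer coefficient of the product is smaller in absolute value than half the product of the two primes.

```python
import random

def find_root(order, p):
    """Return an element of exact multiplicative order `order` (a power of two) mod prime p."""
    assert (p - 1) % order == 0
    for g in range(2, p):
        w = pow(g, (p - 1) // order, p)
        if pow(w, order // 2, p) == p - 1:  # w^(order/2) = -1, so the order is exact
            return w
    raise ValueError("no root of the requested order")

def ntt(a, w, p):
    """Recursive radix-2 NTT: evaluate a at w^0, ..., w^(len(a)-1) modulo p."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], w * w % p, p)
    odd = ntt(a[1::2], w * w % p, p)
    out, t = [0] * n, 1
    for i in range(n // 2):
        out[i] = (even[i] + t * odd[i]) % p
        out[i + n // 2] = (even[i] - t * odd[i]) % p
        t = t * w % p
    return out

def mul_mod_prime(a, b, n, p):
    """Product of a and b in Z_p[x]/(x^n + 1), via a zero-padded length-2n NTT."""
    w = find_root(2 * n, p)
    fa = ntt([x % p for x in a] + [0] * n, w, p)
    fb = ntt([x % p for x in b] + [0] * n, w, p)
    prod = ntt([x * y % p for x, y in zip(fa, fb)], pow(w, -1, p), p)
    inv = pow(2 * n, -1, p)
    lin = [x * inv % p for x in prod]  # linear product of a and b modulo p
    return [(lin[i] - lin[i + n]) % p for i in range(n)]  # fold using x^n = -1

def crt_centered(r1, r2, p1, p2):
    """Unique integer in (-p1*p2/2, p1*p2/2) congruent to r1 mod p1 and r2 mod p2."""
    m = p1 * p2
    x = (r1 + p1 * ((r2 - r1) * pow(p1, -1, p2) % p2)) % m
    return x - m if x > m // 2 else x

def mul_unfriendly(a, b, n, q, p1=7681, p2=12289):
    """a*b in Z_q[x]/(x^n + 1); the primes are illustrative assumptions.

    Requires 2n | p_i - 1 and n * (q/2) * max|b_j| < p1*p2/2.
    """
    a_c = [((x + q // 2) % q) - q // 2 for x in a]  # centered representatives of a
    c1 = mul_mod_prime(a_c, b, n, p1)
    c2 = mul_mod_prime(a_c, b, n, p2)
    return [crt_centered(x, y, p1, p2) % q for x, y in zip(c1, c2)]

def schoolbook(a, b, n, q):
    """Reference O(n^2) negacyclic product, used only to check the sketch."""
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] += a[i] * b[j]
            else:
                c[i + j - n] -= a[i] * b[j]
    return [x % q for x in c]

if __name__ == "__main__":
    rng = random.Random(1)
    n, q = 256, 1 << 13  # Saber-like ring dimension and modulus
    a = [rng.randrange(q) for _ in range(n)]
    b = [rng.randint(-5, 5) for _ in range(n)]  # small secret; the bound 5 is an assumption
    assert mul_unfriendly(a, b, n, q) == schoolbook(a, b, n, q)
```

With these parameters the integer product coefficients are bounded by 256 * 4096 * 5 = 5,242,880, well below 7681 * 12289 / 2, so the CRT step recovers them exactly. A single larger NTT-friendly prime that fits in 32 bits would play the same role as the pair of 16-bit primes, which corresponds to the 32-bit approach the abstract describes for the Cortex-M4.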

Published in Transactions on Cryptographic Hardware and Embedded Systems

ISSN: 2569-2925 (Online)
Publisher: Ruhr-Universität Bochum
Country of publisher: Germany
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: https://tches.iacr.org
