Multi-moduli NTTs for Saber on Cortex-M3 and Cortex-M4

Amin Abdulrahman; Jiun-Peng Chen; Yu-Jia Chen; Vincent Hwang; Matthias J. Kannwischer; Bo-Yin Yang

doi:10.46586/tches.v2022.i1.127-151

Transactions on Cryptographic Hardware and Embedded Systems (Nov 2021)

Affiliations

Amin Abdulrahman: Ruhr University Bochum, Bochum, Germany; Max Planck Institute for Security and Privacy, Bochum, Germany
Jiun-Peng Chen: Academia Sinica, Taipei, Taiwan
Yu-Jia Chen: InfoKeyVault Technology (IKV), Taipei, Taiwan
Vincent Hwang: Academia Sinica, Taipei, Taiwan; National Taiwan University, Taipei, Taiwan
Matthias J. Kannwischer: Max Planck Institute for Security and Privacy, Bochum, Germany; Academia Sinica, Taipei, Taiwan
Bo-Yin Yang: Academia Sinica, Taipei, Taiwan

Journal volume & issue: Vol. 2022, no. 1

Abstract

The U.S. National Institute of Standards and Technology (NIST) has designated ARM microcontrollers as an important benchmarking platform for its Post-Quantum Cryptography standardization process (NISTPQC). In view of this, we explore the design space of the NISTPQC finalist Saber on the Cortex-M4 and its close relative, the Cortex-M3. In the process, we investigate various optimization strategies and memory-time tradeoffs for number-theoretic transforms (NTTs). Recent work [Chung et al., TCHES 2021 (2)] has shown that NTT multiplication is faster than Toom–Cook multiplication for unprotected Saber implementations on the Cortex-M4. However, it remained unclear whether NTT multiplication can outperform Toom–Cook in masked implementations of Saber, and whether Saber with NTTs can outperform Toom–Cook in terms of stack usage. We answer both questions in the affirmative. We also present a Cortex-M3 implementation of Saber using NTTs that outperforms an existing Toom–Cook implementation. Our stack-optimized unprotected M4 implementation uses around the same amount of stack as the most stack-optimized Toom–Cook implementation while being 33% to 41% faster. Our speed-optimized masked M4 implementation is 16% faster than the fastest masked implementation using Toom–Cook. For the Cortex-M3, we outperform existing implementations by 29% to 35% in speed. We conclude that, for both stack and speed optimization, polynomial multiplication in Saber on the Cortex-M4 and Cortex-M3 should be based on the NTT rather than on Toom–Cook. In particular, multi-moduli NTTs perform best in many cases.

Published in Transactions on Cryptographic Hardware and Embedded Systems

ISSN: 2569-2925 (Online)
Publisher: Ruhr-Universität Bochum
Country of publisher: Germany
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: https://tches.iacr.org
