Scientific Reports (May 2025)

Speed up integer-arithmetic-only inference via bit-shifting

  • Mingjun Song,
  • Yiming Zhou,
  • Mengmeng Song,
  • Sujie Liu,
  • Shungen Xiao,
  • Youshun Zheng

DOI
https://doi.org/10.1038/s41598-025-02544-4
Journal volume & issue
Vol. 15, no. 1
pp. 1 – 12

Abstract

Quantization is a widely adopted technique in model deployment, as it offers a favorable trade-off between computational overhead and performance loss. Integer-arithmetic-only quantization is an important approach within quantization and holds great significance for hardware deployment under limited resources. However, existing methods, represented by IAO, face challenges in balancing hardware efficiency and accuracy: they suffer from significant accuracy loss, and the multipliers in element-wise layers are not constrained to be powers of 2. These issues limit the utilization of hardware resources. In this paper, we explore integer-arithmetic-only quantization and introduce two techniques: re-quantization and re-scale. Our approach ensures that inference in an 8-bit quantized network involves only 8-bit multiply-accumulate operations and bit-shifting. We compare our method with previous integer-arithmetic-only approaches and demonstrate that it not only accelerates inference and reduces resource consumption but also incurs minimal performance degradation. For the commonly used ResNet-50, our int8 model exhibits only a 0.5% drop in Top-1 accuracy, significantly outperforming previous integer-arithmetic-only methods. Moreover, the proposed method frees up digital signal processing resources and improves parallelism, achieving an improvement of approximately 27% in frames per second and reducing inference time by 27.09 ms (19%). This showcases the practical value and effectiveness of the proposed method in improving the overall efficiency of model inference.
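To illustrate the core idea the abstract describes, the sketch below shows generic power-of-2 requantization: an int32 accumulator (from int8 × int8 multiply-accumulates) is rescaled to int8 by an arithmetic right shift, i.e. a multiplier constrained to 2^(-shift), followed by saturation. This is a minimal illustration of the general technique, not the paper's exact re-quantization/re-scale algorithm; the function name and values are hypothetical.

```python
import numpy as np

def requantize_shift(acc32, shift):
    """Rescale an int32 accumulator to int8 with a power-of-2
    multiplier 2**(-shift): round-to-nearest via a bit-shift,
    then saturate to [-128, 127]. Generic sketch only; assumes
    shift >= 1."""
    acc = np.asarray(acc32, dtype=np.int64)
    # Round to nearest by adding half the divisor before the
    # arithmetic right shift (works for negative values too).
    rounded = (acc + (1 << (shift - 1))) >> shift
    return np.clip(rounded, -128, 127).astype(np.int8)

# Example: rescale an accumulator by 2**-8 (i.e. divide by 256).
print(requantize_shift(np.int64(23500), 8))   # 23500/256 ≈ 91.8 → 92
print(requantize_shift(np.int64(10**9), 8))   # saturates to 127
```

Because the multiplier is a power of 2, this rescaling needs no hardware multiplier at all, which is what frees DSP resources and improves parallelism in the deployment scenario the abstract reports.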