Scientific Reports (May 2025)

Speed up integer-arithmetic-only inference via bit-shifting

  • Mingjun Song,
  • Yiming Zhou,
  • Mengmeng Song,
  • Sujie Liu,
  • Shungen Xiao,
  • Youshun Zheng

DOI
https://doi.org/10.1038/s41598-025-02544-4
Journal volume & issue
Vol. 15, no. 1
pp. 1 – 12

Abstract

Quantization is a widely adopted technique in model deployment, as it offers a favorable trade-off between computational overhead and performance loss. Integer-arithmetic-only quantization is an important approach within quantization and holds great significance for hardware deployment under limited resources. However, existing methods, represented by IAO, face challenges in balancing hardware efficiency and accuracy: they suffer from significant accuracy loss, and the multipliers in element-wise layers are not constrained to be powers of 2. These issues limit the utilization of hardware resources. In this paper, we explore integer-arithmetic-only quantization and introduce two techniques: re-quantization and re-scale. Our approach ensures that inference in an 8-bit quantized network involves only 8-bit multiply-accumulate operations and bit-shifting. We compare our method with previous integer-arithmetic-only approaches and demonstrate that it not only accelerates inference and reduces resource consumption but also incurs minimal performance degradation. For the commonly used ResNet-50, our int8 model exhibits only a 0.5% drop in Top-1 accuracy, significantly outperforming previous integer-arithmetic-only methods. Moreover, the proposed method frees up digital signal processing resources and improves parallelism, achieving an improvement of approximately 27% in frames per second and reducing inference time by 27.09 ms (19%). This showcases the practical value and effectiveness of the proposed method in improving the overall efficiency of model inference.
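To illustrate the core idea the abstract describes, the sketch below shows generic power-of-2 requantization: an int32 accumulator (from int8 × int8 multiply-accumulates) is rescaled to int8 by an arithmetic right shift, i.e. a multiplier constrained to 2^(-shift), followed by saturation. This is a minimal illustration of the general technique, not the paper's exact re-quantization/re-scale algorithm; the function name and values are hypothetical.

```python
import numpy as np

def requantize_shift(acc32, shift):
    """Rescale an int32 accumulator to int8 with a power-of-2
    multiplier 2**(-shift): round-to-nearest via a bit-shift,
    then saturate to [-128, 127]. Generic sketch only; assumes
    shift >= 1."""
    acc = np.asarray(acc32, dtype=np.int64)
    # Round to nearest by adding half the divisor before the
    # arithmetic right shift (works for negative values too).
    rounded = (acc + (1 << (shift - 1))) >> shift
    return np.clip(rounded, -128, 127).astype(np.int8)

# Example: rescale an accumulator by 2**-8 (i.e. divide by 256).
print(requantize_shift(np.int64(23500), 8))   # 23500/256 ≈ 91.8 → 92
print(requantize_shift(np.int64(10**9), 8))   # saturates to 127
```

Because the multiplier is a power of 2, this rescaling needs no hardware multiplier at all, which is what frees DSP resources and improves parallelism in the deployment scenario the abstract reports.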