Arithmetic Coding-Based 5-Bit Weight Encoding and Hardware Decoder for CNN Inference in Edge Devices

Jong Hun Lee; Joonho Kong; Arslan Munir

doi:10.1109/ACCESS.2021.3136888

IEEE Access (Jan 2021)

Arithmetic Coding-Based 5-Bit Weight Encoding and Hardware Decoder for CNN Inference in Edge Devices

Jong Hun Lee,
Joonho Kong,
Arslan Munir

Affiliations

Jong Hun Lee: School of Electronic and Electrical Engineering, Kyungpook National University, Daegu, South Korea
Joonho Kong: ORCiD; School of Electronic and Electrical Engineering, Kyungpook National University, Daegu, South Korea
Arslan Munir: ORCiD; Department of Computer Science, Kansas State University, Manhattan, KS, USA

DOI: https://doi.org/10.1109/ACCESS.2021.3136888
Journal volume & issue: Vol. 9
pp. 166736 – 166749

Abstract

Read online

Convolutional neural networks (CNNs) have gained a huge attention for real-world artificial intelligence (AI) applications such as image classification and object detection. On the other hand, for better accuracy, the size of the CNNs’ parameters (weights) has been increasing, which in turn makes it difficult to enable on-device CNN inferences in resource-constrained edge devices. Though weight pruning and 5-bit quantization methods have shown promising results, it is still challenging to deploy large CNN models in edge devices. In this paper, we propose an encoding and hardware-based decoding technique which can be applied to 5-bit quantized weight data for on-device CNN inferences in resource-constrained edge devices. Given 5-bit quantized weight data, we employ arithmetic coding with range scaling for lossless weight compression, which is performed offline. When executing on-device inferences with underlying CNN accelerators, our hardware decoder enables a fast in-situ weight decompression with small latency overhead. According to our evaluation results with five widely used CNN models, our arithmetic coding-based encoding method applied to 5-bit quantized weights shows a better compression ratio by $9.6\times $ while also reducing the memory data transfer energy consumption by 89.2%, on average, as compared to the case of uncompressed 32-bit floating-point weights. When applying our technique to pruned weights, we obtain better compression ratios by $57.5\times $ – $112.2\times $ while reducing energy consumption by 98.3%–99.1% as compared to the case of 32-bit floating-point weights. In addition, by pipelining the weight decoding and transfer with the CNN execution, the latency overhead of our weight decoding with 16 decoding unit (DU) hardware is only 0.16%–5.48% and 0.16%–0.91% for non-pruned and pruned weights, respectively. Moreover, our proposed technique with 4-DU decoder hardware reduces system-level energy consumption by 1.1%–9.3%.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639

About the journal

Abstract

Keywords