A Deep Learning Accelerator Based on a Streaming Architecture for Binary Neural Networks

Quang Hieu Vo; Ngoc Linh Le; Faaiz Asim; Lok-Won Kim; Choong Seon Hong

doi:10.1109/ACCESS.2022.3151916

IEEE Access (Jan 2022)

Affiliations

Quang Hieu Vo: Department of Computer Science and Engineering, Kyung Hee University, Yongin, Republic of Korea
Ngoc Linh Le: Department of Computer Science and Engineering, Kyung Hee University, Yongin, Republic of Korea
Faaiz Asim: Department of Computer Science and Engineering, Kyung Hee University, Yongin, Republic of Korea
Lok-Won Kim: Department of Computer Science and Engineering, Kyung Hee University, Yongin, Republic of Korea
Choong Seon Hong: Department of Computer Science and Engineering, Kyung Hee University, Yongin, Republic of Korea

DOI: https://doi.org/10.1109/ACCESS.2022.3151916
Journal volume & issue: Vol. 10, pp. 21141–21159

Abstract

Deep neural networks (DNNs) play an increasingly important role in areas such as computer vision and voice recognition. While training and validation have gradually become feasible on high-end general-purpose processors such as graphics processing units (GPUs), high-throughput inference on embedded hardware platforms with limited hardware resources and tight power budgets remains challenging. Binarized neural networks (BNNs) are emerging as a promising way to overcome these challenges by reducing the bit widths of DNN data representations, for which many optimized solutions have already been proposed. However, accuracy degradation relative to the same architecture at full precision is a considerable problem for BNNs, and binary neural networks still contain significant redundancy that can be optimized. To address these limitations, this paper implements a streaming accelerator architecture with three optimization techniques: pipelining-unrolling for streaming each layer, weight reuse for parallel computation, and MAC (multiplication-accumulation) compression. The method first constructs the streaming architecture by pipelining and unrolling to maximize throughput. Next, weight reuse based on K-means clustering is applied to reduce the complexity of the popcount operation. Finally, MAC compression reduces the hardware resources used for the remaining MAC computations. The hardware accelerator, implemented on a state-of-the-art field-programmable gate array (FPGA), achieves a maximum classification performance of 1531K frames per second with 98.4% accuracy on the MNIST dataset and 205K frames per second with 80.2% accuracy on the CIFAR-10 dataset. In addition, the proposed design's FPS/LUTs ratio is approximately 57 (MNIST) and 0.707 (CIFAR-10), which is much lower than the state-of-the-art design with a comparable throughput and inference accuracy.
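The abstract refers to the popcount operation and to MAC computations in a binarized network. As general background, the following is a minimal Python sketch of the standard XNOR-popcount identity used in BNNs: when activations and weights are restricted to +1/-1 and packed into bit words, a dot product of length n equals 2 * popcount(XNOR(a, w)) - n. This is only an illustration of that well-known identity and not the authors' hardware design; the function names `pack_bits`, `binary_dot`, and `reference_dot`, and the bit encoding (1 for +1, 0 for -1), are assumptions made for this example.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an integer (assumed encoding: bit 1 = +1, bit 0 = -1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_word, w_word, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1                 # keep only the n valid bit positions
    xnor = ~(a_word ^ w_word) & mask    # bit is 1 where activation and weight agree
    matches = bin(xnor).count("1")      # popcount: number of agreeing positions
    # Each agreeing position contributes +1 and each disagreeing one -1.
    return 2 * matches - n


def reference_dot(a, w):
    """Plain multiply-accumulate over +1/-1 values, used only to check the result."""
    return sum(x * y for x, y in zip(a, w))


# Hypothetical example vectors, chosen only for illustration.
activations = [1, -1, -1, 1, 1, -1, 1, 1]
weights = [1, 1, -1, -1, 1, -1, -1, 1]

fast = binary_dot(pack_bits(activations), pack_bits(weights), len(weights))
slow = reference_dot(activations, weights)
print(fast, slow)
```

Because the multiply step collapses to a bitwise XNOR, the cost of a binarized MAC is dominated by the popcount, which is the operation the abstract says the weight-reuse and MAC-compression techniques target.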

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
