IEEE Access (Jan 2022)

A Deep Learning Accelerator Based on a Streaming Architecture for Binary Neural Networks

  • Quang Hieu Vo,
  • Ngoc Linh Le,
  • Faaiz Asim,
  • Lok-Won Kim,
  • Choong Seon Hong

DOI
https://doi.org/10.1109/ACCESS.2022.3151916
Journal volume & issue
Vol. 10
pp. 21141 – 21159

Abstract

Deep neural networks (DNNs) play an increasingly important role in areas such as computer vision and voice recognition. While training and validation have gradually become feasible on high-end general-purpose processors such as graphics processing units (GPUs), high-throughput inference on embedded hardware platforms with low hardware resource usage and high power efficiency remains challenging. Binarized neural networks (BNNs) are emerging as a promising way to overcome these challenges by reducing the bit widths of DNN data representations, and many optimized solutions have been proposed previously. However, accuracy degradation relative to the same architecture at full precision is a considerable problem for BNNs, and binary neural networks still contain significant redundancy that can be optimized. In this paper, to address these limitations, we implement a streaming accelerator architecture with three optimization techniques: pipelining-unrolling for streaming each layer, weight reuse for parallel computation, and MAC (multiplication-accumulation) compression. Our method first constructs the streaming architecture with the pipelining-unrolling method to maximize throughput. Next, the weight reuse method with K-means clustering is applied to reduce the complexity of the popcount operation. Finally, MAC compression reduces the hardware resources used for the remaining computation in MAC operations. The implemented hardware accelerator, integrated into a state-of-the-art field-programmable gate array (FPGA), provides a maximum classification performance of 1531K frames per second with 98.4% accuracy for the MNIST dataset and 205K frames per second with 80.2% accuracy for the CIFAR-10 dataset. In addition, the proposed design’s FPS/LUT ratio is approximately 57 (MNIST) and 0.707 (CIFAR-10), which is much lower than the state-of-the-art design with a comparable throughput and inference accuracy.
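
For readers unfamiliar with the popcount operation mentioned above, the following minimal sketch shows the standard BNN formulation in which a dot product of ±1 vectors is computed with XNOR and a bit count. It illustrates the general technique only, not the accelerator described in the paper; the function names `pack_bits` and `binary_dot` and the example vectors are assumptions made for illustration.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int (bit i is 1 when values[i] == +1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_word, w_word, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    # XNOR marks the positions where activation and weight agree (product = +1).
    agree = ~(a_word ^ w_word) & mask
    matches = bin(agree).count("1")  # popcount
    # Each agreement contributes +1 and each disagreement -1.
    return 2 * matches - n


# Hypothetical example data, chosen only to exercise the functions.
activations = [1, -1, 1, 1, -1, -1, 1, -1]
weights = [1, 1, -1, 1, -1, 1, 1, -1]

reference = sum(a * w for a, w in zip(activations, weights))
result = binary_dot(pack_bits(activations), pack_bits(weights), len(activations))
```

Here `result` equals the ordinary multiply-accumulate `reference`, which is why reducing the cost of popcount translates directly into cheaper MAC hardware for binarized layers.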

Keywords