IEEE Access (Jan 2024)

Better Scalability: Improvement of Block-Based CNN Accelerator for FPGAs

  • Yan Chen,
  • Kiyofumi Tanaka

DOI
https://doi.org/10.1109/ACCESS.2024.3514325
Journal volume & issue
Vol. 12
pp. 187587 – 187603

Abstract

As Convolutional Neural Networks (CNNs) have become widely used, numerous accelerators have been designed for them, falling mainly into two architectures: the Overlay architecture, with a single Processing Element (PE) array, and the Dataflow architecture, with one PE array per layer. Overlay accelerators demand a large amount of off-chip memory bandwidth, whereas Dataflow accelerators demand a large amount of on-chip memory capacity. We designed a hybrid architecture that exploits a characteristic of modern CNN models, their composition from repetitive blocks, to combine the advantages of both architectures while avoiding their drawbacks. It has been shown to achieve extremely high throughput while requiring less than 8% of the bandwidth that Overlay accelerators need to run MobileNetV2, and, unlike Dataflow accelerators, it does not require significant on-chip memory. A comparison shows that its area efficiency far surpasses that of existing works. However, its scalability remained suboptimal, and this study addresses that issue. The improved accelerator demonstrated consistent efficiency across the capacity range of existing devices and was successfully implemented on a compact 7Z007S. When deployed on a large-scale VU13P, it achieved a throughput exceeding 10,000 frames per second running MobileNetV2.

Keywords