IEEE Access (Jan 2020)
An Energy-Efficient Implementation of Group Pruned CNNs on FPGA
Abstract
In recent years, convolutional neural network (CNN)-based artificial intelligence algorithms have been widely applied to object recognition and image classification tasks. However, the high performance of CNNs comes at the cost of computationally intensive workloads and enormous numbers of parameters, which pose substantial challenges for deployment on terminal devices. An end-to-end FPGA-based accelerator is proposed in this work that efficiently processes fine-grained pruned CNNs. A group pruning algorithm with group sparse regularization (GSR) is introduced to resolve the internal buffer misalignment and load imbalance that the accelerator suffers after fine-grained pruning. A mathematical model of accelerator memory access and transmission is established to explore the optimal design scale and calculation mode. The accelerator is further optimized by designing sparse processing elements and by scheduling the on- and off-chip buffers. The proposed approach reduces the computation of a state-of-the-art large-scale CNN, VGG16, by 86.9% with an accuracy loss on CIFAR-10 of only 0.48%. The accelerator achieves 188.41 GOPS at 100 MHz and consumes 8.15 W when implemented on a Xilinx VC707, making it more energy-efficient than previous approaches.
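To make the group pruning idea concrete, the following is a minimal sketch of how a group-sparse (group-LASSO-style) penalty and group-wise pruning could look in NumPy. The group layout, function names, and threshold rule here are illustrative assumptions, not the paper's exact formulation; the key point is that the penalty sums L2 norms over fixed-size weight groups, driving whole groups toward zero so they can be pruned together and kept aligned for the accelerator's buffers.

```python
import numpy as np

def group_sparse_penalty(weights, group_size):
    """Group-LASSO-style penalty: sum of L2 norms over contiguous
    groups of `group_size` weights (group layout is an assumption)."""
    flat = weights.reshape(-1)
    # Zero-pad so the length divides evenly into groups (assumption).
    pad = (-len(flat)) % group_size
    flat = np.concatenate([flat, np.zeros(pad)])
    groups = flat.reshape(-1, group_size)
    return np.sqrt((groups ** 2).sum(axis=1)).sum()

def prune_groups(weights, group_size, threshold):
    """Zero out every group whose L2 norm falls below `threshold`,
    so entire groups are removed together (hypothetical rule)."""
    flat = weights.reshape(-1).copy()
    pad = (-len(flat)) % group_size
    padded = np.concatenate([flat, np.zeros(pad)])
    groups = padded.reshape(-1, group_size)
    norms = np.sqrt((groups ** 2).sum(axis=1))
    groups[norms < threshold] = 0.0
    return groups.reshape(-1)[:weights.size].reshape(weights.shape)
```

In training, the penalty would be added to the task loss so small groups shrink toward zero; after training, `prune_groups` removes them wholesale, which keeps the surviving nonzeros aligned at group granularity rather than scattered as in fully fine-grained pruning.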
Keywords