Array (Dec 2021)
Sparse and dense matrix multiplication hardware for heterogeneous multi-precision neural networks
Abstract
In this paper, we present hardware accelerators for sparse and dense matrix multiplication, created with high-level synthesis techniques. The cores can operate at different precisions and are designed to be integrated into a heterogeneous CPU-FPGA system for Edge AI applications. The methodology involves quantization- and sparsity-aware training and is applied to a case study on human activity classification. We first investigate the effects of quantization and sparsity on the accuracy of neural networks with convolution, dense and recurrent layers, and observe better tolerance to pruning when recurrent layers are present. We then propose hardware accelerators that can switch precision at run-time and work with any matrix size up to a maximum configured at compile time. We compare the performance of these accelerators at different precision and sparsity levels and create a performance model to enable workload balancing. The results show that the proposed sparse matrix multipliers can outperform dense multipliers when sparsity levels are higher than 70%, and that the improvements are more evident when higher-precision arithmetic or structural pruning is used. Additionally, sparsity levels as high as 99% can maintain the level of accuracy required in the network, especially when recurrent layers are deployed. Overall, the balance between sparse and dense performance depends on matrix shape, precision, structural pruning and sparsity levels, and performance modelling can be used to balance concurrent execution in a heterogeneous configuration.
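As a software illustration of why sparsity can reduce the work of a matrix multiplication, the sketch below compares a dense multiply with one that stores the left-hand matrix in compressed sparse row (CSR) form and counts the multiply-accumulate (MAC) operations each performs. It is a minimal sketch under stated assumptions, not the accelerator design described here: the CSR layout, the function names (`to_csr`, `dense_matmul`, `sparse_matmul`) and the small integer matrices are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Matrix = List[List[int]]
CSR = Tuple[List[int], List[int], List[int]]  # (values, col_idx, row_ptr)


def to_csr(a: Matrix) -> CSR:
    """Convert a dense matrix to CSR, keeping only the non-zero entries."""
    values: List[int] = []
    col_idx: List[int] = []
    row_ptr: List[int] = [0]
    for row in a:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # where the next row starts
    return values, col_idx, row_ptr


def dense_matmul(a: Matrix, b: Matrix) -> Tuple[Matrix, int]:
    """Dense C = A x B; every element of A costs one MAC per column of B."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    macs = 0
    for i in range(n):
        for p in range(k):
            for j in range(m):
                c[i][j] += a[i][p] * b[p][j]
                macs += 1
    return c, macs


def sparse_matmul(a_csr: CSR, b: Matrix) -> Tuple[Matrix, int]:
    """Sparse C = A x B with A in CSR; only non-zeros of A cost MACs."""
    values, col_idx, row_ptr = a_csr
    n, m = len(row_ptr) - 1, len(b[0])
    c = [[0] * m for _ in range(n)]
    macs = 0
    for i in range(n):
        for t in range(row_ptr[i], row_ptr[i + 1]):
            v, p = values[t], col_idx[t]
            for j in range(m):
                c[i][j] += v * b[p][j]
                macs += 1
    return c, macs


# Hypothetical example: A has 4 non-zeros out of 16 entries (75% sparsity).
A = [[0, 0, 3, 0],
     [0, 0, 0, 0],
     [1, 0, 0, 2],
     [0, 0, 0, -1]]
B = [[1, 2], [3, 4], [5, 6], [7, 8]]

dense_c, dense_macs = dense_matmul(A, B)
sparse_c, sparse_macs = sparse_matmul(to_csr(A), B)
```

For this example both routines produce the same product, with 32 MACs for the dense version and 8 for the sparse one. The MAC count alone does not determine speed, since the sparse path also has to read indices and access the right-hand matrix irregularly; this is in line with the reported result that the sparse multipliers only outperform the dense ones above roughly 70% sparsity.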