High-Level Design of Precision-Scalable DNN Accelerators Based on Sum-Together Multipliers

Luca Urbinati; Mario R. Casu

doi:10.1109/access.2024.3380472

IEEE Access (Jan 2024)

Affiliations

Luca Urbinati: ORCiD; Department of Electronics and Telecommunications, Politecnico di Torino, Turin, Italy
Mario R. Casu: ORCiD; Department of Electronics and Telecommunications, Politecnico di Torino, Turin, Italy

DOI: https://doi.org/10.1109/access.2024.3380472
Journal volume & issue: Vol. 12
pp. 44163 – 44189

Abstract

Precision-scalable (PS) multipliers are gaining traction in Deep Neural Network accelerators, particularly for enabling mixed-precision (MP) quantization in Deep Learning at the edge. This paper focuses on the Sum-Together (ST) class of PS multipliers, which are subword-parallel multipliers that can execute either a standard multiplication at full precision or a dot-product with parallel low-precision operands. Our contributions in this area encompass multiple aspects: we enrich our previous comparison of state-of-the-art (SoA) ST multipliers by including our recent radix-4 Booth ST multiplier and two novel designs; we extend the explanation of the architecture and the design flow of our previously proposed ST-based PS hardware accelerators for 2D-Convolution, Depth-wise Convolution, and Fully-Connected layers, which we developed using High-Level Synthesis (HLS); we implement the uniform integer quantization equations in hardware; we conduct a broad HLS-driven design space exploration of our ST-based accelerators, varying numerous hardware parameters; finally, we showcase the advantages of ST-based accelerators when integrated into System-on-Chips (SoCs) in three different scenarios (low-area, low-power, and low-latency), running inference on MP-quantized MLPerf Tiny models as a case study. Across the three scenarios, the results show an average latency speedup of 1.46x, 1.33x, and 1.29x, reduced energy consumption in most cases, and a marginal area overhead of 0.9%, 2.5%, and 8.0%, compared to SoCs with accelerators based on fixed-precision 16-bit multipliers. To sum up, our work provides a comprehensive understanding of ST-based accelerators' performance in an SoC context, paving the way for future enhancements and the resolution of the identified inefficiencies.
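
The Sum-Together behavior described in the abstract can be illustrated with a small behavioral model. The following Python sketch is not the authors' hardware design. It assumes a 16-bit datapath, signed two's-complement sub-words, and sub-word widths of 16, 8, or 4 bits. The function names `pack`, `split_subwords`, and `st_multiply` are hypothetical and chosen only for this illustration.

```python
def pack(values, sub_bits):
    """Pack signed integers into one word, least significant sub-word first."""
    mask = (1 << sub_bits) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * sub_bits)
    return word


def split_subwords(word, total_bits, sub_bits):
    """Split a packed word into signed sub-words (two's complement)."""
    mask = (1 << sub_bits) - 1
    subs = []
    for i in range(total_bits // sub_bits):
        raw = (word >> (i * sub_bits)) & mask
        if raw >= 1 << (sub_bits - 1):  # sign bit set: value is negative
            raw -= 1 << sub_bits
        subs.append(raw)
    return subs


def st_multiply(a, b, total_bits=16, sub_bits=16):
    """Behavioral model of a Sum-Together multiplication (assumed widths).

    With sub_bits == total_bits this is an ordinary signed multiplication.
    With a narrower sub_bits it returns the dot-product of the sub-words
    packed in a and b.
    """
    if total_bits % sub_bits != 0:
        raise ValueError("sub_bits must divide total_bits")
    xs = split_subwords(a, total_bits, sub_bits)
    ys = split_subwords(b, total_bits, sub_bits)
    return sum(x * y for x, y in zip(xs, ys))


# Full precision: one 16-bit x 16-bit product.
full = st_multiply(pack([-7], 16), pack([9], 16), sub_bits=16)

# Low precision: two 8-bit products summed together, 3*4 + (-2)*5.
dot8 = st_multiply(pack([3, -2], 8), pack([4, 5], 8), sub_bits=8)
```

In this model a single call yields one full-precision product or several low-precision products already summed, which is the property that lets an ST-based accelerator process more multiply-accumulate operations per cycle on low-precision layers.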

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
