QuantMAC: Enhancing Hardware Performance in DNNs With Quantize Enabled Multiply-Accumulate Unit

Neha Ashar; Gopal Raut; Vasundhara Trivedi; Santosh Kumar Vishvakarma; Akash Kumar

doi:10.1109/ACCESS.2024.3379906

IEEE Access (Jan 2024)

Affiliations

Neha Ashar: Department of Electrical Engineering, Indian Institute of Technology Indore, Indore, India
Gopal Raut: Department of Electrical Engineering, Indian Institute of Technology Indore, Indore, India
Vasundhara Trivedi: Department of Electrical Engineering, Indian Institute of Technology Indore, Indore, India
Santosh Kumar Vishvakarma: Department of Electrical Engineering, Indian Institute of Technology Indore, Indore, India
Akash Kumar: Chair for Processor Design, Center for Advancing Electronics Dresden, Technische Universität Dresden, Dresden, Germany

DOI: https://doi.org/10.1109/ACCESS.2024.3379906
Journal volume & issue: Vol. 12, pp. 43600–43614

Abstract

In response to the escalating demand for hardware-efficient Deep Neural Network (DNN) architectures, we present a novel quantize-enabled multiply-accumulate (MAC) unit. Our methodology employs right shift-and-add computation for the MAC operation, enabling runtime truncation without additional hardware. The architecture uses hardware resources efficiently, improving throughput while reducing computational complexity through bit truncation. The underlying MAC algorithm supports both iterative and pipelined implementations, so an accelerator can favor either hardware efficiency or throughput. Additionally, we introduce a processing element (PE) with a bias pre-loading scheme that saves one clock cycle of delay and eliminates the extra resources conventionally required in a PE implementation. The PE performs quantization-based MAC calculations through an efficient bit-truncation method, with no extra hardware logic. This versatile PE accommodates variable bit precision with a dynamic fraction part within the sfxpt<N,f> representation, meeting specific model or layer demands. In software emulation, the proposed approach shows minimal accuracy loss: under 1.6% for LeNet-5 on MNIST and around 4% for ResNet-18 and VGG-16 on CIFAR-10 in the sfxpt<8,5> format, compared to conventional float32-based implementations. Hardware results on the Xilinx Virtex-7 board show a 37% reduction in area utilization and a 45% reduction in power consumption compared to the best state-of-the-art MAC architecture. Extending the proposed MAC to a LeNet DNN model yields a 42% reduction in resource requirements and a 27% reduction in delay. This architecture provides notable advantages for resource-efficient, high-throughput edge-AI applications.
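To make the arithmetic concrete, the following Python sketch emulates a signed fixed-point format with N total bits and f fraction bits, together with a generic right shift-and-add multiply whose low-order bits are dropped as the accumulator shifts. It is an illustration under stated assumptions, not the authors' hardware design or their exact algorithm: the function names, the sign-magnitude handling of the multiplier, the unbounded accumulator, and the reading of the bias pre-loading scheme as an accumulator initial value are all hypothetical choices made here.

```python
def quantize(x, n_bits=8, frac_bits=5):
    """Map a real value to a signed fixed-point integer code (assumed rounding + saturation)."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    return max(lo, min(hi, int(round(x * scale))))


def dequantize(q, frac_bits=5):
    """Convert an integer code back to the real value it represents."""
    return q / (1 << frac_bits)


def shift_add_mul(a, b, n_bits=8, frac_bits=5):
    """Multiply two codes with right shift-and-add; the result stays in the same format.

    Fraction bits of the multiplier are consumed LSB first: add the
    multiplicand when the bit is set, then shift right by one. Bits shifted
    out are discarded, so truncation needs no separate step.
    """
    sign = -1 if b < 0 else 1
    mag = abs(b)  # assumption: the multiplier is handled as sign + magnitude
    acc = 0
    for i in range(frac_bits):
        if (mag >> i) & 1:
            acc += a
        acc >>= 1  # arithmetic shift; the dropped bit is the truncation
    for i in range(frac_bits, n_bits):
        if (mag >> i) & 1:
            acc += a << (i - frac_bits)  # integer bits need no truncation
    return sign * acc


def mac(xs, ws, bias=0, n_bits=8, frac_bits=5):
    """Accumulate products, starting from a pre-loaded bias (accumulator is not saturated here)."""
    acc = bias
    for x, w in zip(xs, ws):
        acc += shift_add_mul(x, w, n_bits, frac_bits)
    return acc


xs = [quantize(v) for v in (0.5, -1.25, 2.0)]
ws = [quantize(v) for v in (0.75, 0.5, -0.375)]
result = mac(xs, ws, bias=quantize(0.25))
print(dequantize(result))  # -0.75
```

In this sketch each product differs from the exact value by less than one least-significant bit of the chosen format, which is the cost of discarding the shifted-out bits rather than keeping a double-width product.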

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
