A 64-TOPS Energy-Efficient Tensor Accelerator in 14nm With Reconfigurable Fetch Network and Processing Fusion for Maximal Data Reuse

Sang Min Lee; Hanjoon Kim; Jeseung Yeon; Juyun Lee; Younggeun Choi; Minho Kim; Changjae Park; Kiseok Jang; Youngsik Kim; Yongseung Kim; Changman Lee; Hyuck Han; Won Eung Kim; Rui Tang; Joon Ho Baek

doi:10.1109/OJSSCS.2022.3216798

IEEE Open Journal of the Solid-State Circuits Society (Jan 2022)

Affiliations

Sang Min Lee: Hardware Department, FuriosaAI, Inc., Seoul, Republic of Korea
Hanjoon Kim: Hardware Department, FuriosaAI, Inc., Seoul, Republic of Korea
Jeseung Yeon: Hardware Department, FuriosaAI, Inc., Seoul, Republic of Korea
Juyun Lee: Hardware Department, FuriosaAI, Inc., Seoul, Republic of Korea
Younggeun Choi: Hardware Department, FuriosaAI, Inc., Seoul, Republic of Korea
Minho Kim: Hardware Department, FuriosaAI, Inc., Seoul, Republic of Korea
Changjae Park: Hardware Department, FuriosaAI, Inc., Seoul, Republic of Korea
Kiseok Jang: Hardware Department, FuriosaAI, Inc., Seoul, Republic of Korea
Youngsik Kim: Hardware Department, FuriosaAI, Inc., Seoul, Republic of Korea
Yongseung Kim: Hardware Department, FuriosaAI, Inc., Seoul, Republic of Korea
Changman Lee: Hardware Department, FuriosaAI, Inc., Seoul, Republic of Korea
Hyuck Han: Hardware Department, FuriosaAI, Inc., Seoul, Republic of Korea
Won Eung Kim: Hardware Department, FuriosaAI, Inc., Seoul, Republic of Korea
Rui Tang: Marketing, Strategy and Operations Department, MSQUARE Ltd., Shanghai, China
Joon Ho Baek: Hardware Department, FuriosaAI, Inc., Seoul, Republic of Korea

DOI: https://doi.org/10.1109/OJSSCS.2022.3216798
Journal volume & issue: Vol. 2
pp. 219 – 230

Abstract

For energy-efficient accelerators in data centers that leverage advances in the performance and energy efficiency of recent algorithms, flexible architectures are critical to support state-of-the-art algorithms for various deep learning tasks. Due to the matrix multiplication units at the core of tensor operations, most recent programmable architectures lack flexibility for layers with diminished dimensions, especially for inferences where a large batch axis is rarely allowed. In addition, exploiting the data reuse inherent within tensor operations for computing a single matrix multiplication is challenging. In this work, an extension of a vector processor in 14 nm is proposed, which is customized to tensor operations. The flexible architecture enables a tensorized loop to support various data layouts and different shapes and sizes of tensor operations. It also exploits all possible data reuse, including input, weight, and output. Based on the tensorized loop, fetch and reduction networks, which unicast or multicast with the ordering of both input data and processing data, can be simplified using a circuit-switching-like network with configured topology and flow control for each tensor operation. Two processing elements can be fused to optimize latency for a large model or can operate individually for throughput. As a result, various state-of-the-art models can be processed efficiently with straightforward compiler optimization, and the highest energy efficiency of 13.4 inferences/s/W on EfficientNetV2-S is demonstrated.

Published in IEEE Open Journal of the Solid-State Circuits Society

ISSN: 2644-1349 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electric apparatus and materials. Electric circuits. Electric networks
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=8782712
