PISA-DMA: Processing-in-Memory Instruction Set Architecture Using DMA

Won Jun Lee; Chang Hyun Kim; Yoonah Paik; Seon Wook Kim

doi:10.1109/ACCESS.2023.3238812

IEEE Access (Jan 2023)

Affiliations

Won Jun Lee: Department of Electrical Engineering, Korea University, Seoul, South Korea
Chang Hyun Kim: Department of Electrical Engineering, Korea University, Seoul, South Korea
Yoonah Paik: Department of Electrical Engineering, Korea University, Seoul, South Korea
Seon Wook Kim: Department of Electrical Engineering, Korea University, Seoul, South Korea

DOI: https://doi.org/10.1109/ACCESS.2023.3238812
Journal volume & issue: Vol. 11
pp. 8622 – 8632

Abstract

Processing-in-memory (PIM) has attracted attention as a way to overcome the memory bandwidth limitation, especially for memory-intensive DNN applications. Most PIM approaches use the CPU’s memory requests to deliver instructions and operands to the PIM engines, which keeps a core busy and incurs unnecessary data transfer, resulting in significant offloading overhead. DMA can resolve this issue by transferring a large volume of consecutive data without CPU intervention and without polluting the memory hierarchy, which fits the PIM concept well. However, the limited computing resources of DRAM-based PIM devices allow only a small amount of data to be transferred per DMA transaction and therefore require a large number of descriptors, so significant offloading overhead remains. This paper introduces PISA-DMA, a PIM instruction set architecture (ISA) that uses a DMA descriptor to express a PIM opcode and operand in a single descriptor. The ISA makes PIM programming intuitive: committing one PIM instruction corresponds to completing one DMA transaction, and a sequence of PIM instructions is represented by the DMA descriptor list. PISA-DMA also minimizes the offloading overhead while guaranteeing compatibility with commercial platforms. It eliminates the opcode offloading overhead and achieves 1.25x, 1.31x, and 1.29x speedups over the baseline PIM at a sequence length of 128 with the BERT, RoBERTa, and GPT-2 models, respectively, in ONNX Runtime on real machines. We also study how the proposed PISA affects compiler optimization and show that operator fusion of matrix-matrix multiplication and element-wise addition achieves a 1.04x speedup, a performance gain similar to that obtained with conventional ISAs.
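
The central idea of the abstract, one DMA descriptor carrying both a PIM opcode and its operand, with a descriptor list standing for an instruction sequence, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the 32-byte layout, the field widths, the opcode values, and the names PimDescriptor and build_chain are hypothetical and are not taken from the paper, which may encode descriptors quite differently.

```python
import struct
from dataclasses import dataclass

# Hypothetical opcode values; the abstract does not list the real opcodes.
OP_GEMM = 0x1      # matrix-matrix multiplication
OP_ELT_ADD = 0x2   # element-wise addition

# Assumed little-endian layout: next, src, dst (u64 each), length (u32),
# opcode (u8), 3 padding bytes. This is an illustration, not the paper's format.
DESC_FORMAT = "<QQQIB3x"
DESC_SIZE = struct.calcsize(DESC_FORMAT)  # 32 bytes


@dataclass
class PimDescriptor:
    """One DMA descriptor that doubles as one PIM instruction (sketch)."""
    src: int         # operand source address
    dst: int         # destination address in the PIM device
    length: int      # number of bytes moved by this transaction
    opcode: int      # PIM opcode carried inside the descriptor
    next_addr: int = 0  # address of the next descriptor; 0 ends the list

    def pack(self) -> bytes:
        return struct.pack(DESC_FORMAT, self.next_addr, self.src,
                           self.dst, self.length, self.opcode)


def build_chain(instrs, base_addr):
    """Lay out (opcode, src, dst, length) tuples as a linked descriptor list.

    base_addr is the assumed memory address where the list will be placed,
    so each descriptor can point at its successor.
    """
    packed = []
    for i, (opcode, src, dst, length) in enumerate(instrs):
        is_last = i + 1 == len(instrs)
        next_addr = 0 if is_last else base_addr + (i + 1) * DESC_SIZE
        packed.append(PimDescriptor(src, dst, length, opcode, next_addr).pack())
    return b"".join(packed)


# A two-instruction "program": a multiplication followed by an element-wise add.
program = [
    (OP_GEMM, 0x1000, 0x8000, 256),
    (OP_ELT_ADD, 0x2000, 0x8000, 256),
]
chain = build_chain(program, base_addr=0x4000)
```

In this sketch the CPU would only write the packed list and start the DMA engine once; each completed transaction would then correspond to one committed PIM instruction, which mirrors the programming model described in the abstract.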

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639