BL-PIM: Varying the Burst Length to Realize the All-Bank Performance and Minimize the Multi-Workload Interference for in-DRAM PIM

Chang Hyun Kim; Won Jun Lee; Yoonah Paik; Seok Young Kim; Seon Wook Kim

doi:10.1109/ACCESS.2023.3300893

IEEE Access (Jan 2023)

Affiliations

Chang Hyun Kim: Department of Electrical Engineering, Korea University, Seoul, South Korea
Won Jun Lee: Department of Electrical Engineering, Korea University, Seoul, South Korea
Yoonah Paik: Department of Electrical Engineering, Korea University, Seoul, South Korea
Seok Young Kim: Department of Electrical Engineering, Korea University, Seoul, South Korea
Seon Wook Kim: Department of Electrical Engineering, Korea University, Seoul, South Korea

DOI: https://doi.org/10.1109/ACCESS.2023.3300893
Journal volume & issue: Vol. 11
pp. 81143 – 81156

Abstract

As demand for transformer applications grows rapidly, technologies that relieve memory bottlenecks are attracting attention. One of them is the in-DRAM Processing-In-Memory (PIM) architecture, which performs computation inside DRAM. Major DRAM makers have introduced PIM samples that execute computations in all banks simultaneously to maximize internal DRAM bandwidth and achieve the highest performance. However, realizing this as a commercial product is problematic: all-bank execution cannot run non-PIM applications concurrently on PIM memory during PIM execution, so their memory spaces must be separated. This paper proposes BL-PIM, an architecture that increases the burst length (BL) of memory requests inside a bank to maximize internal bandwidth and overlap computation across banks, thus achieving all-bank performance. Outside a bank, in contrast, the BL appears unchanged, which preserves data consistency in the memory hierarchy and allows non-PIM and PIM applications to run together on PIM memory. In addition, memory-intensive PIM computation with a larger BL significantly reduces its outstanding memory requests, minimizing performance interference with other applications. We carefully extend the DRAM timing diagram and develop a cooperation mechanism between the memory controller and the PIM device. We implemented the BL-PIM architecture on an FPGA and compared its performance with real machines using four transformer models and eight compute- and memory-bound SPEC benchmarks. On the transformer models, BL-PIM was up to 28.9x and 12.0x faster than single-threaded and multi-threaded CPU execution, respectively. When we increased the burst length by the maximum factor of 16, BL-PIM was 1.2x faster than ideal all-bank PIM execution. We also evaluated multi-workload execution using the SPEC benchmarks, showing that our architecture can minimize performance interference. To our knowledge, this is the first public study of multi-workload execution on PIM.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
