IEEE Access (Jan 2024)

Non-Invasive, Memory Access-Triggered Near-Data Processing for DNN Training Acceleration on GPUs

  • Hyungkyu Ham,
  • Hyunuk Cho,
  • Minjae Kim,
  • Jueon Park,
  • Jeongmin Hong,
  • Hyojin Sung,
  • Eunhyeok Park,
  • Euicheol Lim,
  • Gwangsun Kim

DOI
https://doi.org/10.1109/ACCESS.2024.3465789
Journal volume & issue
Vol. 12
pp. 142651–142667

Abstract

GPUs currently face significant challenges during DNN training due to limited off-chip memory bandwidth (BW) and capacity. To address these bottlenecks, we propose a memory access-triggered near-data processing (matNDP) architecture that offloads memory- and communication-bound operations. With matNDP, normal memory accesses also serve as implicit NDP requests, enabling NDP in a non-invasive manner without modifying the core-side ISA, microarchitecture, or software, for practicality. Additionally, matNDP enables on-the-fly NDP, in which data already supplied by normal memory requests for compute-bound operations are reused for NDP; thus, matNDP can overlap even dependent kernels while also reducing memory traffic. Moreover, with this overlap, memory BW underutilized by the GPU cores can be used by the NDP units to improve performance under the same total memory BW. The matNDP units can be deployed in heterogeneous memory devices across a system. First, we deploy them near the GPU's memory controllers. Second, we deploy NDP units in memory expanders connected to multiple GPUs to create an NDP-enabled memory eXpander Network (NDPXNet), which can entirely offload gradient reduction and the optimizer in data-parallel training, achieving additional speedups while eliminating redundancy in memory usage and optimizer execution. Thus, we 1) enable NDP without core HW/SW changes, 2) overlap the execution of dependent layers, and 3) offload both memory- and communication-bound operations from GPUs in DNN training. Through our deep learning compiler support, NDP kernels are generated automatically without any model code modification. Consequently, matNDP improves training throughput by up to 2.73× and reduces energy by up to 41.4%.
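
To make the NDPXNet offload concrete, the following Python snippet is a minimal conceptual sketch, not the authors' implementation and not tied to any real NDP hardware or library API. It contrasts a baseline data-parallel step, in which every GPU redundantly holds optimizer state and applies the same update to its own parameter replica, with an NDPXNet-style step in which gradient reduction and a momentum-SGD optimizer execute once near the data in the shared expander. All names (baseline_step, ndpxnet_step, NUM_GPUS, and so on) are hypothetical and chosen only for illustration.

```python
# Hypothetical illustration of the offload described in the abstract.
import numpy as np

NUM_GPUS = 4
NUM_PARAMS = 1_000_000
LR, MOMENTUM = 0.01, 0.9

rng = np.random.default_rng(0)
params = rng.standard_normal(NUM_PARAMS).astype(np.float32)
local_grads = [rng.standard_normal(NUM_PARAMS).astype(np.float32)
               for _ in range(NUM_GPUS)]

def baseline_step(params, local_grads):
    """Baseline data parallelism: gradients are all-reduced across GPUs,
    then every GPU keeps its own momentum state and redundantly applies
    the same SGD update to its parameter replica."""
    reduced = np.mean(local_grads, axis=0)          # all-reduce on the GPUs
    replicas, states = [], []
    for _ in range(NUM_GPUS):                       # NUM_GPUS redundant copies
        velocity = MOMENTUM * np.zeros_like(params) + reduced
        replicas.append(params - LR * velocity)
        states.append(velocity)
    return replicas[0], states

def ndpxnet_step(params, local_grads, velocity):
    """NDPXNet-style flow: gradients pushed to the memory expander are
    reduced there, and the momentum-SGD optimizer executes once near the
    data; GPUs then read back the updated parameters."""
    reduced = np.mean(local_grads, axis=0)          # reduction in the expander
    velocity = MOMENTUM * velocity + reduced        # single optimizer-state copy
    return params - LR * velocity, velocity

p_base, states = baseline_step(params, local_grads)
p_ndp, _ = ndpxnet_step(params, local_grads, np.zeros_like(params))
print("updates match:", np.allclose(p_base, p_ndp))
print(f"optimizer-state copies: baseline={len(states)}, ndpxnet=1")
```

In this toy model both flows produce identical parameters, but the NDPXNet-style step keeps a single copy of the optimizer state and runs the update once rather than NUM_GPUS times, which is the redundancy in memory usage and optimizer execution that the abstract refers to.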

Keywords