IEEE Access (Jan 2022)
Demand MemCpy: Overlapping of Computation and Data Transfer for Heterogeneous Computing
Abstract
Heterogeneous computing relies on collaboration among different types of processors operating on shared data. In systems with discrete accelerators (e.g., GP-GPUs), sharing data requires transferring large amounts of data between CPU and accelerator memories, which can significantly increase end-to-end execution time. This paper proposes a novel mechanism called Demand MemCpy (DMC) to hide these data sharing overheads. DMC copies data from host memory to accelerator memory on demand at page granularity. It uses a hardware-only mechanism to fetch the requested page with short latency and a background pre-copy to fetch related pages in advance. Our evaluation shows that DMC can reduce the end-to-end execution time of GP-GPU applications by 25.4% on average by overlapping computation with data transfer and by not transferring unused pages.
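The idea of demand-driven copying with background pre-copy can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism the paper proposes: the class name `DemandCopyModel`, the 4 KiB page size, and the policy of treating the next `precopy_window` sequential pages as "related" are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # bytes; assumed page size, not taken from the paper


@dataclass
class DemandCopyModel:
    """Toy model: pages move to the accelerator only when first touched."""

    num_pages: int
    precopy_window: int = 2  # hypothetical: sequential pages pre-copied per demand fetch
    resident: set = field(default_factory=set)  # pages already in accelerator memory
    demand_fetches: int = 0  # accesses that had to wait for a page copy
    precopied: int = 0       # pages brought in ahead of any access

    def access(self, addr: int) -> None:
        """Simulate one accelerator access to a host byte address."""
        page = addr // PAGE_SIZE
        if not 0 <= page < self.num_pages:
            raise IndexError(f"address {addr} is outside the buffer")
        if page in self.resident:
            return  # already copied; no transfer needed
        # Demand fetch: copy the page that was actually requested.
        self.resident.add(page)
        self.demand_fetches += 1
        # Background pre-copy: assume the following pages are "related".
        last = min(page + self.precopy_window, self.num_pages - 1)
        for p in range(page + 1, last + 1):
            if p not in self.resident:
                self.resident.add(p)
                self.precopied += 1

    def pages_transferred(self) -> int:
        """Total pages copied, by demand fetch or by pre-copy."""
        return len(self.resident)
```

In this model, a kernel that touches only the first few pages of a 16-page buffer transfers only those pages plus their pre-copied neighbors, whereas an up-front bulk copy would move all 16. The model counts pages only and does not capture latency or the overlap of transfer with computation.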
Keywords