IEEE Access (Jan 2023)
Analyzing Data Locality on GPU Caches Using Static Profiling of Workloads
Abstract
The diversity of workloads drives studies toward using GPUs more effectively despite their limited memory. In particular, it is essential to understand and exploit the data locality of workloads in order to use GPU memory and caches efficiently, since they are smaller than a CPU's. Understanding the GPU memory hierarchy is likewise important for using it efficiently in a multi-threaded environment. Although previous approaches have analyzed data locality on GPUs, they focused on the global memory and L2 cache levels, profiling at the thread-block level; warp-level data locality on GPUs has received little attention. In particular, while the concept of coalescing is well defined, methods for measuring the degree of coalescing have not been discussed. Our study analyzes data locality at the L1 cache level, the smallest but fastest cache, to assess the impact of data locality. To achieve this, we profile data locality at the warp level, the smallest unit of GPU thread grouping. This paper introduces a novel perspective: a quantitative measure of coalescing alongside static profiling of data locality. Furthermore, it offers a means of refining locality estimates by scrutinizing L1 cache access patterns. To substantiate our approach, we validate the estimated data locality against a range of real-world GPU benchmarks, including Rodinia and PolyBench. Our empirical results reveal a substantial correlation between the data-locality metrics and cache utilization, affirming the efficacy of the proposed method.
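As an illustration of what a quantitative coalescing measure can look like (the paper's own definition appears in the body, not the abstract), the sketch below computes one common notion: the ratio of a warp's 32 accesses to the number of distinct cache lines those accesses touch. The function name, the 128-byte line size, and the access patterns are assumptions for this example, not the paper's exact formulation.

```python
# Illustrative sketch: degree of coalescing for one warp's memory accesses.
# Assumption: a 128-byte cache line, as on many NVIDIA GPUs; the metric here
# is (accesses per warp) / (distinct cache lines touched), so a fully
# coalesced warp scores 32 and a fully scattered one scores 1.

def coalescing_degree(addresses, line_size=128):
    """Return accesses-per-cache-line for one warp's byte addresses."""
    lines_touched = {addr // line_size for addr in addresses}
    return len(addresses) / len(lines_touched)

# Fully coalesced: 32 threads read consecutive 4-byte words (one line).
coalesced = [tid * 4 for tid in range(32)]

# Fully scattered: each thread's address falls in its own cache line.
scattered = [tid * line for tid, line in zip(range(32), [128] * 32)]
scattered = [tid * 128 for tid in range(32)]

print(coalescing_degree(coalesced))   # one 128-byte line for all 32 accesses
print(coalescing_degree(scattered))   # one line per access
```

A static profiler can evaluate such a measure per load instruction from the thread-index expressions in kernel source, without executing the kernel.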
Keywords