Addressing GPU memory limitations for Graph Neural Networks in High-Energy Physics applications

Claire Songhyun Lee; V. Hewes; Giuseppe Cerati; Kewei Wang; Adam Aurisano; Ankit Agrawal; Alok Choudhary; Wei-Keng Liao

doi:10.3389/fhpcp.2024.1458674

Frontiers in High Performance Computing (Sep 2024)

Affiliations

Claire Songhyun Lee: Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, United States
V. Hewes: Department of Physics, University of Cincinnati, Cincinnati, OH, United States
Giuseppe Cerati: Fermilab, Data Science, Simulation, and Learning Division, Batavia, IL, United States
Kewei Wang: Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, United States
Adam Aurisano: Department of Physics, University of Cincinnati, Cincinnati, OH, United States
Ankit Agrawal: Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, United States
Alok Choudhary: Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, United States
Wei-Keng Liao: Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, United States

DOI: https://doi.org/10.3389/fhpcp.2024.1458674
Journal volume & issue: Vol. 2

Abstract

Introduction: Reconstructing low-level particle tracks in neutrino physics can address some of the most fundamental questions about the universe. However, processing petabytes of raw data using deep learning techniques poses a challenging problem in the field of High Energy Physics (HEP). In the Exa.TrkX Project, an illustrative HEP application, preprocessed simulation data is fed into a state-of-the-art Graph Neural Network (GNN) model, accelerated by GPUs. However, limited GPU memory often leads to Out-of-Memory (OOM) exceptions during training, due to the large size of models and datasets. This problem is exacerbated when deploying models on High-Performance Computing (HPC) systems designed for large-scale applications.

Methods: We observe a high workload imbalance during GNN model training, caused by the irregular sizes of input graph samples in HEP datasets, which contributes to OOM exceptions. We aim to scale GNNs on HPC systems by prioritizing workload balance in graph inputs while maintaining model accuracy. Our paper introduces diverse balancing strategies aimed at decreasing the maximum GPU memory footprint and avoiding OOM exceptions across various datasets.

Results: Our experiments show a memory reduction of up to 32.14% compared to the baseline. We also demonstrate that the proposed strategies can avoid OOM exceptions in the application. Additionally, we create a distributed multi-GPU implementation using these samplers to demonstrate the scalability of these techniques on the HEP dataset.

Discussion: By assessing the performance of these strategies as data loading samplers across multiple datasets, we can gauge their effectiveness in both single-GPU and distributed environments. Our experiments, conducted on datasets of varying sizes and across multiple GPUs, broaden the applicability of our work to various GNN applications that handle input datasets with irregular graph sizes.
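The abstract does not spell out the balancing strategies themselves. As a purely illustrative sketch of the general idea (not the authors' method), the Python below compares a size-agnostic baseline against one simple greedy balancing heuristic. The function names, the use of node count as the size measure, and the example sizes are all assumptions made for illustration.

```python
import heapq

def sequential_batches(sizes, batch_size):
    """Baseline: consecutive samples form a batch, ignoring graph size."""
    idx = list(range(len(sizes)))
    return [idx[i:i + batch_size] for i in range(0, len(idx), batch_size)]

def balanced_batches(sizes, batch_size):
    """Hypothetical balancing heuristic (an assumption, not from the paper):
    place graphs largest-first into the currently lightest batch that
    still has a free slot."""
    n_batches = -(-len(sizes) // batch_size)  # ceiling division
    batches = [[] for _ in range(n_batches)]
    heap = [(0, b) for b in range(n_batches)]  # (total size so far, batch id)
    heapq.heapify(heap)
    order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    for i in order:
        load, b = heapq.heappop(heap)
        batches[b].append(i)
        # A full batch is not pushed back, so it receives no more samples.
        if len(batches[b]) < batch_size:
            heapq.heappush(heap, (load + sizes[i], b))
    return batches

def max_load(sizes, batches):
    """Largest per-batch total size, a rough proxy for peak memory use."""
    return max(sum(sizes[i] for i in batch) for batch in batches)

# Made-up node counts for eight graphs with very irregular sizes.
sizes = [900, 850, 800, 100, 90, 80, 70, 60]
baseline = sequential_batches(sizes, batch_size=4)
balanced = balanced_batches(sizes, batch_size=4)
print(max_load(sizes, baseline), max_load(sizes, balanced))
```

In this toy setting the heaviest batch shrinks because large graphs are spread across batches rather than landing together. Whether total node count tracks actual GPU memory use depends on the model, so the proxy should be treated as an assumption.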

Published in Frontiers in High Performance Computing

ISSN: 2813-7337 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software
Website: https://www.frontiersin.org/journals/high-performance-computing
