Practical Near-Data-Processing Architecture for Large-Scale Distributed Graph Neural Network

Linyong Huang; Zhe Zhang; Shuangchen Li; Dimin Niu; Yijin Guan; Hongzhong Zheng; Yuan Xie

doi:10.1109/ACCESS.2022.3169423

IEEE Access (Jan 2022)

Affiliations

Linyong Huang: College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China
Zhe Zhang: Alibaba Group, Hangzhou, China
Shuangchen Li: Alibaba Group, Hangzhou, China
Dimin Niu: Alibaba Group, Hangzhou, China
Yijin Guan: Alibaba Group, Hangzhou, China
Hongzhong Zheng: Alibaba Group, Hangzhou, China
Yuan Xie: Alibaba Group, Hangzhou, China

DOI: https://doi.org/10.1109/ACCESS.2022.3169423
Journal volume & issue: Vol. 10, pp. 46796–46807

Abstract

Graph Neural Networks (GNNs) have drawn tremendous attention in the past few years due to their convincing performance and high interpretability in various graph-based tasks such as link prediction and node classification. As real-world graphs keep growing, especially industrial graphs at the billion scale, graph storage can easily consume terabytes, so GNNs have to be processed in a distributed manner. As a result, execution can be inefficient due to expensive cross-node communication and irregular memory access. Various GNN accelerators have been proposed for efficient GNN processing. However, they mainly focus on small and medium-size graphs and are not applicable to large-scale distributed graphs. In this paper, we present a practical Near-Data-Processing (NDP) architecture based on a memory-pool system for large-scale distributed GNNs. We propose a customized memory fabric interface to construct the memory pool for low-latency, high-throughput cross-node communication, which provides flexible memory allocation and strong scalability. A practical NDP design is proposed for efficient work offloading and improved bandwidth utilization. Moreover, we introduce a partition and scheduling scheme to further improve performance and achieve workload balance. Comprehensive evaluations demonstrate that the proposed architecture can achieve up to $27\times$ and $8\times$ higher training speed than two state-of-the-art distributed GNN frameworks, Deep Graph Library and $P^{3}$, respectively.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
