FT-PBLAS: PBLAS-Based Fault-Tolerant Linear Algebra Computation on High-performance Computing Systems

Yanchao Zhu; Yi Liu; Guozhen Zhang

doi:10.1109/ACCESS.2020.2975832

IEEE Access (Jan 2020)

FT-PBLAS: PBLAS-Based Fault-Tolerant Linear Algebra Computation on High-performance Computing Systems

Yanchao Zhu,
Yi Liu,
Guozhen Zhang

Affiliations

Yanchao Zhu: ORCiD; School of Computer Science and Engineering, Sino-German Joint Software Institute, Beihang University, Beijing, China
Yi Liu: ORCiD; School of Computer Science and Engineering, Sino-German Joint Software Institute, Beihang University, Beijing, China
Guozhen Zhang: ORCiD; School of Computer Science and Engineering, Sino-German Joint Software Institute, Beihang University, Beijing, China

DOI: https://doi.org/10.1109/ACCESS.2020.2975832
Journal volume & issue: Vol. 8
pp. 42674 – 42688

Abstract

Read online

As high-performance computing (HPC) systems have scaled up, resilience has become a great challenge. To guarantee resilience, various kinds of hardware and software techniques have been proposed. However, among popular software fault-tolerant techniques, both the checkpoint-restart approach and the replication technique face challenges of scalability in the era of peta- and exa-scale systems due to their numerous processes. In this situation, algorithm-based approaches, or algorithm-based fault tolerance (ABFT) mechanisms, have become attractive because they are efficient and lightweight. Although the ABFT technique is algorithm-dependent, it is possible to implement it at a low level (e.g., in libraries for basic numerical algorithms) and make it application-independent. However, previous ABFT approaches have mainly aimed at achieving fault tolerance in integrated circuits (ICs) or at the architecture level and are therefore not suitable for HPC systems; e.g., they use checksums of rows and columns of matrices rather than checksums of blocks to detect errors. Furthermore, they cannot deal with errors caused by node failure, which are common in current HPC systems. To solve these problems, this paper proposes FT-PBLAS, a PBLAS-based library for fault-tolerant parallel linear algebra computations that can be regarded as a fault-tolerant version of the parallel basic linear algebra subprograms (PBLAS), because it provides a series of fault-tolerant versions of interfaces in PBLAS. To support the underlying error detection and recovery mechanisms in the library, we propose a block-checksum approach for non-fatal errors and a scheme for addressing node failure, respectively. We evaluate two fault-tolerant mechanisms and FT-PBLAS on HPC systems, and the experimental results demonstrate the performance of our library.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639

About the journal

Abstract

Keywords