A Case for Malleable Thread-Level Linear Algebra Libraries: The LU Factorization With Partial Pivoting

Sandra Catalan; Jose R. Herrero; Enrique S. Quintana-Orti; Rafael Rodriguez-Sanchez; Robert Van De Geijn

doi:10.1109/ACCESS.2019.2895541

IEEE Access (Jan 2019)

Affiliations

Sandra Catalan: Departamento de Ingeniería y Ciencia de Computadores, Universidad Jaume I, Castellón de la Plana, Spain
Jose R. Herrero: Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain
Enrique S. Quintana-Orti: Departamento de Ingeniería y Ciencia de Computadores, Universidad Jaume I, Castellón de la Plana, Spain
Rafael Rodriguez-Sanchez: Departamento de Arquitectura de Computadores y Automática, Universidad Complutense de Madrid, Madrid, Spain
Robert Van De Geijn: Department of Computer Science, Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX, USA

DOI: https://doi.org/10.1109/ACCESS.2019.2895541
Journal volume & issue: Vol. 7, pp. 17617–17633

Abstract

We propose two novel techniques for overcoming load-imbalance encountered when implementing so-called look-ahead mechanisms in relevant dense matrix factorizations for the solution of linear systems. Both techniques target the scenario where two thread teams are created/activated during the factorization, with each team in charge of performing an independent task/branch of execution. The first technique promotes worker sharing (WS) between the two tasks, allowing the threads of the task that completes first to be reallocated for use by the costlier task. The second technique allows a fast task to alert the slower task of completion, enforcing the early termination (ET) of the second task, and a smooth transition of the factorization procedure into the next iteration. The two mechanisms are instantiated via a new malleable thread-level implementation of the basic linear algebra subprograms, and their benefits are illustrated via an implementation of the LU factorization with partial pivoting enhanced with look-ahead. Concretely, our experimental results on an Intel-Xeon system with 12 cores show the benefits of combining WS+ET, reporting competitive performance in comparison with a task-parallel runtime-based solution.
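The load-balancing idea behind WS and ET can be illustrated with a toy timing model. The sketch below is not the authors' implementation: it assumes two tasks with abstract work amounts, perfectly linear speed-up within each thread team, and zero cost for reassigning threads or signalling completion. All function and parameter names are hypothetical.

```python
def time_static(work_a, work_b, threads_a, threads_b):
    """Iteration time when each team keeps its threads until both tasks finish."""
    return max(work_a / threads_a, work_b / threads_b)


def time_worker_sharing(work_a, work_b, threads_a, threads_b):
    """Iteration time when the team that finishes first joins the slower task (WS).

    Assumption: ideal scaling, so the remaining work of the slower task is
    divided evenly among all threads once the faster task completes.
    """
    t_a = work_a / threads_a
    t_b = work_b / threads_b
    first = min(t_a, t_b)
    if t_a <= t_b:
        remaining = work_b - first * threads_b  # work left in task B
    else:
        remaining = work_a - first * threads_a  # work left in task A
    return first + remaining / (threads_a + threads_b)


def early_termination(work_a, work_b, threads_a, threads_b):
    """Model of ET: the slower task stops as soon as the faster one completes.

    Returns (iteration_time, work_done_by_slow_task). The unfinished part of
    the slow task is assumed to be deferred to the next iteration.
    """
    t_a = work_a / threads_a
    t_b = work_b / threads_b
    first = min(t_a, t_b)
    slow_threads = threads_b if t_a <= t_b else threads_a
    return first, first * slow_threads


if __name__ == "__main__":
    # 12 threads split 2 + 10; the work amounts are arbitrary illustrative units.
    print(time_static(8, 100, 2, 10))
    print(time_worker_sharing(8, 100, 2, 10))
    print(early_termination(8, 100, 2, 10))
```

With these illustrative inputs, the static split takes 10 time units, worker sharing reduces this to 9, and early termination ends the iteration at time 4 with 40 of the 100 units of the slower task completed. The numbers only show the direction of the effect under the stated assumptions; they say nothing about the performance reported in the paper.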

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
