IEEE Access (Jan 2020)
Two-Stage Column Block Parallel LU Factorization Algorithm
Abstract
Parallel computing is increasingly important in computer architecture, and parallel architectures have become ubiquitous in everyday life. Novel architectures and programming models pose new challenges for algorithm design and system software development. This paper presents a two-stage column block parallel LU factorization algorithm for multiple-processor architectures. A given matrix is first partitioned into large blocks, and every large block is then partitioned into a number of small blocks according to the number of processors. Finally, the small column blocks are allocated to processors in an orderly “serpentine arrangement.” Each iteration of the column block parallel LU factorization is separated into two stages of operation. In the first stage, the first-step factorization operation is processed in advance, and nonblocking communication is used to reduce processor idle and waiting time and to improve parallelism. In the second stage, the large blocks are used to supply more powerful processors, such as GPUs, which require more data to exploit their computing capabilities. Experiments are conducted on a multicore system and a multi-GPU system with different configurations to test the algorithm's performance. Compared with other related column block parallel LU factorizations, the two-stage algorithm exhibits better load balancing and shorter parallel execution time.
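As an illustration of the block allocation described in the abstract, the following minimal Python sketch maps small column blocks to processors. It assumes that "serpentine arrangement" means a back-and-forth sweep: the small blocks of one large block are assigned to processors 0 to P-1, and those of the next large block to processors P-1 down to 0. This reading is an assumption, and the function names `serpentine_owner` and `serpentine_layout` are hypothetical, not taken from the paper.

```python
from typing import List


def serpentine_owner(block_index: int, num_procs: int) -> int:
    """Return the processor assumed to own a small column block.

    block_index counts small column blocks from left to right across the
    whole matrix; each large block is assumed to hold num_procs of them.
    """
    sweep, offset = divmod(block_index, num_procs)
    # Even sweeps run from processor 0 upward; odd sweeps run back down.
    return offset if sweep % 2 == 0 else num_procs - 1 - offset


def serpentine_layout(num_blocks: int, num_procs: int) -> List[List[int]]:
    """List, for each processor, the small column blocks it owns."""
    owned: List[List[int]] = [[] for _ in range(num_procs)]
    for block in range(num_blocks):
        owned[serpentine_owner(block, num_procs)].append(block)
    return owned


if __name__ == "__main__":
    # Two large blocks, each split into four small blocks for four processors.
    for proc, blocks in enumerate(serpentine_layout(8, 4)):
        print(f"processor {proc}: column blocks {blocks}")
```

Under this assumed layout, each processor receives one early and one late column block from every pair of large blocks. That pairing is one plausible reason such an arrangement could help load balancing, since in LU factorization the work remaining on a column depends on its position in the matrix.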
Keywords