IEEE Access (Jan 2021)
p-im2col: Simple Yet Efficient Convolution Algorithm With Flexibly Controlled Memory Overhead
Abstract
Convolution is the most time-consuming operation in modern deep artificial neural networks, so its performance is crucial for fast inference. One standard approach to fast convolution is to use GeMM-based convolution algorithms that rely on efficient general matrix multiplication (GeMM) routines from optimized BLAS libraries. However, commonly used GeMM-based algorithms may cause significant memory overhead, or avoid it only at the cost of worse performance. In this paper, we propose a novel convolution algorithm, p-im2col, based on the well-known im2col algorithm; it avoids memory overhead by splitting a single multiplication of a large matrix into several multiplications of smaller matrices. We compare our algorithm, both theoretically and experimentally, with two other GeMM-based algorithms: im2col, which is widely used as a baseline, and the memory-efficient kn2row-aa. We measure the inference time of these algorithms on central processing units of the x86, x86_64, ARM, and MIPS architectures over a large set of convolutional parameters. The proposed algorithm demonstrates a speedup over im2col and kn2row-aa in a number of cases, and it requires significantly less additional memory than im2col. Based on our experiments, we present a new convolution algorithm selection scheme that takes memory restrictions, CPU architecture, and convolutional parameters into account and provides a noticeable advantage over each individual algorithm.
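To make the splitting idea concrete, the following NumPy sketch (our illustration under stated assumptions, not the authors' reference implementation; the function and the `panel` parameter name are ours) builds the im2col matrix only a few output columns at a time, so the temporary buffer holds `C*kh*kw × panel` elements instead of the full `C*kh*kw × H_out*W_out` matrix, and each panel is handled by one small GeMM:

```python
import numpy as np

def conv2d_p_im2col(x, w, panel=4):
    """Stride-1, unpadded 2-D convolution, computed panel-by-panel.

    x: input of shape (C, H, W); w: kernels of shape (M, C, kh, kw).
    Instead of materializing the full im2col matrix of shape
    (C*kh*kw, H_out*W_out), we build and multiply `panel` output
    columns at a time, capping the additional memory.
    """
    C, H, W = x.shape
    M, _, kh, kw = w.shape
    Ho, Wo = H - kh + 1, W - kw + 1
    wm = w.reshape(M, C * kh * kw)          # kernel matrix for GeMM
    y = np.empty((M, Ho * Wo), dtype=x.dtype)
    for start in range(0, Ho * Wo, panel):  # one small GeMM per panel
        stop = min(start + panel, Ho * Wo)
        cols = np.empty((C * kh * kw, stop - start), dtype=x.dtype)
        for j, idx in enumerate(range(start, stop)):
            r, c = divmod(idx, Wo)          # output pixel for this column
            cols[:, j] = x[:, r:r + kh, c:c + kw].ravel()
        y[:, start:stop] = wm @ cols
    return y.reshape(M, Ho, Wo)
```

With `panel = H_out * W_out` this degenerates to plain im2col; smaller panels trade a little GeMM efficiency for a proportionally smaller buffer, which is the memory/performance knob the paper's title refers to.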
Keywords