An Efficient im2row-Based Fast Convolution Algorithm for ARM Cortex-M MCUs

Peng Wang; Xiaoqin Wang; Rui Luo; Dingyi Wang; Mengjie Luo; Shushan Qiao; Yumei Zhou

doi:10.1109/ACCESS.2021.3110827

IEEE Access (Jan 2021)

Affiliations

Peng Wang: Institute of Microelectronics of Chinese Academy of Sciences, Beijing, Chaoyang, China
Xiaoqin Wang: Institute of Microelectronics of Chinese Academy of Sciences, Beijing, Chaoyang, China
Rui Luo: Institute of Microelectronics of Chinese Academy of Sciences, Beijing, Chaoyang, China
Dingyi Wang: Institute of Microelectronics of Chinese Academy of Sciences, Beijing, Chaoyang, China
Mengjie Luo: Institute of Microelectronics of Chinese Academy of Sciences, Beijing, Chaoyang, China
Shushan Qiao: Institute of Microelectronics of Chinese Academy of Sciences, Beijing, Chaoyang, China
Yumei Zhou: Institute of Microelectronics of Chinese Academy of Sciences, Beijing, Chaoyang, China

DOI: https://doi.org/10.1109/ACCESS.2021.3110827
Journal volume & issue: Vol. 9, pp. 124384–124395

Abstract

With the rise of IoT and edge computing, deploying neural networks (NNs) on low-power edge computing devices is drawing increasing attention. In NNs, convolutional layers take up the majority of the computing cycles, especially when NNs are implemented on ARM processors. Therefore, it is necessary to optimize the convolution implementation on ARM Cortex-M MCUs. This paper proposes an efficient im2row-based fast convolution algorithm with two innovations. First, a novel im2row method that reuses the data of adjacent convolutional windows is presented. This method uses a reusable im2row buffer, which significantly reduces the amount of data copied during im2row and improves efficiency. Second, the implementation employs a q7_t to q15_t data type extension technique that avoids data reordering. This technique eliminates the data reordering instructions, thus reducing the runtime of the algorithm. We evaluate our algorithm on separate convolutional layers and on full NNs. The results for convolutional layers show that, compared to the baseline, the proposed algorithm speeds up the convolutional layer by an average of $1.42\times$, with a maximum speedup of $2.9\times$. Experiments on different NNs demonstrate that our algorithm can speed up the overall NN by up to $2.15\times$.
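The data-reuse idea in the first innovation can be illustrated with a minimal C sketch. It is not the paper's implementation, whose details the abstract does not give. It assumes a single-channel input, stride 1, no padding, and a fixed 3×3 kernel, and it reuses data only between horizontally adjacent windows. The names `K`, `load_col`, `conv_reuse`, and `conv_ref` are hypothetical. The buffer holds the current window column by column; when the window slides right by one, only the newly exposed column is copied, into the slot of the column that just left.

```c
#include <assert.h>
#include <stdint.h>

enum { K = 3 }; /* kernel size; fixed here for illustration only */

/* Reference: direct convolution, stride 1, no padding, single channel. */
static void conv_ref(const int8_t *img, int h, int w,
                     const int8_t *ker, int32_t *out)
{
    int oh = h - K + 1, ow = w - K + 1;
    for (int oy = 0; oy < oh; ++oy)
        for (int ox = 0; ox < ow; ++ox) {
            int32_t acc = 0;
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    acc += img[(oy + ky) * w + ox + kx] * ker[ky * K + kx];
            out[oy * ow + ox] = acc;
        }
}

/* Copy K vertically adjacent pixels of image column `col`, starting at
 * row `oy`, into buffer slot (col % K). Returns the element count copied. */
static long load_col(const int8_t *img, int w, int oy, int col, int8_t *buf)
{
    int8_t *slot = buf + (col % K) * K;
    for (int ky = 0; ky < K; ++ky)
        slot[ky] = img[(oy + ky) * w + col];
    return K;
}

/* Convolution using a reusable window buffer. Returns the total number
 * of elements copied into the buffer, to compare against a full copy. */
static long conv_reuse(const int8_t *img, int h, int w,
                       const int8_t *ker, int32_t *out)
{
    assert(h >= K && w >= K);
    int oh = h - K + 1, ow = w - K + 1;
    int8_t buf[K * K]; /* K column slots of K elements each */
    long copied = 0;

    for (int oy = 0; oy < oh; ++oy)
        for (int ox = 0; ox < ow; ++ox) {
            if (ox == 0) {
                /* First window of a row: fill every column slot. */
                for (int c = 0; c < K; ++c)
                    copied += load_col(img, w, oy, c, buf);
            } else {
                /* Sliding right: overwrite only the slot of the column
                 * that left the window with the new rightmost column. */
                copied += load_col(img, w, oy, ox + K - 1, buf);
            }
            int32_t acc = 0;
            for (int kx = 0; kx < K; ++kx) {
                /* Image column ox + kx lives in slot (ox + kx) % K. */
                const int8_t *slot = buf + ((ox + kx) % K) * K;
                for (int ky = 0; ky < K; ++ky)
                    acc += slot[ky] * ker[ky * K + kx];
            }
            out[oy * ow + ox] = acc;
        }
    return copied;
}
```

Under these assumptions, a 5×5 input with a 3×3 kernel needs 45 element copies, where refilling the whole buffer for each of the 9 windows would need 81. The trade-off in this sketch is that the dot product must index the slots in rotated order; a production kernel would have to handle that without losing the SIMD-friendly contiguous layout.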

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
