IEEE Access (Jan 2021)

Optimizing Spatiotemporal Feature Learning in 3D Convolutional Neural Networks With Pooling Blocks

  • Rockson Agyeman,
  • Muhammad Rafiq,
  • Hyun Kwang Shin,
  • Bernhard Rinner,
  • Gyu Sang Choi

DOI: https://doi.org/10.1109/ACCESS.2021.3078295
Journal volume & issue: Vol. 9, pp. 70797 – 70805

Abstract


Image data contain spatial information only, making two-dimensional (2D) Convolutional Neural Networks (CNNs) ideal for solving image classification problems. Video data, on the other hand, contain both spatial and temporal information that must be analyzed simultaneously to solve action recognition problems. 3D CNNs are successfully used for these tasks, but they suffer from their inherently large number of parameters. Increasing the network’s depth, as is common among 2D CNNs, and hence increasing the number of trainable parameters does not provide a good trade-off between accuracy and complexity for 3D CNNs. In this work, we propose the Pooling Block (PB) as an enhanced pooling operation for optimizing action recognition by 3D CNNs. A PB comprises three kernels of different sizes. The three kernels sub-sample feature maps simultaneously, and their outputs are concatenated into a single output vector. We compare our approach with three benchmark 3D CNNs (C3D, I3D, and Asymmetric 3D CNN) on three datasets (HMDB51, UCF101, and Kinetics 400). Our PB method yields significant improvement in 3D CNN performance with a comparatively small increase in the number of trainable parameters. We further investigate (1) the effect of video frame dimension and (2) the effect of the number of video frames on the performance of 3D CNNs, using C3D as the benchmark.
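To make the pooling-block idea concrete, the sketch below shows one way such a block could be written in PyTorch: three pooling branches with different kernel sizes operate on the same feature map in parallel and their outputs are concatenated. The specific kernel sizes (2, 3, 5), the use of max pooling, and channel-wise concatenation are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of a multi-kernel pooling block, assuming three parallel
# max-pooling branches whose padding is chosen so all outputs share the same
# spatial/temporal size and can be concatenated along the channel axis.
import torch
import torch.nn as nn

class PoolingBlock(nn.Module):
    """Sub-samples a 3D feature map with three pooling kernels of different
    sizes in parallel and concatenates the results channel-wise."""
    def __init__(self):
        super().__init__()
        # All branches use stride 2, so each halves the frame, height, and
        # width dimensions; padding keeps the three output shapes identical.
        self.pool_small = nn.MaxPool3d(kernel_size=2, stride=2, padding=0)
        self.pool_medium = nn.MaxPool3d(kernel_size=3, stride=2, padding=1)
        self.pool_large = nn.MaxPool3d(kernel_size=5, stride=2, padding=2)

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        return torch.cat(
            [self.pool_small(x), self.pool_medium(x), self.pool_large(x)],
            dim=1,  # concatenate along the channel dimension
        )

if __name__ == "__main__":
    # Hypothetical C3D-style feature maps: 64 channels, 16 frames, 112x112.
    feature_maps = torch.randn(1, 64, 16, 112, 112)
    out = PoolingBlock()(feature_maps)
    print(out.shape)  # torch.Size([1, 192, 8, 56, 56])
```

Relative to a single pooling kernel, this arrangement triples the channel count of the sub-sampled output, which is consistent with the abstract's note that PB adds a comparatively small number of trainable parameters to the surrounding convolutional layers rather than deepening the network.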

Keywords