IEEE Access (Jan 2020)

Efficient Parallel Inflated 3D Convolution Architecture for Action Recognition

  • Yukun Huang,
  • Yongcai Guo,
  • Chao Gao

DOI
https://doi.org/10.1109/ACCESS.2020.2978223
Journal volume & issue
Vol. 8
pp. 45753 – 45765

Abstract

Deep neural networks have received increasing attention in human action recognition. Previous research has established that 3D convolution is a reasonable approach to learning spatio-temporal representations. Nevertheless, constructing effective 3D ConvNets usually requires an expensive pre-training process performed on a huge-scale video dataset. To avoid this burden, one major issue is to determine whether the pre-trained parameters of 2D convolution networks can be directly bootstrapped into 3D. In this paper, we devise a 2D-Inflated operation and a parallel 3D ConvNet architecture to solve this problem. The 2D-Inflated operation converts pre-trained 2D ConvNets into 3D ConvNets, which avoids pre-training on video data. We further explore the optimal number of 3D ConvNets in the parallel architecture, and the results suggest that a 6-net architecture is an excellent solution for recognition. Another contribution of our study is two practical and effective techniques: accumulated gradient descent and video sequence decomposition. Either technique improves performance. The recognition results on UCF101 and HMDB51 reveal that, without video data pre-training, our 3D ConvNets can still achieve performance competitive with other generic and recent 3D ConvNet methods in the RGB image domain.
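
The abstract does not spell out how the 2D-Inflated operation works. As a rough illustration of the general idea of bootstrapping 2D weights into 3D, the sketch below uses a commonly described inflation scheme: the 2D kernel is repeated along a new temporal axis and each copy is scaled by 1/depth. This scheme is an assumption made for illustration and may differ from the operation defined in the paper; the names `inflate_kernel`, `correlate_2d`, and `correlate_3d` are hypothetical.

```python
def inflate_kernel(kernel_2d, depth):
    """Inflate a 2D kernel (list of rows) into `depth` temporal slices.

    Assumed scheme (not confirmed by the abstract): every slice is the
    2D kernel scaled by 1/depth.
    """
    if depth < 1:
        raise ValueError("depth must be >= 1")
    return [[[w / depth for w in row] for row in kernel_2d]
            for _ in range(depth)]


def correlate_2d(patch, kernel):
    """Response of a 2D kernel on a single patch of the same size."""
    return sum(p * k
               for patch_row, kernel_row in zip(patch, kernel)
               for p, k in zip(patch_row, kernel_row))


def correlate_3d(clip, kernel_3d):
    """Response of a 3D kernel on a clip with one patch per temporal slice."""
    return sum(correlate_2d(frame, slice_2d)
               for frame, slice_2d in zip(clip, kernel_3d))


# A toy 3x3 kernel standing in for one pre-trained 2D filter.
kernel_2d = [[1.0, 0.0, -1.0],
             [2.0, 0.0, -2.0],
             [1.0, 0.0, -1.0]]
frame = [[3.0, 1.0, 4.0],
         [1.0, 5.0, 9.0],
         [2.0, 6.0, 5.0]]

kernel_3d = inflate_kernel(kernel_2d, depth=4)
static_clip = [frame] * 4  # a "video" made of four identical frames

response_2d = correlate_2d(frame, kernel_2d)
response_3d = correlate_3d(static_clip, kernel_3d)
```

Under this assumed scheme, the inflated filter gives the same response on a clip of identical frames as the 2D filter gives on a single frame, which is why 2D pre-trained weights can serve as a reasonable starting point for a 3D network before fine-tuning on video.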

Keywords