Semi-CNN Architecture for Effective Spatio-Temporal Learning in Action Recognition

Mei Chee Leong; Dilip K. Prasad; Yong Tsui Lee; Feng Lin

doi:10.3390/app10020557

Applied Sciences (Jan 2020)

Affiliations

Mei Chee Leong: Institute for Media Innovation, Interdisciplinary Graduate School, Nanyang Technological University, Singapore 639798, Singapore
Dilip K. Prasad: Department of Computer Science, UiT The Arctic University of Norway, 9019 Tromsø, Norway
Yong Tsui Lee: School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore 639798, Singapore
Feng Lin: School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798, Singapore

DOI: https://doi.org/10.3390/app10020557
Journal volume & issue: Vol. 10, no. 2, p. 557

Abstract

This paper introduces a fusion convolutional architecture for efficient learning of spatio-temporal features in video action recognition. Unlike 2D convolutional neural networks (CNNs), 3D CNNs can be applied directly to consecutive frames to extract spatio-temporal features. The aim of this work is to fuse the convolution layers from 2D and 3D CNNs to allow temporal encoding with fewer parameters than 3D CNNs. We adopt transfer learning from pre-trained 2D CNNs for spatial extraction, followed by temporal encoding, before connecting to 3D convolution layers at the top of the architecture. We construct our fusion architecture, semi-CNN, based on three popular models (VGG-16, ResNets and DenseNets) and compare its performance with that of the corresponding 3D models. Our empirical results on the action recognition dataset UCF-101 demonstrate that our fusion of 1D, 2D and 3D convolutions outperforms the corresponding 3D model of the same depth, with fewer parameters and reduced overfitting. Our semi-CNN architecture achieved an average boost of 16–30% in top-1 accuracy when evaluated on an input video of 16 frames.
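The parameter saving claimed in the abstract can be illustrated with simple arithmetic. The sketch below counts the learnable weights of a single convolution layer and compares a 3D convolution against a 2D spatial convolution followed by a 1D temporal convolution. The channel widths (64), the kernel sizes (3) and the reading of "temporal encoding" as a 1D convolution are assumptions made for illustration only; they are not the paper's actual layer configuration, and `conv_params` is a hypothetical helper.

```python
from math import prod


def conv_params(c_in, c_out, kernel, bias=True):
    """Learnable parameters of one convolution layer.

    kernel is a tuple of kernel sizes, one per convolved dimension,
    e.g. (3, 3) for 2D or (3, 3, 3) for 3D.
    """
    return c_in * c_out * prod(kernel) + (c_out if bias else 0)


# Assumed (illustrative) layer width: 64 channels in, 64 channels out.
c_in, c_out = 64, 64

p2d = conv_params(c_in, c_out, (3, 3))      # spatial 3x3 convolution
p1d = conv_params(c_out, c_out, (3,))       # temporal convolution of length 3
p3d = conv_params(c_in, c_out, (3, 3, 3))   # joint 3x3x3 convolution

print("2D + 1D:", p2d + p1d)   # 36928 + 12352 = 49280
print("3D:     ", p3d)         # 110656
```

Under these assumptions, the factorised 2D plus 1D pair needs fewer than half the parameters of the single 3D layer, which is the kind of saving the fusion design aims for in its lower layers.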

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci