IEEE Access (Jan 2021)

Complete Video-Level Representations for Action Recognition

  • Min Li,
  • Ruwen Bai,
  • Bo Meng,
  • Junxing Ren,
  • Miao Jiang,
  • Yang Yang,
  • Linghan Li,
  • Hong Du

DOI
https://doi.org/10.1109/ACCESS.2021.3058998
Journal volume & issue
Vol. 9
pp. 92134 – 92142

Abstract

In most existing work on activity recognition, 3D ConvNets show promising performance for learning spatiotemporal features from videos. However, most methods sample fixed-length frame sequences from the original video, crop them to a fixed size, and feed them into the model for training. Two problems with this scheme limit recognition performance. First, the cropped video clips are incomplete or even distorted in appearance, leaving a large gap between the feature representation and the semantics of the human activity. Second, the useful features of longer frame sequences are weakened by the repeated stacking of 3D convolutions in deep networks, owing to the limits of GPU memory and computing power. This article proposes a method based on a 3D backbone network for multiscale spatial feature representation, which uses a pyramid pooling layer to accept video frames at different scales and then aggregates short-term spatiotemporal features into a long-term video-level representation. Object detection is used as a component at test time to explore how activity recognition improves when the large amount of spatiotemporal redundancy in real-life videos is taken into account. Experiments on the widely used UCF101 video dataset show that the proposed method achieves competitive performance.
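The two ideas sketched in the abstract — pyramid pooling to map inputs of any spatial size to a fixed-length vector, and aggregating short-term clip features into one video-level representation — can be illustrated in a minimal form. The pyramid levels {1, 2, 4}, the choice of max pooling, the bin boundaries, and mean aggregation below are illustrative assumptions, not the paper's exact configuration:

```python
# Hypothetical sketch: spatial pyramid pooling (SPP) over one channel of a
# 2D feature map, followed by mean aggregation of clip vectors. The pyramid
# levels, pooling ops, and helper names are assumptions for illustration.

def spp_max(feature_map, levels=(1, 2, 4)):
    """Pool an H x W map into a fixed-length vector of sum(n*n) values,
    regardless of H and W -- this is what lets the model accept frames
    at different scales."""
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                # Bin [r0:r1) x [c0:c1); guarantee each bin is non-empty.
                r0 = i * h // n
                r1 = max((i + 1) * h // n, r0 + 1)
                c0 = j * w // n
                c1 = max((j + 1) * w // n, c0 + 1)
                out.append(max(feature_map[r][c]
                               for r in range(r0, r1)
                               for c in range(c0, c1)))
    return out

def video_level(clip_vectors):
    """Aggregate short-term clip features into one long-term video-level
    representation by averaging each dimension across clips."""
    k = len(clip_vectors)
    dim = len(clip_vectors[0])
    return [sum(v[d] for v in clip_vectors) / k for d in range(dim)]
```

With levels (1, 2, 4), any input map yields a 1 + 4 + 16 = 21-dimensional vector, so feature maps of different spatial sizes become directly comparable before aggregation.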

Keywords