IEEE Access (Jan 2019)

Spatiotemporal Relation Networks for Video Action Recognition

  • Zheng Liu
  • Haifeng Hu

DOI
https://doi.org/10.1109/ACCESS.2019.2894025
Journal volume & issue
Vol. 7
pp. 14969 – 14976

Abstract

Two-stream convolutional networks have shown strong performance in video action recognition because they capture spatial and temporal features simultaneously. However, computing optical flow is time-consuming, which prevents their use in real-time video processing. To address this problem, this paper proposes a new end-to-end architecture called SpatioTemporal Relation Networks (STRN), which extracts spatial and temporal information simultaneously from video using only RGB input. STRN consists of two branches, called the appearance stream and the motion stream. The appearance stream retains the structure of the original spatial stream in the two-stream architecture but takes consecutive frames as input instead of a single frame. The motion stream takes as input the relation information between adjacent features in the appearance stream and can effectively complement the appearance stream. This relation information is extracted from the appearance stream by a relation block. Because STRN learns spatiotemporal information from RGB input alone, it avoids the calculation of optical flow. We validate STRN on UCF-101 and HMDB-51 and achieve better performance.
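The data flow between the two streams can be illustrated with a minimal sketch. The abstract does not specify how the relation block computes relation information, so the element-wise difference between adjacent frame features used here is an assumption made purely for illustration, not the paper's method. The function name `relation_block` and the toy feature values are likewise hypothetical, and each frame's feature map is flattened to a short vector for simplicity.

```python
from typing import List, Sequence


def relation_block(features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Hypothetical relation extractor (assumption, not the paper's definition).

    Takes one feature vector per consecutive frame from the appearance
    stream and returns one relation vector per adjacent pair, computed
    here as an element-wise difference.
    """
    if len(features) < 2:
        raise ValueError("need features from at least two consecutive frames")
    if any(len(f) != len(features[0]) for f in features):
        raise ValueError("all frame features must have the same length")
    return [
        [curr_v - prev_v for prev_v, curr_v in zip(prev, curr)]
        for prev, curr in zip(features, features[1:])
    ]


# Toy appearance-stream features for three consecutive RGB frames.
appearance = [[0.0, 1.0], [0.5, 1.0], [2.0, 0.0]]

# Input to the motion stream: one relation vector per adjacent frame pair.
relation = relation_block(appearance)
print(relation)
```

With T input frames this yields T - 1 relation vectors, so the motion stream in this sketch sees only quantities derived from RGB features and never an optical-flow field.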

Keywords