IEEE Access (Jan 2019)
Spatiotemporal Relation Networks for Video Action Recognition
Abstract
Two-stream convolutional networks have shown strong performance in a video action recognition task for its ability to capture spatial and temporal features simultaneously. However, the calculation of optical flow is time-consuming and it cannot be applied to the real-time processing of video. To address this problem, this paper proposes a new end-to-end architecture called SpatioTemporal Relation Networks (STRN) to extract spatial information and temporal information simultaneously from the video with the only RGB input. STRN consist of two branches, called appearance stream and motion stream, respectively. Appearance stream retains the structure of the original spatial stream in the two-stream architecture with the input of consecutive frames instead of a single frame. Motion stream, which takes relation information between the adjacent features in the appearance stream as an input, can effectively complement appearance stream. A relation block is an extractor which is used to extract relation information from the appearance stream. STRN can learn spatiotemporal information from the video with the only RGB input, which avoids the calculation of optical flow. We validate the STRN on UCF-101 and HMDB-51 and achieve better performance.
Keywords