IEEE Access (Jan 2023)
Feature Fusion for Dual-Stream Cooperative Action Recognition
Abstract
Current action recognition methods fall into three main groups: RGB-based approaches, pose-based approaches (e.g., using skeleton coordinates), and multi-stream fusion methods. In this paper, we propose an action recognition framework that uses both RGB images and motion pose images to improve the accuracy of action recognition in videos. Because a single feature representation fails to capture both motion trends and image variation, it cannot reliably support action judgments in real-world scenarios. We therefore combine the appearance features of video frames with the motion variation features of the subject, so that the action itself and its appearance context jointly inform recognition. We construct video representations from local spatiotemporal features and global features, and use a ResNet backbone with the Temporal Shift Module (TSM) to extract action representations from the multi-stream input. The fusion of the streams is driven by the motion features, which yields an effective representation of motion. Experimental results on public datasets demonstrate the effectiveness of the proposed method: it achieves performance competitive with state-of-the-art techniques while keeping the model less complex and more interpretable.
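As an illustration of the temporal shift operation that TSM is built on, the following is a minimal pure-Python sketch and not the implementation used in this work. It assumes per-frame feature vectors stored as plain lists, zero padding at the clip boundaries, and a shift fraction of 1/8 per direction; the function name `temporal_shift` and the parameter `shift_div` are hypothetical names chosen for this example.

```python
def temporal_shift(frames, shift_div=8):
    """Shift a fraction of channels along the time axis.

    frames: list of T per-frame feature vectors, each a list of C floats.
    shift_div: assumed divisor; C // shift_div channels are shifted in
    each temporal direction and the remaining channels are left in place.
    """
    t = len(frames)
    c = len(frames[0])
    fold = c // shift_div
    # Positions with no source frame keep this zero padding.
    out = [[0.0] * c for _ in range(t)]
    for i in range(t):
        for j in range(c):
            if j < fold:
                src = i + 1      # first group: take the value from the next frame
            elif j < 2 * fold:
                src = i - 1      # second group: take the value from the previous frame
            else:
                src = i          # remaining channels: unchanged
            if 0 <= src < t:
                out[i][j] = frames[src][j]
    return out


# Usage: three frames with eight channels each, so one channel moves per direction.
clip = [[float(10 * f + ch) for ch in range(1, 9)] for f in range(3)]
shifted = temporal_shift(clip)
```

After the shift, each frame's feature vector mixes information from its temporal neighbours at no extra parameter cost, which is what lets a 2D backbone such as ResNet model motion across frames.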
Keywords