Frontiers in Neurorobotics (Jan 2025)

Brain-inspired multimodal motion and fine-grained action recognition

  • Yuening Li
  • Xiuhua Yang
  • Changkui Chen

DOI
https://doi.org/10.3389/fnbot.2024.1502071
Journal volume & issue
Vol. 18

Abstract

Introduction: Traditional action recognition methods predominantly rely on a single modality, such as vision or motion, which presents significant limitations when dealing with fine-grained action recognition. These methods struggle particularly with video data containing complex combinations of actions and subtle motion variations.

Methods: Typically, they depend on handcrafted feature extractors or simple convolutional neural network (CNN) architectures, which makes effective multimodal fusion challenging. This study introduces a novel architecture called FGM-CLIP (Fine-Grained Motion CLIP) to enhance fine-grained action recognition. FGM-CLIP leverages the powerful capabilities of Contrastive Language-Image Pretraining (CLIP), integrating a fine-grained motion encoder and a multimodal fusion layer to achieve precise end-to-end action recognition. By jointly optimizing visual and motion features, the model captures subtle action variations, resulting in higher classification accuracy in complex video data.

Results and discussion: Experimental results demonstrate that FGM-CLIP significantly outperforms existing methods on multiple fine-grained action recognition datasets. Its multimodal fusion strategy notably improves the model's robustness and accuracy, particularly for videos with intricate action patterns.
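
The abstract does not specify how the multimodal fusion layer combines visual and motion features. As a rough illustration only, the sketch below assumes the simplest option: late fusion by concatenating an L2-normalized visual embedding with an L2-normalized motion embedding, then scoring action classes with a single linear layer and a softmax. The function names, feature sizes, and random toy inputs are all hypothetical and are not taken from FGM-CLIP.

```python
import math
import random


def l2_normalize(x):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]


def linear(x, weights, bias):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(visual_feat, motion_feat, weights, bias):
    """Assumed fusion: concatenate normalized features, then classify."""
    fused = l2_normalize(visual_feat) + l2_normalize(motion_feat)
    return softmax(linear(fused, weights, bias))


# Toy stand-ins: random vectors in place of real encoder outputs.
rng = random.Random(0)
visual_feat = [rng.gauss(0, 1) for _ in range(8)]  # placeholder for a CLIP image embedding
motion_feat = [rng.gauss(0, 1) for _ in range(4)]  # placeholder for a motion-encoder output
num_classes = 3
weights = [[rng.gauss(0, 0.1) for _ in range(12)] for _ in range(num_classes)]
bias = [0.0] * num_classes

probs = fuse_and_classify(visual_feat, motion_feat, weights, bias)
print(probs)  # one probability per hypothetical action class
```

In an end-to-end setup such as the one the abstract describes, the encoders and the fusion weights would be trained jointly; here the weights are random, so the output is only a shape and sanity check.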

Keywords