Frontiers in Neurorobotics (Oct 2024)

LS-VIT: Vision Transformer for action recognition based on long and short-term temporal difference

  • Dong Chen,
  • Peisong Wu,
  • Mingdong Chen,
  • Mengtao Wu,
  • Tao Zhang,
  • Chuanqi Li

DOI
https://doi.org/10.3389/fnbot.2024.1457843
Journal volume & issue
Vol. 18

Abstract

Over the past few years, a growing number of researchers have focused on temporal modeling. The advent of transformer-based methods has notably advanced 2D image-based vision tasks. However, for 3D video tasks such as action recognition, applying transformers directly to video data along the temporal dimension significantly increases both computational and memory demands. This surge in resource consumption arises because the number of data patches multiplies and the cost of self-attention computation grows accordingly. Building efficient and precise 3D self-attention models for video content therefore represents a major challenge for transformers. In our research, we introduce a Long and Short-term Temporal Difference Vision Transformer (LS-VIT). This method incorporates short-term motion details into images by weighting the differences across several consecutive frames, thereby equipping the original image with the ability to model short-term motion. Concurrently, we integrate a module designed to capture long-term motion details. This module enhances the model's capacity for long-term motion modeling by directly integrating temporal differences from various segments via motion excitation. Our thorough analysis confirms that LS-VIT achieves high recognition accuracy across multiple benchmarks (e.g., UCF101, HMDB51, Kinetics-400). These results indicate that LS-VIT has potential for further optimization, which could improve its real-time performance and action prediction capabilities.
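
The short-term mechanism summarized above can be illustrated with a small sketch: weight the differences between a frame and its next few neighbours, then fold the result back into the frame. This is not the paper's implementation; the function name, the fixed weights, and the tensor layout are illustrative assumptions only.

```python
# Minimal sketch of weighted short-term temporal differences (illustrative,
# not the authors' code). Weights and layout are assumed for demonstration.
import torch

def short_term_difference(frames: torch.Tensor, weights=(0.5, 0.3, 0.2)) -> torch.Tensor:
    """frames: (T, C, H, W) clip; returns frames enriched with weighted
    differences to the following frames as a short-term motion cue."""
    T = frames.shape[0]
    enriched = frames.clone()
    for offset, w in enumerate(weights, start=1):
        # difference between each frame and the frame `offset` steps ahead
        diff = frames[offset:] - frames[:-offset]      # (T - offset, C, H, W)
        enriched[: T - offset] += w * diff             # add weighted motion information
    return enriched

if __name__ == "__main__":
    clip = torch.randn(8, 3, 224, 224)                 # 8-frame RGB clip
    print(short_term_difference(clip).shape)           # torch.Size([8, 3, 224, 224])
```

The enriched frames could then be fed to a standard 2D vision transformer backbone, which is the general idea the abstract describes for injecting short-term motion without full 3D self-attention.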

Keywords