LS-VIT: Vision Transformer for action recognition based on long and short-term temporal difference

Dong Chen; Peisong Wu; Mingdong Chen; Mengtao Wu; Tao Zhang; Chuanqi Li

doi:10.3389/fnbot.2024.1457843

Frontiers in Neurorobotics (Oct 2024)

Affiliations

Dong Chen: College of Physics and Electronic Engineering, Nanning Normal University, Nanning, China
Dong Chen: College of Computer Science and Engineering, Guangxi Normal University, Guilin, China
Dong Chen: Guangxi Key Laboratory of Functional Information Materials and Intelligent Information Processing, Nanning, China
Peisong Wu: College of Physics and Electronic Engineering, Nanning Normal University, Nanning, China
Peisong Wu: Guangxi Key Laboratory of Functional Information Materials and Intelligent Information Processing, Nanning, China
Mingdong Chen: College of Physics and Electronic Engineering, Nanning Normal University, Nanning, China
Mingdong Chen: Guangxi Key Laboratory of Functional Information Materials and Intelligent Information Processing, Nanning, China
Mengtao Wu: College of Physics and Electronic Engineering, Nanning Normal University, Nanning, China
Mengtao Wu: Guangxi Key Laboratory of Functional Information Materials and Intelligent Information Processing, Nanning, China
Tao Zhang: College of Physics and Electronic Engineering, Nanning Normal University, Nanning, China
Tao Zhang: Guangxi Key Laboratory of Functional Information Materials and Intelligent Information Processing, Nanning, China
Chuanqi Li: College of Computer Science and Engineering, Guangxi Normal University, Guilin, China

DOI: https://doi.org/10.3389/fnbot.2024.1457843
Journal volume & issue: Vol. 18

Abstract

In recent years, a growing number of researchers have focused on temporal modeling. The advent of transformer-based methods has notably advanced 2D image-based vision tasks. However, for 3D video tasks such as action recognition, applying transformers directly to video data significantly increases both computational and memory demands. This surge in resource consumption stems from the multiplied number of patches and the added complexity of the self-attention computations. Accordingly, building efficient and precise 3D self-attention models for video content remains a major challenge for transformers. In our research, we introduce a Long and Short-term Temporal Difference Vision Transformer (LS-VIT). This method incorporates short-term motion details into images by weighting the differences across several consecutive frames, thereby equipping the original image with the ability to model short-term motion. Concurrently, we integrate a module designed to capture long-term motion details. This module enhances the model's capacity for long-term motion modeling by directly integrating temporal differences from various segments via motion excitation. Our analysis confirms that LS-VIT achieves high recognition accuracy across multiple benchmarks (e.g., UCF101, HMDB51, Kinetics-400). These results indicate that LS-VIT has potential for further optimization, which could improve real-time performance and action prediction capabilities.
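The short-term idea in the abstract, weighting the differences across consecutive frames and folding them into a single image, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `short_term_difference` and `self_attention_pairs`, the uniform default weights, and the choice of the middle frame as the key frame are all assumptions made for illustration. The second helper only shows why token counts matter: under full self-attention, the number of token pairs grows with the square of frames times patches.

```python
import numpy as np


def short_term_difference(frames, weights=None):
    """Fold weighted consecutive-frame differences into the middle frame.

    frames  : array of shape (T, H, W, C), T >= 2 consecutive frames.
    weights : optional array of shape (T-1,), one weight per difference.
              Uniform weights are an assumption, not taken from the paper.
    """
    frames = np.asarray(frames, dtype=np.float32)
    t = frames.shape[0]
    if t < 2:
        raise ValueError("need at least two frames")

    # Differences between each frame and its predecessor: (T-1, H, W, C).
    diffs = frames[1:] - frames[:-1]

    if weights is None:
        weights = np.full(t - 1, 1.0 / (t - 1), dtype=np.float32)
    weights = np.asarray(weights, dtype=np.float32)
    if weights.shape != (t - 1,):
        raise ValueError("weights must have shape (T-1,)")

    # Weighted sum over the temporal axis gives one motion map: (H, W, C).
    motion = np.tensordot(weights, diffs, axes=(0, 0))

    # Hypothetical choice: treat the middle frame as the key frame.
    key_frame = frames[t // 2]
    return key_frame + motion


def self_attention_pairs(num_frames, patches_per_frame):
    """Token pairs compared by full self-attention over all video patches."""
    tokens = num_frames * patches_per_frame
    return tokens * tokens


# Five 2x2 RGB frames whose intensity rises by 1 per frame.
clip = np.stack([np.full((2, 2, 3), i, dtype=np.float32) for i in range(5)])
enhanced = short_term_difference(clip)
```

With 196 patches per frame, `self_attention_pairs(8, 196)` is 64 times `self_attention_pairs(1, 196)`. That quadratic growth is the kind of cost that motivates injecting motion into a single image rather than attending over every frame.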

Published in Frontiers in Neurorobotics

ISSN: 1662-5218 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Medicine: Internal medicine: Neurosciences. Biological psychiatry. Neuropsychiatry
Website: https://www.frontiersin.org/journals/neurorobotics/
