Journal of Universal Computer Science (Sep 2024)

Recognition of Real-Time Video Activities Using Stacked Bi-GRU with Fusion-based Deep Architecture

  • Ujwala Thakur,
  • Ankit Vidyarthi,
  • Amarjeet Prajapati

DOI
https://doi.org/10.3897/jucs.113095
Journal volume & issue
Vol. 30, no. 10
pp. 1424 – 1452

Abstract


Recognizing and understanding human activities in real-time videos is a challenging task due to the complex nature of video data and the need for efficient and accurate analysis. This research introduces a robust framework for video activity recognition that leverages the power of a stacked Bidirectional Long Short-Term Memory (Bi-LSTM) and Gated Recurrent Unit (GRU) architecture, harmonized within a fusion-based deep model. The stacked Bi-LSTM-GRU model capitalizes on its dual recurrent architecture, capturing nuanced temporal dependencies within video sequences. The fusion-based deep architecture synergizes spatial and temporal features, enabling the model to discern intricate patterns in human activities. To further enhance the discriminative power of the model, we introduce a fusion module in the proposed deep architecture. The fusion module integrates multi-modal features extracted from different levels of the network hierarchy, allowing for a more comprehensive representation of video activities. We demonstrate the efficacy of our approach through rigorous experimentation on the UCF50, UCF101, and HMDB51 datasets. In experiments on the UCF50 dataset, our model achieves accuracies of 97.01% and 95.86% on the training and validation sets, respectively, showcasing its proficiency in discerning activities across a diverse range of scenarios. The evaluation extends to the UCF101 dataset, where the proposed approach achieves competitive accuracies of 97.62% and 96.93% on the training and validation sets, surpassing previous benchmarks by a margin of approximately 1%. Furthermore, on the challenging HMDB51 dataset, the model demonstrates robust accuracies of 89.71% and 88.88% on the training and validation sets, solidifying its efficacy in intricate action recognition tasks.
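To make the described architecture concrete, the following is a minimal sketch of what a stacked Bi-LSTM/Bi-GRU recognizer with feature-level fusion might look like. The abstract does not specify implementation details, so the per-frame feature dimension, hidden sizes, temporal average pooling, and concatenation-based fusion are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class StackedBiLSTMGRUFusion(nn.Module):
    """Sketch: stacked Bi-LSTM + Bi-GRU over per-frame features, with
    concatenation-based fusion of the two recurrent stages' outputs.
    Layer sizes and the 2048-dim frame features are assumptions."""

    def __init__(self, feature_dim=2048, hidden_dim=256, num_classes=101):
        super().__init__()
        # First recurrent stage: bidirectional LSTM over the frame sequence.
        self.bilstm = nn.LSTM(feature_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # Second recurrent stage: bidirectional GRU stacked on the LSTM outputs.
        self.bigru = nn.GRU(2 * hidden_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # Fusion head: classify from the concatenated multi-level representation.
        self.classifier = nn.Linear(4 * hidden_dim, num_classes)

    def forward(self, frame_features):
        # frame_features: (batch, num_frames, feature_dim), e.g. CNN features.
        lstm_out, _ = self.bilstm(frame_features)   # (batch, T, 2*hidden_dim)
        gru_out, _ = self.bigru(lstm_out)           # (batch, T, 2*hidden_dim)
        # Temporal average pooling of each stage's outputs.
        lstm_pooled = lstm_out.mean(dim=1)
        gru_pooled = gru_out.mean(dim=1)
        # Fuse features from different levels of the hierarchy by concatenation.
        fused = torch.cat([lstm_pooled, gru_pooled], dim=1)
        return self.classifier(fused)

if __name__ == "__main__":
    # Example: a batch of 8 clips, each with 16 frames of 2048-dim features.
    model = StackedBiLSTMGRUFusion()
    clips = torch.randn(8, 16, 2048)
    logits = model(clips)   # shape: (8, 101)
    print(logits.shape)
```

In this sketch the fusion step simply concatenates pooled outputs of the two recurrent stages; the paper's fusion module may combine spatial and temporal streams differently.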

Keywords