Journal of Imaging (Dec 2024)

Multi-Level Feature Fusion in CNN-Based Human Action Recognition: A Case Study on EfficientNet-B7

  • Pitiwat Lueangwitchajaroen,
  • Sitapa Watcharapinchai,
  • Worawit Tepsan,
  • Sorn Sooksatra

DOI
https://doi.org/10.3390/jimaging10120320
Journal volume & issue
Vol. 10, no. 12
p. 320

Abstract


Accurate human action recognition is becoming increasingly important across various fields, including healthcare and self-driving cars. A simple approach to enhancing model performance is to incorporate additional data modalities, such as depth frames, point clouds, and skeleton information. While previous studies have predominantly used late fusion techniques to combine these modalities, our research introduces a multi-level fusion approach that combines information at the early, intermediate, and late stages together. Furthermore, recognizing the challenges of collecting multiple data types in real-world applications, our approach exploits multimodal techniques while relying solely on RGB frames as the single data source. In our work, RGB frames from the NTU RGB+D dataset served as the sole data source; from these frames, we extracted 2D skeleton coordinates and optical flow frames using pre-trained models. We evaluated our multi-level fusion approach with EfficientNet-B7 as a case study, and it demonstrated significant improvement, achieving 91.5% accuracy on the NTU RGB+D 60 dataset compared to single-modality and single-view models. Despite their simplicity, our methods are also comparable to other state-of-the-art approaches.
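To make the three fusion stages concrete, below is a minimal sketch in PyTorch of multi-level (early + intermediate + late) fusion over two streams, RGB and optical flow. The tiny convolutional backbones, layer sizes, and class names are all illustrative stand-ins for the paper's EfficientNet-B7 setup, not the authors' implementation; the skeleton stream is omitted for brevity.

```python
# Illustrative sketch of multi-level fusion: early (input-level),
# intermediate (feature-level), and late (prediction-level) fusion
# combined in one model. Backbones are placeholders for EfficientNet-B7.
import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Stand-in feature extractor (EfficientNet-B7 in the paper)."""

    def __init__(self, in_ch: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, x):
        return self.features(x)  # (B, 64) feature vector


class MultiLevelFusion(nn.Module):
    def __init__(self, num_classes: int = 60):
        super().__init__()
        # Early fusion: RGB (3 ch) and optical flow (2 ch) stacked channel-wise
        # before any convolution.
        self.early = TinyBackbone(3 + 2)
        self.early_head = nn.Linear(64, num_classes)
        # Intermediate fusion: separate per-modality streams whose feature
        # vectors are concatenated before the classifier.
        self.rgb = TinyBackbone(3)
        self.flow = TinyBackbone(2)
        self.mid_head = nn.Linear(64 * 2, num_classes)

    def forward(self, rgb, flow):
        # Early fusion: fuse raw inputs, then extract features.
        early_logits = self.early_head(self.early(torch.cat([rgb, flow], dim=1)))
        # Intermediate fusion: concatenate per-stream features.
        mid = torch.cat([self.rgb(rgb), self.flow(flow)], dim=1)
        mid_logits = self.mid_head(mid)
        # Late fusion: average the branch predictions.
        return (early_logits + mid_logits) / 2


model = MultiLevelFusion()
logits = model(torch.randn(4, 3, 64, 64), torch.randn(4, 2, 64, 64))
print(logits.shape)  # torch.Size([4, 60]) -- 60 classes in NTU RGB+D 60
```

The key design point is that each fusion level contributes a prediction path, so the model can exploit cross-modal cues both at the pixel level and at the learned-feature level before the final prediction-level combination.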

Keywords