IEEE Access (Jan 2024)

Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition

  • Nguyen Anh Tu,
  • Nartay Aikyn,
  • Nursultan Makhanov,
  • Assanali Abu,
  • Kok-Seng Wong,
  • Min-Ho Lee

DOI
https://doi.org/10.1109/ACCESS.2024.3519254
Journal volume & issue
Vol. 12
pp. 193141–193164

Abstract

Few-shot action recognition aims to train a model to classify actions in videos using only a few examples, known as “shots,” per action class. This learning approach is particularly useful but challenging due to the limited availability of labeled video data in practice. Although significant progress has been made in developing few-shot learners, existing methods still face several limitations. Firstly, current methods have not sufficiently explored the effectiveness of 3D feature extractors (e.g., 3D CNNs or Video Transformers) and thus fail to exploit the spatiotemporal dynamics in videos. Secondly, the need for a large video dataset to train the model in a centralized manner raises privacy concerns and incurs high storage costs and communication overheads. Thirdly, existing solutions based on local deployment lack the ability to benefit from global prior knowledge drawn from a wide variety of real-world action samples. To address these limitations, we propose a federated learning (FL) framework named FedFSLAR++ to collaboratively train few-shot learners with 3D feature extractors. Specifically, we perform few-shot action recognition tasks under FL settings, enhancing privacy protection while maintaining efficient communication and storage. Moreover, FL allows us to effectively learn meta-knowledge from a large set of action videos distributed among heterogeneous clients. Within our framework, we establish a unified benchmark to systematically and fairly compare different components, including feature extraction, meta-learning, and FL algorithms for model update and aggregation; such a benchmark is still lacking in the literature. Notably, we thoroughly examine six 3D CNN and Transformer models for extracting the spatiotemporal video features needed to adapt quickly to new tasks during meta-learning. We further propose a hybrid feature extractor that combines the advantages of 3D CNNs and Transformers to produce strong video representations. Additionally, we explore three meta-learning paradigms and three FL algorithms to investigate their effectiveness and suggest the optimal choices for performance improvement. Results from extensive experiments on four action datasets verify the robustness of the FedFSLAR++ framework. Our comprehensive study provides a solid foundation for future research advancements in action recognition.
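
To make the described training loop concrete, the sketch below illustrates one federated few-shot round under simple assumptions: a toy 3D-CNN backbone, a prototypical-network episode loss standing in for the meta-learning step, and FedAvg-style weight averaging standing in for the FL aggregation step. It is not the authors' FedFSLAR++ implementation, and every class and function name here (Tiny3DCNN, client_update, fedavg, etc.) is illustrative.

    # Minimal sketch (assumptions, not the paper's code): one FedAvg-style round in
    # which each client meta-trains a small 3D-CNN feature extractor on few-shot
    # episodes with a prototypical-network loss, and the server averages the weights.
    import copy
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Tiny3DCNN(nn.Module):
        """Illustrative 3D-CNN backbone producing a clip-level embedding."""
        def __init__(self, dim=64):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),
            )
            self.proj = nn.Linear(32, dim)

        def forward(self, x):  # x: (batch, 3, frames, height, width)
            z = self.features(x).flatten(1)
            return self.proj(z)

    def prototypical_loss(model, support_x, support_y, query_x, query_y, n_way):
        """Episode loss: classify query clips by distance to class prototypes."""
        emb_s, emb_q = model(support_x), model(query_x)
        protos = torch.stack([emb_s[support_y == c].mean(0) for c in range(n_way)])
        logits = -torch.cdist(emb_q, protos)  # negative Euclidean distance
        return F.cross_entropy(logits, query_y)

    def client_update(global_state, episodes, n_way, lr=1e-3, local_steps=5):
        """One client's local meta-training, starting from the current global weights."""
        model = Tiny3DCNN()
        model.load_state_dict(global_state)
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(local_steps):
            sx, sy, qx, qy = next(episodes)  # iterator yielding few-shot episodes
            opt.zero_grad()
            prototypical_loss(model, sx, sy, qx, qy, n_way).backward()
            opt.step()
        return model.state_dict()

    def fedavg(client_states):
        """Server step: element-wise average of client model weights (FedAvg)."""
        avg = copy.deepcopy(client_states[0])
        for k in avg:
            avg[k] = torch.stack([s[k].float() for s in client_states]).mean(0)
        return avg

A full run would repeat this round: sample clients, call client_update on each with the current global weights, aggregate with fedavg, and feed the averaged state back as the next round's global model. The paper's benchmark varies exactly these pluggable pieces (the backbone, the meta-learning paradigm, and the FL aggregation rule).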

Keywords