Applied Sciences (Mar 2025)

Video Temporal Grounding with Multi-Model Collaborative Learning

  • Yun Tian,
  • Xiaobo Guo,
  • Jinsong Wang,
  • Bin Li,
  • Shoujun Zhou

DOI
https://doi.org/10.3390/app15063072
Journal volume & issue
Vol. 15, no. 6
p. 3072

Abstract

Given an untrimmed video and a natural language query, video temporal grounding aims to locate the target segment within the video. As a bridge between computer vision and natural language processing, the task is important for advancing video comprehension. Current research focuses mainly on improving individual models and overlooks the potential of multi-model synergy. Although knowledge flow methods have been adopted for multi-model and cross-modal collaborative learning, several problems remain: knowledge transfer is unidirectional, pseudo-labels are of low quality, and cooperative training suffers from gradient conflicts. To address these issues, this research proposes a Multi-Model Collaborative Learning (MMCL) framework. Through a bidirectional knowledge transfer paradigm, MMCL lets models learn collaboratively by exchanging pseudo-labels. Pseudo-label generation is optimized using the CLIP model’s prior knowledge, which improves the accuracy and coherence of the labels while discarding extraneous temporal fragments. The framework also integrates an iterative training algorithm for multi-model collaboration that mitigates gradient conflicts through alternating optimization and dynamically balances collaborative and independent learning. Evaluations on multiple benchmark datasets indicate that MMCL markedly improves the performance of video temporal grounding models and exceeds existing state-of-the-art approaches in mIoU and Rank@1. The framework accommodates both homogeneous and heterogeneous model configurations, demonstrating its versatility. This work provides an effective approach to multi-model collaborative learning in video temporal grounding, supports efficient knowledge transfer, and opens new directions in video comprehension.
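For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how temporal IoU, mIoU, and Rank@1 at an IoU threshold are commonly computed for a single top-ranked prediction per query. It is not taken from the paper: the function names, the (start, end) segment representation in seconds, and the threshold values are illustrative assumptions.

```python
from typing import List, Tuple

# Assumption: a segment is a (start_sec, end_sec) pair with start <= end.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Segment], gts: List[Segment]) -> float:
    """mIoU: average temporal IoU of the top-1 prediction over all queries."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)


def rank1_at(preds: List[Segment], gts: List[Segment], threshold: float) -> float:
    """Rank@1 at a threshold: share of queries whose top-1 IoU reaches it."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    # Hypothetical predictions and ground truths for two queries.
    preds = [(0.0, 10.0), (0.0, 10.0)]
    gts = [(0.0, 10.0), (20.0, 30.0)]
    print(mean_iou(preds, gts))
    print(rank1_at(preds, gts, threshold=0.5))
```

Benchmarks typically report Rank@1 at several IoU thresholds; the thresholds used in this paper are not stated in the abstract.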

Keywords