Journal of Imaging (Sep 2024)

Unleashing the Power of Contrastive Learning for Zero-Shot Video Summarization

  • Zongshang Pang,
  • Yuta Nakashima,
  • Mayu Otani,
  • Hajime Nagahara

DOI
https://doi.org/10.3390/jimaging10090229
Journal volume & issue
Vol. 10, no. 9
p. 229

Abstract

Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing. Past efforts have invariably involved training summarization models with annotated summaries or heuristic objectives. In this work, we reveal that features pre-trained on image-level tasks contain rich semantic information that can be readily leveraged to quantify frame-level importance for zero-shot video summarization. Leveraging pre-trained features and contrastive learning, we propose three metrics that characterize a desirable keyframe: local dissimilarity, global consistency, and uniqueness. We show that these metrics capture the diversity and representativeness of frames, the criteria commonly used for the unsupervised generation of video summaries, and achieve competitive or better performance compared to past methods without any training. We further propose a contrastive learning-based pre-training strategy on unlabeled videos to enhance the quality of the proposed metrics and thus improve performance on the public benchmarks TVSum and SumMe.
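
To make the three metric names concrete, the following is a minimal sketch of how such frame-level scores might be computed from pre-extracted features. It is not the paper's formulation: the abstract does not give the definitions, so the cosine-similarity proxies below, the function name `frame_scores`, and the neighbourhood size `k` are all assumptions chosen for illustration only.

```python
import numpy as np


def frame_scores(features, k=2):
    """Illustrative proxies for three keyframe metrics (assumed, not the paper's).

    features: (T, D) array of per-frame features from any pre-trained image model.
    Returns three (T,) arrays: local dissimilarity, global consistency, uniqueness.
    """
    x = np.asarray(features, dtype=float)
    if x.ndim != 2 or x.shape[0] < 2:
        raise ValueError("expected a (T, D) array with T >= 2 frames")

    # L2-normalise so that dot products are cosine similarities.
    z = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
    sim = z @ z.T
    num_frames = sim.shape[0]
    k = min(k, num_frames - 1)

    # Assumed proxy 1: a frame is locally dissimilar when even its k most
    # similar frames (excluding itself) are not very close to it.
    sim_no_self = sim.copy()
    np.fill_diagonal(sim_no_self, -np.inf)
    nearest = np.sort(sim_no_self, axis=1)[:, -k:]
    local_dissimilarity = 1.0 - nearest.mean(axis=1)

    # Assumed proxy 2: a frame is globally consistent when it agrees with
    # the average content of the whole video.
    center = z.mean(axis=0)
    center = center / max(np.linalg.norm(center), 1e-12)
    global_consistency = z @ center

    # Assumed proxy 3: a frame is unique when its mean similarity to all
    # other frames is low.
    mean_sim_to_others = (sim.sum(axis=1) - np.diag(sim)) / (num_frames - 1)
    uniqueness = 1.0 - mean_sim_to_others

    return local_dissimilarity, global_consistency, uniqueness
```

In this sketch, a frame that differs from every other frame scores high on local dissimilarity and uniqueness but low on global consistency, which suggests why several complementary metrics are proposed rather than a single one. How the metrics are actually defined and combined is specified in the full paper, not here.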

Keywords