IET Image Processing (Dec 2024)

Hierarchical multi‐modal video summarization with dynamic sampling

  • Lingjian Yu,
  • Xing Zhao,
  • Liang Xie,
  • Haoran Liang,
  • Ronghua Liang

DOI
https://doi.org/10.1049/ipr2.13269
Journal volume & issue
Vol. 18, no. 14
pp. 4577 – 4588

Abstract

Previous video summarization methods often neglect inter‐frame variations during preprocessing. Sampling repeated frames introduces information redundancy, while missing key frames distorts semantic comprehension and leads to inaccurate summaries. This work proposes a dynamic sampling module that uses frame‐level motion information to alleviate these issues. The module samples at a higher frequency during intervals with significant change, capturing finer detail. Combined with a hierarchical multi‐modal structure, it integrates shot‐level visual and textual information to enhance the semantic understanding of video clips and improve the accuracy of the summarized content. Extensive experiments on the benchmark datasets SumMe and TVSum demonstrate the effectiveness of the proposed method.
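
To make the idea of motion‐driven sampling concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how motion is measured or how the sampling rate is set, so the sketch assumes simple frame differencing as the motion signal and a cumulative "motion budget" rule for choosing frames. The function names and the parameters `budget` and `max_gap` are hypothetical.

```python
import numpy as np


def motion_scores(frames: np.ndarray) -> np.ndarray:
    """Per-frame motion as mean absolute difference to the previous frame.

    Assumption: frame differencing stands in for the paper's motion
    information, which the abstract does not define. `frames` has shape
    (num_frames, ...); the score of frame 0 is defined as 0.
    """
    frames = frames.astype(np.float32)
    diffs = np.abs(frames[1:] - frames[:-1])
    # Average over every axis except the time axis.
    per_frame = diffs.mean(axis=tuple(range(1, diffs.ndim)))
    return np.concatenate([[0.0], per_frame])


def dynamic_sample(scores, budget: float, max_gap: int) -> list:
    """Select frame indices, densely where motion is high.

    A frame is kept once the motion accumulated since the last kept frame
    reaches `budget`, or once `max_gap` frames have passed, so static
    intervals are still sampled at a low base rate.
    (`budget` and `max_gap` are hypothetical parameters.)
    """
    selected = [0]
    accumulated = 0.0
    for i in range(1, len(scores)):
        accumulated += float(scores[i])
        if accumulated >= budget or i - selected[-1] >= max_gap:
            selected.append(i)
            accumulated = 0.0
    return selected
```

With this rule, a static interval contributes roughly one frame per `max_gap` frames, while an interval whose per‐frame motion is at or above `budget` is sampled at every frame. A real system would likely replace the differencing step with optical flow or a learned motion feature.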

Keywords