IET Image Processing (Dec 2024)
Hierarchical multi‐modal video summarization with dynamic sampling
Abstract
Previous video summarization methods have often neglected inter‐frame variation during preprocessing. Sampling repeated frames introduces information redundancy, while missing key frames can distort semantic comprehension and make the generated summaries inaccurate. This work proposes a dynamic sampling module that uses frame‐level motion information to alleviate both problems. The module samples at a higher frequency during intervals with significant change, capturing finer detail where the content varies most. The module is combined with a hierarchical multi‐modal structure that integrates shot‐level visual and textual information, which enhances the semantic understanding of video clips and improves the accuracy of the summarized content. Extensive experiments on the SumMe and TVSum benchmark datasets demonstrate the effectiveness of the proposed method.
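The abstract does not specify how motion is measured or how the sampling rate is chosen, so the following is only a minimal sketch of the general idea of motion-adaptive sampling, not the authors' module. It assumes motion can be approximated by the mean absolute difference between consecutive frames, and it switches between two fixed strides. The names `motion_scores`, `dynamic_sample`, `base_stride`, `dense_stride`, and `threshold` are hypothetical and chosen for illustration.

```python
import numpy as np


def motion_scores(frames: np.ndarray) -> np.ndarray:
    """Approximate per-frame motion as the mean absolute difference to the
    previous frame. The first frame has no predecessor and gets a score of 0.
    (Assumption: simple frame differencing stands in for the paper's
    frame-level motion information.)"""
    frames = frames.astype(np.float32)
    pixel_axes = tuple(range(1, frames.ndim))
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=pixel_axes)
    return np.concatenate([[0.0], diffs])


def dynamic_sample(frames: np.ndarray, base_stride: int = 4,
                   dense_stride: int = 1, threshold: float = 5.0) -> list:
    """Return indices of sampled frames.

    The sampler advances by `dense_stride` when any of the next
    `base_stride` frames shows motion above `threshold`, and by
    `base_stride` otherwise. Both strides and the threshold are
    illustrative values, not values taken from the paper."""
    scores = motion_scores(frames)
    selected = []
    i = 0
    while i < len(frames):
        selected.append(i)
        window = scores[i + 1 : i + 1 + base_stride]
        has_motion = window.size > 0 and window.max() > threshold
        i += dense_stride if has_motion else base_stride
    return selected


if __name__ == "__main__":
    # Synthetic clip: 8 static frames, 8 changing frames, 8 static frames.
    clip = np.zeros((24, 4, 4), dtype=np.float32)
    for k in range(8, 16):
        clip[k] = k * 10.0
    print(dynamic_sample(clip))
```

On the synthetic clip, the sampler keeps every frame in the changing interval and skips most frames in the static intervals, which is the behaviour the abstract describes in qualitative terms.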
Keywords