Scientific Reports (Sep 2023)

Semantic guidance network for video captioning

  • Lan Guo,
  • Hong Zhao,
  • ZhiWen Chen,
  • ZeYu Han

DOI
https://doi.org/10.1038/s41598-023-43010-3
Journal volume & issue
Vol. 13, no. 1
pp. 1 – 19

Abstract

Video captioning is a challenging task that aims to generate rich natural language descriptions of video content, and it has become a promising direction in artificial intelligence. However, most existing methods tend to overlook the problems of visual information redundancy and scene information omission caused by limited sampling strategies. To address these problems, a semantic guidance network for video captioning is proposed. More specifically, a novel scene frame sampling strategy is first proposed to select key scene frames. Then, a vision transformer encoder is applied to learn visual and semantic information from a global view, alleviating the information loss that arises when modeling long-range dependencies in the encoder's hidden layers. Finally, a non-parametric metric learning module is introduced to compute the similarity between the ground truth and the predicted caption, and the model is optimized in an end-to-end manner. Experiments on the benchmark MSR-VTT and MSVD datasets show that the proposed method effectively improves description accuracy and generalization ability.
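
The abstract does not detail how key scene frames are chosen, so the following is only a minimal sketch of one plausible scene-aware sampling heuristic: keep a frame whenever it differs sufficiently from the last kept frame, instead of sampling at fixed intervals. The function name, threshold, and frame-difference criterion are assumptions for illustration, not the authors' actual strategy.

import numpy as np

def sample_scene_frames(frames, max_frames=8, diff_threshold=0.1):
    # Hypothetical scene-change heuristic (not the paper's method):
    # keep a frame when its mean absolute pixel difference from the
    # last kept frame exceeds a threshold, up to max_frames frames.
    # frames: array of shape (T, H, W, C) with values in [0, 1].
    selected = [0]  # always keep the first frame as the initial scene
    for t in range(1, len(frames)):
        diff = np.abs(frames[t] - frames[selected[-1]]).mean()
        if diff > diff_threshold:
            selected.append(t)
        if len(selected) == max_frames:
            break
    return selected

# Toy usage: 32 synthetic "frames" of size 224x224x3
video = np.random.rand(32, 224, 224, 3)
print(sample_scene_frames(video))

Compared with uniform-interval sampling, a difference-based rule of this kind keeps at most one frame per visually similar segment, which is one way to reduce the redundancy and scene omission the abstract refers to.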