Advances in Electrical and Computer Engineering (Aug 2023)
Video Moment Localization Network Based on Text Multi-semantic Clues Guidance
Abstract
With the rapid development of the Internet and information technology, people can create multimedia data such as pictures and videos anytime and anywhere, so efficient tools are needed to process the resulting vast amount of video data. The video moment localization task aims to locate the moment in an untrimmed video that best matches a text query. Existing text-guided methods consider only single-scale text features, which cannot fully represent the semantics of the text; moreover, when text is used to guide the extraction of video features, they ignore the fact that the text information may mask crucial information in the video. To solve the above problems, we propose a video moment localization network based on text multi-semantic clue guidance. Specifically, we first design a text encoder based on a fusion gate to better capture the semantic information of the text through multi-semantic clues composed of word embeddings, local features, and global features. A text guidance module then uses the text semantic features to guide the extraction of video features, highlighting the video features related to the text semantics. Experimental results on two datasets, Charades-STA and ActivityNet Captions, show that our approach provides significant improvements over state-of-the-art methods.
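The fusion-gate idea described above can be sketched as follows. This is a minimal illustration, not the paper's actual encoder: the function and parameter names (`gated_fusion`, `W_g`, `b_g`) are hypothetical, the real model would use learned features (e.g. from a recurrent or transformer encoder), and the exact gating formula is an assumption. The sketch shows the general pattern of combining word embeddings, local features, and a global feature through an element-wise sigmoid gate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(word_emb, local_feat, global_feat, W_g, b_g):
    """Fuse three text semantic clues with an element-wise gate.

    word_emb, local_feat : (T, d) per-word clues
    global_feat          : (d,) sentence-level clue, broadcast over T
    W_g, b_g             : hypothetical learned gate parameters
    """
    glob = np.broadcast_to(global_feat, word_emb.shape)      # (T, d)
    gate_in = np.concatenate([word_emb, local_feat, glob], axis=-1)  # (T, 3d)
    gate = sigmoid(gate_in @ W_g + b_g)                      # (T, d), values in (0, 1)
    # Gate interpolates between local (word-level) and global (sentence-level) clues.
    return gate * local_feat + (1.0 - gate) * glob

# Toy usage with random stand-in features.
T, d = 8, 16
rng = np.random.default_rng(0)
fused = gated_fusion(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                     rng.normal(size=(d,)),
                     0.1 * rng.normal(size=(3 * d, d)), np.zeros(d))
```

The fused output keeps one vector per word, so downstream modules (such as the text guidance module) can attend over it at the same granularity as the original word embeddings.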
Keywords