Jisuanji kexue (Jul 2022)

Fine-grained Semantic Association Video-Text Cross-modal Entity Resolution Based on Attention Mechanism

  • ZENG Zhi-xian, CAO Jian-jun, WENG Nian-feng, JIANG Guo-quan, XU Bin

DOI
https://doi.org/10.11896/jsjkx.210500224
Journal volume & issue
Vol. 49, no. 7
pp. 106–112

Abstract


With the rapid development of mobile networks and we-media platforms, large volumes of video and text information are generated, creating an urgent demand for video-text cross-modal entity resolution. To improve the performance of video-text cross-modal entity resolution, a novel fine-grained semantic association video-text cross-modal entity resolution model based on an attention mechanism (FSAAM) is proposed. For each frame in a video, feature information is extracted by an image feature extraction network as a feature representation, which is then fine-tuned by a fully connected network and mapped to a common space. At the same time, the words in the text description are vectorized by word embedding and mapped to the same common space by a bi-directional recurrent neural network. On this basis, an adaptive fine-grained video-text semantic association method is proposed to compute the similarity between each word in the text and each frame in the video. An attention mechanism is used for weighted summation to obtain the semantic similarity between video frames and the text description, and frames with low semantic similarity to the text are filtered out to improve the model's performance. FSAAM mainly addresses the difficulty of constructing video-text semantic associations when videos contain a great deal of redundant information, text descriptions contain many words that contribute little, and the degree of association between words and frames varies. Experiments on the MSR-VTT and VATEX datasets demonstrate the superiority of the proposed method.
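To make the attention-weighted word-frame matching concrete, below is a minimal PyTorch sketch of the fine-grained similarity step described in the abstract. The tensor shapes, the softmax temperature lambda_, and the frame-filtering threshold tau are illustrative assumptions, not the paper's exact formulation; the encoders producing the common-space embeddings are replaced by random tensors.

```python
# Hedged sketch: fine-grained word-frame similarity with attention-weighted
# summation and low-similarity frame filtering (assumed hyperparameters).
import torch
import torch.nn.functional as F

def video_text_similarity(frames, words, lambda_=9.0, tau=0.0):
    """frames: (n_frames, d) frame embeddings in the common space.
    words:  (n_words, d) word embeddings in the common space.
    Returns a scalar semantic similarity between the video and the text."""
    frames = F.normalize(frames, dim=-1)
    words = F.normalize(words, dim=-1)

    # Fine-grained association: cosine similarity of every word-frame pair.
    sim = words @ frames.t()                    # (n_words, n_frames)

    # Attention over words for each frame: words that match a frame well
    # receive higher weight, so low-contribution words are down-weighted.
    attn = torch.softmax(lambda_ * sim, dim=0)  # (n_words, n_frames)
    frame_scores = (attn * sim).sum(dim=0)      # (n_frames,)

    # Filter frames with low semantic similarity to the text (redundant
    # frames), then average the remaining frame-level scores.
    kept = frame_scores[frame_scores > tau]
    if kept.numel() == 0:
        kept = frame_scores
    return kept.mean()

# Usage with random embeddings standing in for the encoder outputs.
frames = torch.randn(20, 512)  # e.g. 20 sampled frames
words = torch.randn(12, 512)   # e.g. 12 words in the caption
print(video_text_similarity(frames, words))
```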
