Jisuanji kexue (Jul 2022)

Fine-grained Semantic Association Video-Text Cross-modal Entity Resolution Based on Attention Mechanism

  • ZENG Zhi-xian, CAO Jian-jun, WENG Nian-feng, JIANG Guo-quan, XU Bin

DOI
https://doi.org/10.11896/jsjkx.210500224
Journal volume & issue
Vol. 49, no. 7
pp. 106–112

Abstract


With the rapid development of mobile networks and we-media platforms, large volumes of video and text information are generated, creating an urgent demand for video-text cross-modal entity resolution. To improve the performance of video-text cross-modal entity resolution, a novel fine-grained semantic association video-text cross-modal entity resolution model based on an attention mechanism (FSAAM) is proposed. For each frame in a video, feature information is extracted by an image feature extraction network as a feature representation, which is then fine-tuned by a fully connected network and mapped to a common space. At the same time, the words in the text description are vectorized by word embedding and mapped to the same common space by a bi-directional recurrent neural network. On this basis, an adaptive fine-grained video-text semantic association method is proposed to compute the similarity between each word in the text and each frame in the video. An attention mechanism is used for weighted summation to obtain the semantic similarity between video frames and the text description, and frames with low semantic similarity to the text are filtered out to improve the model's performance. FSAAM mainly addresses the difficulty of constructing video-text semantic associations when videos contain a great deal of redundant information, text descriptions contain many words that contribute little, and the degree of association between words and frames varies. Experiments on the MSR-VTT and VATEX datasets demonstrate the superiority of the proposed method.
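To make the attention-weighted word-frame matching concrete, below is a minimal PyTorch sketch of the fine-grained similarity step described in the abstract. The tensor shapes, the softmax temperature lambda_, and the frame-filtering threshold tau are illustrative assumptions, not the paper's exact formulation; the encoders producing the common-space embeddings are replaced by random tensors.

```python
# Hedged sketch: fine-grained word-frame similarity with attention-weighted
# summation and low-similarity frame filtering (assumed hyperparameters).
import torch
import torch.nn.functional as F

def video_text_similarity(frames, words, lambda_=9.0, tau=0.0):
    """frames: (n_frames, d) frame embeddings in the common space.
    words:  (n_words, d) word embeddings in the common space.
    Returns a scalar semantic similarity between the video and the text."""
    frames = F.normalize(frames, dim=-1)
    words = F.normalize(words, dim=-1)

    # Fine-grained association: cosine similarity of every word-frame pair.
    sim = words @ frames.t()                    # (n_words, n_frames)

    # Attention over words for each frame: words that match a frame well
    # receive higher weight, so low-contribution words are down-weighted.
    attn = torch.softmax(lambda_ * sim, dim=0)  # (n_words, n_frames)
    frame_scores = (attn * sim).sum(dim=0)      # (n_frames,)

    # Filter frames with low semantic similarity to the text (redundant
    # frames), then average the remaining frame-level scores.
    kept = frame_scores[frame_scores > tau]
    if kept.numel() == 0:
        kept = frame_scores
    return kept.mean()

# Usage with random embeddings standing in for the encoder outputs.
frames = torch.randn(20, 512)  # e.g. 20 sampled frames
words = torch.randn(12, 512)   # e.g. 12 words in the caption
print(video_text_similarity(frames, words))
```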
