Tongxin xuebao (Feb 2020)
Video description method based on multidimensional and multimodal information
Abstract
To address the problem of complex information representation in automatic video description tasks, a multi-dimensional and multi-modal visual feature extraction and fusion method is proposed. First, multi-dimensional features, such as the static and dynamic attributes of the video sequence, are extracted by transfer learning, and an image description algorithm is used to extract the semantic information of the key frames in the video. Together, these steps complete the video feature extraction. Then, multi-layer long short-term memory (LSTM) networks are used to fuse the multi-dimensional and multi-modal information and to generate a language description of the video content. Experimental simulation results show that the proposed method achieves better results than existing methods on the automatic video description task.
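
The fusion step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the layers are wired, so the two-layer arrangement below (layer 1 reads concatenated static and dynamic features per frame, layer 2 reads the layer-1 state together with a key-frame semantic vector) is an assumption, as are the names `LSTMCell` and `fuse_video_features`, the feature sizes, and the random untrained weights. It uses only the Python standard library.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """Plain LSTM cell with random, untrained weights (illustration only)."""

    def __init__(self, input_size, hidden_size, rng):
        n = input_size + hidden_size
        # One weight matrix and bias per gate: input, forget, candidate, output.
        self.W = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                   for _ in range(hidden_size)]
            for gate in "ifgo"
        }
        self.b = {gate: [0.0] * hidden_size for gate in "ifgo"}

    def _affine(self, gate, z):
        return [sum(w * v for w, v in zip(row, z)) + bias
                for row, bias in zip(self.W[gate], self.b[gate])]

    def step(self, x, h, c):
        z = x + h  # concatenate the input with the previous hidden state
        i = [sigmoid(v) for v in self._affine("i", z)]
        f = [sigmoid(v) for v in self._affine("f", z)]
        g = [math.tanh(v) for v in self._affine("g", z)]
        o = [sigmoid(v) for v in self._affine("o", z)]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new


def fuse_video_features(static_seq, dynamic_seq, semantic_vec,
                        hidden_size=4, seed=0):
    """Fuse per-frame visual features with a key-frame semantic vector.

    Assumed wiring (hypothetical): layer 1 encodes the visual features,
    layer 2 mixes the layer-1 state with the semantic vector.
    """
    rng = random.Random(seed)
    visual_dim = len(static_seq[0]) + len(dynamic_seq[0])
    layer1 = LSTMCell(visual_dim, hidden_size, rng)
    layer2 = LSTMCell(hidden_size + len(semantic_vec), hidden_size, rng)

    h1, c1 = [0.0] * hidden_size, [0.0] * hidden_size
    h2, c2 = [0.0] * hidden_size, [0.0] * hidden_size
    for static_feat, dynamic_feat in zip(static_seq, dynamic_seq):
        h1, c1 = layer1.step(static_feat + dynamic_feat, h1, c1)
        h2, c2 = layer2.step(h1 + semantic_vec, h2, c2)
    return h2  # fused representation that a caption decoder could consume
```

In a full system the fused vector would feed a trained language decoder, and the static, dynamic, and semantic inputs would come from pretrained networks rather than hand-written lists.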