Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph

Shixing Han; Jin Liu; Jinyingming Zhang; Peizhu Gong; Xiliang Zhang; Huihua He

doi:10.1007/s40747-023-00998-5

Complex & Intelligent Systems (Feb 2023)

Affiliations

Shixing Han: College of Information Engineering, Shanghai Maritime University
Jin Liu: College of Information Engineering, Shanghai Maritime University
Jinyingming Zhang: College of Information Engineering, Shanghai Maritime University
Peizhu Gong: College of Information Engineering, Shanghai Maritime University
Xiliang Zhang: College of Information Engineering, Shanghai Maritime University
Huihua He: College of Early Childhood Education, Shanghai Normal University

Journal volume & issue: Vol. 9, no. 5, pp. 4995–5012

Abstract

Dense video captioning (DVC) aims to generate a description for each scene in a video. Despite notable progress on this task, previous works usually concentrate on exploiting visual features while neglecting the audio information in the video, which results in inaccurate event localization. In this article, we propose a novel DVC model named CMCR, which is mainly composed of a cross-modal processing (CM) module and a commonsense reasoning (CR) module. CM utilizes a cross-modal attention mechanism to encode data in different modalities. An event refactoring algorithm is proposed to deal with the inaccurate event localization caused by overlapping events. In addition, a shared encoder is utilized to reduce model redundancy. CR optimizes the logic of the generated captions using both heterogeneous prior knowledge and reasoning over entity associations, achieved by building a knowledge-enhanced unbiased scene graph. Extensive experiments are conducted on the ActivityNet Captions dataset, and the results demonstrate that our model achieves better performance than state-of-the-art methods. To better understand the performance achieved by CMCR, we also conduct ablation experiments to analyze the contributions of the different modules.
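
As an illustration of the cross-modal attention idea mentioned in the abstract, the sketch below shows generic scaled dot-product cross-attention, in which feature vectors from one modality (here audio) attend over feature vectors from another (here visual). This is not taken from the paper: the function names, the toy feature values, and the choice of scaled dot-product scoring are all assumptions made for illustration, and the CM module in CMCR may differ in its details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (illustrative, not the paper's exact form).

    queries: feature vectors from one modality (e.g. audio steps)
    keys, values: feature vectors from the other modality (e.g. video frames)
    Returns one attended vector per query.
    """
    dim = len(keys[0])
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(feature dimension)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        attended.append(out)
    return attended


# Hypothetical toy features: 2 audio steps and 3 visual frames, dimension 2
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_given_visual = cross_modal_attention(audio, visual, visual)
```

Each row of `audio_given_visual` is a convex combination of the visual features, weighted toward the frames most similar to the corresponding audio step. Swapping the argument roles gives visual features conditioned on audio.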

Published in Complex & Intelligent Systems

ISSN: 2199-4536 (Print); 2198-6053 (Online)
Publisher: Springer
Country of publisher: Switzerland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science; Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: https://www.springer.com/journal/40747