Parallel Pathway Dense Video Captioning With Deformable Transformer

Wangyu Choi; Jiasi Chen; Jongwon Yoon

doi:10.1109/ACCESS.2022.3228821

IEEE Access (Jan 2022)

Affiliations

Wangyu Choi: Department of Computer Science and Engineering (Major in Bio Artificial Intelligence), Hanyang University, Ansan, South Korea
Jiasi Chen: Department of Computer Science and Engineering, University of California Riverside, Riverside, CA, USA
Jongwon Yoon: Department of Computer Science and Engineering (Major in Bio Artificial Intelligence), Hanyang University, Ansan, South Korea

Journal volume & issue: Vol. 10, pp. 129899–129910

Abstract

Dense video captioning is a challenging task because it requires a high-level understanding of the video's story as well as attention to details such as objects and motions in order to produce a consistent and fluent description of the video. Many existing solutions divide this problem into two sub-tasks, event detection and captioning, and solve them sequentially (“localize-then-describe” or the reverse). Consequently, the final outcome depends heavily on the performance of the preceding modules. In this paper, we break up this sequential approach by proposing a parallel pathway dense video captioning framework (PPVC) that localizes and describes events simultaneously without any bottlenecks. We introduce a representation organization network at the branching point of the parallel pathway to organize the encoded video features by considering the entire storyline. Then, an event localizer localizes events without any event proposal generation network, while a sentence generator describes events with attention to the fluency and coherence of the sentences. Our method has several advantages over existing work: (i) the final output does not depend on the output of the preceding modules, and (ii) it improves on existing parallel decoding methods by relieving the information bottleneck. We evaluate PPVC on two large-scale benchmark datasets, ActivityNet Captions and YouCook2. PPVC not only outperforms existing algorithms on the majority of metrics but also improves on the state-of-the-art parallel decoding method by 5.4% and 4.9% on the two datasets.
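
The data flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name below is hypothetical, the encoder, the organization step, and both heads are toy stand-ins, and no deformable transformer is involved. The only point shown is structural: the event localizer and the sentence generator both read the organized representation, and neither reads the other's output.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]  # toy stand-in for one frame-level feature vector


@dataclass
class Event:
    span: Tuple[float, float]  # normalized (start, end) within the video
    caption: str


def encode_video(frames: List[Vector]) -> List[Vector]:
    """Hypothetical encoder stand-in: copies the input features unchanged."""
    return [list(f) for f in frames]


def organize(encoded: List[Vector], num_slots: int) -> List[Vector]:
    """Hypothetical stand-in for the representation organization step.

    Splits the encoded sequence into num_slots contiguous chunks and
    mean-pools each chunk into one slot vector.
    """
    n = len(encoded)
    if not 0 < num_slots <= n:
        raise ValueError("num_slots must be between 1 and the number of frames")
    dim = len(encoded[0])
    slots = []
    for k in range(num_slots):
        chunk = encoded[k * n // num_slots:(k + 1) * n // num_slots]
        slots.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return slots


def localize(slot: Vector) -> Tuple[float, float]:
    """Toy localizer head: reads slot[0] as a center and slot[1] as a width."""
    center, width = slot[0], slot[1]
    return (max(0.0, center - width / 2), min(1.0, center + width / 2))


def describe(slot: Vector) -> str:
    """Toy sentence head: summarizes the slot instead of generating text."""
    mean = sum(slot) / len(slot)
    return f"event with mean feature {mean:.2f}"


def parallel_pathway(frames: List[Vector], num_slots: int = 2) -> List[Event]:
    # Shared trunk up to the branching point.
    slots = organize(encode_video(frames), num_slots)
    # Pathway A (localization) and pathway B (captioning) each consume the
    # same slots; neither takes the other's result as input.
    spans = [localize(s) for s in slots]
    captions = [describe(s) for s in slots]
    return [Event(span, caption) for span, caption in zip(spans, captions)]


if __name__ == "__main__":
    toy_frames = [[0.2, 0.2], [0.2, 0.2], [0.8, 0.2], [0.8, 0.2]]
    for event in parallel_pathway(toy_frames, num_slots=2):
        print(event)
```

In a sequential "localize-then-describe" design, `describe` would instead take the output of `localize` as its input, so a localization error would carry over into the caption. In this sketch the two list comprehensions could be swapped or run concurrently without changing the result.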

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
