IEEE Access (Jan 2023)

Step by Step: A Gradual Approach for Dense Video Captioning

  • Wangyu Choi,
  • Jiasi Chen,
  • Jongwon Yoon

DOI
https://doi.org/10.1109/ACCESS.2023.3279816
Journal volume & issue
Vol. 11
pp. 51949 – 51959

Abstract

Dense video captioning aims to localize and describe events for storytelling in untrimmed videos. It is a conceptually challenging task that requires concise, relevant, and coherent captioning built on high-quality event localization. Unlike simple temporal action localization, which assumes non-overlapping events, dense video captioning must detect multiple, possibly overlapping regions in order to branch out the video story. Most existing methods either generate numerous candidate event proposals and then eliminate duplicates with an event proposal selection algorithm (e.g., non-maximum suppression), or generate event proposals directly through box prediction and binary classification mechanisms, similar to object detection. Despite these efforts, such approaches often fail to separate overlapping events into different stories, hindering high-quality captioning. In this paper, we propose SBS, a dense video captioning framework that takes a gradual approach to localizing overlapping events and ultimately produces high-quality captions. SBS first estimates the number of explicit events in each video snippet and then detects the context/activity boundaries, which provide the details needed to generate event proposals. Based on both the event counts and the boundaries, SBS generates event proposals, encodes the context of the event sequence, and finally generates sentences describing each proposal. Our framework is effective at localizing multiple/overlapping events, and experimental results show state-of-the-art performance compared with existing methods.
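For readers unfamiliar with the proposal-selection baseline mentioned in the abstract, the sketch below shows a minimal 1D temporal non-maximum suppression step over candidate event proposals. This illustrates the conventional pipeline that SBS departs from, not the SBS method itself; the proposal format (start, end, score) and the IoU threshold are assumptions for illustration.

# Minimal sketch of 1D temporal NMS over candidate event proposals.
# This is the baseline selection step the paper contrasts with, not SBS itself.

def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end) pairs."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(proposals, iou_threshold=0.7):
    """proposals: list of (start, end, score); returns the kept proposals.

    Greedy selection: keep the highest-scoring proposal, drop remaining
    proposals whose temporal IoU with it exceeds the threshold, repeat.
    Note that a hard threshold also suppresses genuinely overlapping events,
    which is the failure mode the paper targets.
    """
    remaining = sorted(proposals, key=lambda p: p[2], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [p for p in remaining
                     if temporal_iou(best, p) <= iou_threshold]
    return kept

# Example: the second proposal overlaps heavily with the first and is suppressed,
# even if it corresponds to a distinct event in the video story.
print(temporal_nms([(0.0, 12.0, 0.9), (2.0, 11.0, 0.8), (20.0, 30.0, 0.7)]))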

Keywords