PWS-DVC: Enhancing Weakly Supervised Dense Video Captioning With Pretraining Approach

Wangyu Choi; Jiasi Chen; Jongwon Yoon

doi:10.1109/ACCESS.2023.3331756

IEEE Access (Jan 2023)


Affiliations

Wangyu Choi: Department of Computer Science and Engineering, Hanyang University, Ansan, South Korea
Jiasi Chen: Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA
Jongwon Yoon: Department of Computer Science and Engineering, Hanyang University, Ansan, South Korea

DOI: https://doi.org/10.1109/ACCESS.2023.3331756
Journal volume & issue: Vol. 11, pp. 128162–128174

Abstract

Efforts to jointly understand vision and language have increased notably in recent years, driven by the availability of video-related datasets and by advances in language models in natural language processing. Dense video captioning is a challenging task that requires understanding an untrimmed video and generating several event-based sentences to describe it. Many approaches have been proposed to improve dense video captioning, including bottom-up, top-down, parallel-pipeline, and pretraining methods. In contrast to these widely used approaches, weakly supervised dense video captioning is a highly promising strategy that generates dense video captions from captions alone, without relying on any knowledge of ground-truth events. This approach has a drawback, however: inadequate captions can hurt both event localization and captioning. This paper introduces PWS-DVC, a novel approach aimed at improving the performance of weakly supervised dense video captioning. PWS-DVC’s event captioning module is first trained on video-clip datasets, which are widely available, taking advantage of the fact that no ground-truth event data is needed during training. The module is then fine-tuned specifically for dense video captioning. To demonstrate the efficacy of PWS-DVC, we conduct comparative experiments against state-of-the-art methods on the ActivityNet Captions dataset. The results indicate that PWS-DVC achieves improved performance over current approaches to weakly supervised dense video captioning.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
