Mathematics (Jun 2024)

Towards Human-Interactive Controllable Video Captioning with Efficient Modeling

  • Yoonseok Heo,
  • Taehoon Kim,
  • Seunghwan Kim,
  • Jungyun Seo,
  • Juae Kim

DOI
https://doi.org/10.3390/math12132037
Journal volume & issue
Vol. 12, no. 13
p. 2037

Abstract


Video captioning is the task of describing the visual scene of a given video in natural language. Several lines of research have focused on developing large-scale models in a transfer learning paradigm, with the major challenge being the tradeoff between scalability and performance in limited environments. To address this problem, we propose a simple yet effective encoder–decoder-based video captioning model that integrates transformers and CLIP, both widely adopted in the vision and language domains, together with appropriate temporal feature embedding modules. Taking this proposal a step further, we also address the challenge of human-interactive video captioning, where captions are tailored to the specific information a human desires. To design a human-interactive environment, we assume that a human offers an object or action in the video as a short prompt; the system then provides a detailed explanation regarding that prompt. We encode human prompts with an LSTM-based prompt encoder and leverage soft prompting to tune the model effectively. We extensively evaluated our model on benchmark datasets, demonstrating comparable results; in particular, on the MSR-VTT dataset we achieve state-of-the-art performance with a 4% improvement. In addition, we show the potential of human-interactive video captioning through quantitative and qualitative analysis.
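The abstract describes encoding a short human prompt with an LSTM and using the result as soft prompts that condition the captioning model. A minimal sketch of that idea, assuming PyTorch; all module names, dimensions, and the way soft prompts are prepended to frame features here are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class LSTMPromptEncoder(nn.Module):
    """Hypothetical sketch: map a short prompt (token ids) to a fixed
    number of soft prompt vectors via an LSTM, in the spirit of the
    soft prompting described in the abstract."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_soft_tokens=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Project the final hidden state into K soft prompt vectors.
        self.to_soft = nn.Linear(hidden_dim, num_soft_tokens * embed_dim)
        self.num_soft_tokens = num_soft_tokens
        self.embed_dim = embed_dim

    def forward(self, prompt_ids):
        x = self.embed(prompt_ids)        # (B, T, E)
        _, (h, _) = self.lstm(x)          # h: (num_layers, B, H)
        soft = self.to_soft(h[-1])        # (B, K * E)
        return soft.view(-1, self.num_soft_tokens, self.embed_dim)  # (B, K, E)

# Illustrative usage: prepend the soft prompts to per-frame visual
# features (e.g., CLIP frame embeddings) before the caption decoder.
enc = LSTMPromptEncoder(vocab_size=1000, embed_dim=64, hidden_dim=128)
prompt_ids = torch.randint(0, 1000, (2, 5))   # batch of 2 prompts, 5 tokens each
soft_prompts = enc(prompt_ids)                # (2, 4, 64)
video_feats = torch.randn(2, 10, 64)          # 10 frame features per video
decoder_input = torch.cat([soft_prompts, video_feats], dim=1)  # (2, 14, 64)
```

The design choice sketched here, compressing the prompt into a fixed-length block of soft tokens, keeps the decoder's input layout constant regardless of prompt length, which is one common way soft prompting is wired into a frozen or lightly tuned backbone.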

Keywords