Watermark and Trademark Prompts Boost Video Action Recognition in Visual-Language Models

Longbin Jin; Hyuntaek Jung; Hyo Jin Jon; Eun Yi Kim

doi:10.3390/math13091365

Mathematics (Apr 2025)

Affiliations

Longbin Jin: Artificial Intelligence & Computer Vision Lab., Konkuk University, Seoul 05029, Republic of Korea
Hyuntaek Jung: Artificial Intelligence & Computer Vision Lab., Konkuk University, Seoul 05029, Republic of Korea
Hyo Jin Jon: Artificial Intelligence & Computer Vision Lab., Konkuk University, Seoul 05029, Republic of Korea
Eun Yi Kim: Artificial Intelligence & Computer Vision Lab., Konkuk University, Seoul 05029, Republic of Korea

Journal volume & issue: Vol. 13, no. 9, p. 1365

Abstract

Large-scale Visual-Language Models have demonstrated powerful adaptability in video recognition tasks. However, existing methods typically rely on fine-tuning or text prompt tuning. In this paper, we propose a visual-only prompting method that employs watermark and trademark prompts to bridge the distribution gap between spatio-temporal video data and Visual-Language Models. Our watermark prompts, produced by a trainable prompt generator, are customized for each video clip. Unlike conventional visual prompts, which often exhibit noise-like patterns, watermark prompts are intentionally designed to be imperceptible, so that they are not misinterpreted as an adversarial attack. The trademark prompts, tailored to each video domain, establish the identity of specific video types. Integrating watermark prompts into video frames and prepending trademark prompts to per-frame embeddings significantly boosts the ability of the Visual-Language Model to understand video. Notably, our approach improves the adaptability of the CLIP model to various video action recognition datasets, achieving performance gains of 16.8%, 18.4%, and 13.8% on HMDB-51, UCF-101, and the egocentric dataset EPIC-KITCHENS-100, respectively. Additionally, our visual-only prompting method is competitive with existing fine-tuning and adaptation methods while requiring fewer learnable parameters. Moreover, through extensive ablation studies, we identify the optimal balance between imperceptibility and adaptability. Code will be made available.
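The two prompt types described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `apply_watermark` and `prepend_trademark`, the tanh squashing, the bound `epsilon = 4/255`, and the toy shapes are all assumptions chosen for illustration, and random numbers stand in for the trainable prompt generator and the frozen image encoder.

```python
import math
import random


def apply_watermark(frame, raw_prompt, epsilon=4.0 / 255.0):
    """Add a bounded perturbation to one frame (flat list of pixels in [0, 1]).

    Assumption: imperceptibility is modeled by squashing the raw prompt
    into (-epsilon, epsilon) with tanh; the paper may use another scheme.
    """
    out = []
    for pixel, p in zip(frame, raw_prompt):
        delta = epsilon * math.tanh(p)
        out.append(min(1.0, max(0.0, pixel + delta)))  # keep a valid pixel range
    return out


def prepend_trademark(frame_embeddings, trademark_tokens):
    """Prepend domain-level prompt vectors to the per-frame embedding sequence."""
    return list(trademark_tokens) + list(frame_embeddings)


random.seed(0)
T, P, D, K = 4, 6, 3, 2  # frames, pixels per frame, embedding dim, trademark tokens

# Toy clip and a stand-in for the output of a trainable prompt generator.
clip = [[random.random() for _ in range(P)] for _ in range(T)]
raw_prompts = [[random.gauss(0.0, 1.0) for _ in range(P)] for _ in range(T)]

watermarked = [apply_watermark(f, p) for f, p in zip(clip, raw_prompts)]
max_change = max(
    abs(a - b) for f, w in zip(clip, watermarked) for a, b in zip(f, w)
)

# Stand-in for per-frame embeddings from a frozen image encoder.
frame_embeddings = [[random.random() for _ in range(D)] for _ in range(T)]
trademark = [[0.1] * D for _ in range(K)]  # hypothetical learnable vectors
sequence = prepend_trademark(frame_embeddings, trademark)
```

In this sketch the per-pixel change never exceeds `epsilon`, which is one simple way to keep a prompt visually negligible, and the sequence passed onward has `T + K` entries with the trademark vectors first.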

Published in Mathematics

ISSN: 2227-7390 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Mathematics
Website: http://www.mdpi.com/journal/mathematics
