Survey: Transformer based video-language pre-training

Ludan Ruan; Qin Jin

AI Open (Jan 2022)

Affiliations

Ludan Ruan: School of Information, Renmin University of China, Beijing, China
Qin Jin (corresponding author): School of Information, Renmin University of China, Beijing, China

Journal volume & issue: Vol. 3, pp. 1–13

Abstract


Inspired by the success of transformer-based pre-training methods on natural language tasks and, subsequently, computer vision tasks, researchers have started to apply transformers to video processing. This survey aims to provide a comprehensive overview of transformer-based pre-training methods for Video-Language learning. We first briefly introduce the transformer structure as background knowledge, covering components such as the attention mechanism and position encoding. We then describe the typical pre-training and fine-tuning paradigm for Video-Language processing in terms of proxy tasks, downstream tasks, and commonly used video datasets. Next, we categorize transformer models into Single-Stream and Multi-Stream structures, highlight their innovations, and compare their performance. Finally, we analyze and discuss the current challenges and possible future research directions for Video-Language pre-training.

Published in AI Open

ISSN: 2666-6510 (Online)
Publisher: KeAi Communications Co. Ltd.
Country of publisher: China
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.keaipublishing.com/en/journals/ai-open/
