A Survey of Vision and Language Related Multi-Modal Task

Lanxiao Wang; Wenzhe Hu; Heqian Qiu; Chao Shang; Taijin Zhao; Benliu Qiu; King Ngi Ngan; Hongliang Li

doi:10.26599/AIR.2022.9150008

CAAI Artificial Intelligence Research (Dec 2022)

Affiliations

Lanxiao Wang: Department of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
Wenzhe Hu: Department of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
Heqian Qiu: Department of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
Chao Shang: Department of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
Taijin Zhao: Department of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
Benliu Qiu: Department of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
King Ngi Ngan: The Chinese University of Hong Kong, Hong Kong 999077, China
Hongliang Li: Department of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China

DOI: https://doi.org/10.26599/AIR.2022.9150008
Journal volume & issue: Vol. 1, no. 2
pp. 111 – 136

Abstract

With the significant breakthroughs in single-modal deep learning tasks, more and more work has begun to focus on multi-modal tasks. Multi-modal tasks usually involve more than one modality, where a modality represents a type of behavior or state. Common modalities include vision, hearing, language, touch, and smell. Vision and language are two of the most common modalities in daily human life, and many typical multi-modal tasks focus on these two modalities, such as visual captioning and visual grounding. In this paper, we conduct an in-depth study of typical vision and language tasks from the perspectives of generation, analysis, and reasoning. First, we analyze and summarize the typical tasks and some classical methods, which are grouped according to their different algorithmic concerns, and we further discuss frequently used datasets and metrics. Then, some other variant tasks and cutting-edge tasks are briefly summarized to build a more comprehensive framework of vision and language related multi-modal tasks. Finally, we discuss the development of pre-training related research and provide an outlook for future research. We hope this survey can help researchers understand the latest progress, existing problems, and exploration directions of vision and language related multi-modal tasks, and provide guidance for future research.

Published in CAAI Artificial Intelligence Research

ISSN: 2097-194X (Print); 2097-3691 (Online)
Publisher: Tsinghua University Press
Country of publisher: China
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.sciopen.com/journal/2097-194X
