Jisuanji kexue yu tansuo (Oct 2022)
Review of Image Captioning Methods Based on Encoding-Decoding Technology
Abstract
Image caption generation is a multimodal task in artificial intelligence that combines research in computer vision and natural language processing to convert images into text. It plays an important role in visual assistance and image understanding, and in recent years it has attracted extensive attention from researchers. First, this paper describes the image caption generation task and introduces three classes of methods: template-based, retrieval-based, and encoder-decoder-based. For each class, it presents the underlying idea, representative work, and advantages and disadvantages. Second, it discusses encoder-decoder-based methods in detail, covering the model structure and the research progress in the image understanding phase and the caption generation phase, and it organizes prior work according to these two phases. Research on image understanding covers attention mechanisms and semantic aspects. Research on caption generation is divided into traditional caption generation, dense caption generation, and stylized caption generation. The performance, advantages, and disadvantages of the models are summarized, and the datasets and evaluation metrics used to assess image captioning models are introduced. Finally, the challenges and difficulties in the field of image captioning are pointed out.
Keywords