Jisuanji kexue yu tansuo (Nov 2024)
PTCR: Knowledge-Based Visual Question Answering Framework Based on Large Language Model
Abstract
To address the problems of insufficient model input information and weak reasoning performance in knowledge-based visual question answering (VQA), this paper constructs PTCR, a knowledge-based VQA framework built on a large language model (LLM), which consists of four parts: answer candidate generation, targeted image description, autonomous chain-of-thought (CoT) construction, and prompted LLM inference. The PTCR framework uses the LLM to guide a multimodal large language model to generate targeted image descriptions, addressing the incomplete coverage of previous image captions. It improves reasoning ability by guiding the LLM to autonomously generate CoTs, which supply the reasoning processes of similar questions during inference; it also introduces an option-reordering technique to eliminate the LLM's positional bias toward answer choices and reduces random reasoning errors through majority voting. Experimental results show that the accuracy of the CogVLM model enhanced by the PTCR framework improves by 16.7 percentage points on the OK-VQA dataset and 13.3 percentage points on the A-OKVQA dataset. Compared with Prophet, the accuracy of the PTCR framework improves by 3.4 percentage points and 5.0 percentage points on the two datasets, respectively. Ablation experiments demonstrate that the proposed components, such as targeted image description and autonomous CoT construction, each contribute to the accuracy gains. These results show that the PTCR framework improves the performance of knowledge-based VQA.
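The option-reordering and majority-voting step summarized above can be illustrated with the following minimal Python sketch. It is not the paper's implementation; the query_llm helper and the prompt format are hypothetical placeholders standing in for the prompted LLM inference stage.

    # Illustrative sketch: shuffle answer options across several queries and take a majority vote.
    import random
    from collections import Counter

    def vote_with_reordered_options(question, context, options, query_llm, n_rounds=5, seed=0):
        """Query the LLM several times with shuffled answer options and
        return the answer chosen most often (majority vote)."""
        rng = random.Random(seed)
        votes = []
        for _ in range(n_rounds):
            shuffled = options[:]      # copy so the original order is untouched
            rng.shuffle(shuffled)      # re-order options to counter positional bias
            prompt = (
                f"{context}\n"
                f"Question: {question}\n"
                "Options: " + "; ".join(shuffled) + "\n"
                "Answer with one of the options."
            )
            answer = query_llm(prompt).strip()
            if answer in options:      # keep only answers that map back to a valid option
                votes.append(answer)
        return Counter(votes).most_common(1)[0][0] if votes else None

In this sketch, shuffling removes any fixed association between an answer and its position in the prompt, and the vote over repeated runs damps the randomness of individual LLM responses.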
Keywords