Jisuanji kexue yu tansuo (Nov 2024)

PTCR: Knowledge-Based Visual Question Answering Framework Based on Large Language Model

  • XUE Di, LI Xin, LIU Mingshuai

DOI
https://doi.org/10.3778/j.issn.1673-9418.2406028
Journal volume & issue
Vol. 18, no. 11
pp. 2912 – 2924

Abstract


Aiming at the problems of insufficient model input information and poor reasoning performance in knowledge-based visual question answering (VQA), this paper constructs PTCR, a knowledge-based VQA framework based on a large language model (LLM), which consists of four parts: answer candidate generation, targeted image description, autonomous chain-of-thought (CoT) construction, and prompted LLM inference. The PTCR framework uses the LLM to guide a multimodal large language model to generate targeted image descriptions, which solves the problem of incomplete coverage in previous image captions. It improves the model's reasoning ability by guiding the LLM to autonomously generate CoTs, which supply the thinking process of similar problems during inference; it introduces choice rearrangement to eliminate the LLM's position bias among answer options during inference, and it reduces the randomness error of inference by majority voting. Experimental results show that the accuracy of the CogVLM model enhanced by the PTCR framework is improved by 16.7 percentage points and 13.3 percentage points on the OK-VQA and A-OKVQA datasets, respectively. Meanwhile, compared with Prophet, the accuracy of the PTCR framework is improved by 3.4 percentage points and 5.0 percentage points on the OK-VQA and A-OKVQA datasets, respectively. Ablation experiments demonstrate that the methods used in this paper, such as targeted image descriptions and autonomous chains of thought, are all effective in improving accuracy. It is evident that the PTCR framework improves the performance of knowledge-based VQA.
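The choice rearrangement and majority-voting steps described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function `query_llm` is a hypothetical stand-in for whatever LLM call the framework uses, and the prompt wording, number of voting rounds, and answer-matching rule are all assumptions made for illustration.

```python
# Minimal sketch (assumed, not the paper's code) of two ideas from the abstract:
# (1) shuffling answer candidates across prompts to counter position bias,
# (2) majority voting over repeated LLM inferences to reduce randomness.

import random
from collections import Counter
from typing import Callable, List


def vote_with_rearranged_choices(
    question: str,
    candidates: List[str],
    query_llm: Callable[[str], str],  # hypothetical: prompt -> model reply text
    n_rounds: int = 5,
    seed: int = 0,
) -> str:
    """Query the LLM several times with the candidate order shuffled each round,
    then return the candidate chosen most often."""
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(n_rounds):
        order = candidates[:]
        rng.shuffle(order)  # each candidate lands in a different position per round
        prompt = (
            f"Question: {question}\n"
            + "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(order))
            + "\nAnswer with the text of the best option."
        )
        reply = query_llm(prompt).strip().lower()
        # Credit the first candidate whose text appears in the reply, if any.
        for c in order:
            if c.lower() in reply:
                votes[c] += 1
                break
    return votes.most_common(1)[0][0] if votes else candidates[0]
```

Shuffling the option order before each query means no single candidate benefits from a fixed position in the prompt, and aggregating by majority vote smooths over run-to-run randomness in the LLM's output.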

Keywords