Thangka Image Captioning Based on Semantic Concept Prompt and Multimodal Feature Optimization

Wenjin Hu; Lang Qiao; Wendong Kang; Xinyue Shi

doi:10.3390/jimaging9080162

Journal of Imaging (Aug 2023)

Affiliations

Wenjin Hu: School of Mathematics and Computer Science, Northwest Minzu University, Lanzhou 730030, China
Lang Qiao: Key Laboratory of China’s Ethnic Languages and Information Technology of Ministry of Education, Northwest Minzu University, Lanzhou 730030, China
Wendong Kang: Key Laboratory of China’s Ethnic Languages and Information Technology of Ministry of Education, Northwest Minzu University, Lanzhou 730030, China
Xinyue Shi: Key Laboratory of China’s Ethnic Languages and Information Technology of Ministry of Education, Northwest Minzu University, Lanzhou 730030, China

DOI: https://doi.org/10.3390/jimaging9080162
Journal volume & issue: Vol. 9, no. 8, p. 162

Abstract

Thangka images are highly diverse and visually rich, and existing deep learning-based image captioning methods produce Chinese captions for them that lack accuracy and richness. To address this issue, this paper proposes a Semantic Concept Prompt and Multimodal Feature Optimization network (SCAMF-Net). The Semantic Concept Prompt (SCP) module is introduced in the text encoding stage; it supplies contextual prompts that capture more semantic information about the Thangka, thereby enriching the content of the generated descriptions. The Multimodal Feature Optimization (MFO) module strengthens the correlation between Thangka image features and text features through a Captioner and a Filter, so that the visual concept features of the Thangka are described more accurately. Experimental results show that the proposed method outperforms baseline models on the Thangka dataset in terms of BLEU-4, METEOR, ROUGE, CIDEr, and SPICE by 8.7%, 7.9%, 8.2%, 76.6%, and 5.7%, respectively. The method also outperforms state-of-the-art methods on the public MSCOCO dataset.
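
To make the Captioner and Filter idea concrete, here is a minimal sketch of one way a filter stage could discard candidate captions that correlate poorly with an image. It is not the authors' implementation: the abstract does not say how the Filter scores image-text pairs, so the cosine-similarity score, the threshold value, the function names (`cosine_similarity`, `filter_captions`), and the toy embeddings are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as uncorrelated
    return dot / (norm_a * norm_b)


def filter_captions(
    image_embedding: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[Tuple[str, float]]:
    """Keep candidate captions whose text embedding is close to the image
    embedding, sorted from best to worst match (hypothetical filter stage)."""
    kept = []
    for caption, text_embedding in candidates:
        score = cosine_similarity(image_embedding, text_embedding)
        if score >= threshold:
            kept.append((caption, score))
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


# Toy 2-D embeddings standing in for real image and text encoder outputs.
image = [1.0, 0.0]
candidates = [
    ("caption A", [1.0, 0.0]),  # same direction as the image embedding
    ("caption B", [0.0, 1.0]),  # orthogonal, so it is filtered out
    ("caption C", [1.0, 1.0]),  # partially aligned
]
kept = filter_captions(image, candidates, threshold=0.5)
for caption, score in kept:
    print(f"{caption}: {score:.3f}")
```

In a real system the embeddings would come from trained image and text encoders, and the filter would more likely be a learned matching head than a fixed threshold. The sketch only shows the selection step.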

Published in Journal of Imaging

ISSN: 2313-433X (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Photography; Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://www.mdpi.com/journal/jimaging
