Heliyon (Sep 2024)

Graphic association learning: Multimodal feature extraction and fusion of image and text using artificial intelligence techniques

  • Guangyun Lu,
  • Zhiping Ni,
  • Ling Wei,
  • Junwei Cheng,
  • Wei Huang

Journal volume & issue
Vol. 10, no. 18
p. e37167

Abstract

With recent advances in technology, artificial intelligence is increasingly applied in everyday life. Graphic (image-text) recognition is an active research topic in this area: machines extract key information from pictures and combine it with natural language processing to reach a deeper understanding. Existing methods still show clear deficiencies in fine-grained recognition and in deep understanding of context. Addressing these issues to achieve high-quality image-text recognition is crucial for application scenarios such as accessibility technologies, content creation, and virtual assistants. To tackle this challenge, a novel approach is proposed that combines the Mask R-CNN, DCGAN, and ALBERT models. Specifically, Mask R-CNN performs high-precision image recognition and segmentation, DCGAN captures and generates nuanced features from images, and ALBERT is responsible for deep natural language processing and semantic understanding of this visual information. Experimental results validate the advantage of this method: compared with traditional image-text recognition techniques, recognition accuracy improves from 85.3% to 92.5% (a gain of 7.2 percentage points), and performance in contextual and situational understanding is also enhanced. This advance has far-reaching implications for research in machine vision and natural language processing and opens new possibilities for practical applications.
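
The abstract does not say how the outputs of the three models are combined. As a rough illustration only, the sketch below assumes a simple late-fusion step in which one feature vector per model is normalized and concatenated. The function names, the feature dimensions (256, 128, 768), and the random stand-in vectors are all hypothetical; none of them are taken from the paper, and the authors' actual fusion mechanism may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale a vector to unit length so no modality dominates by magnitude."""
    return x / (np.linalg.norm(x) + eps)

def fuse_features(region_feat, gan_feat, text_feat):
    """Hypothetical late fusion: concatenate the unit-normalized vectors."""
    parts = [
        l2_normalize(np.asarray(v, dtype=np.float64))
        for v in (region_feat, gan_feat, text_feat)
    ]
    return np.concatenate(parts)

# Random stand-ins for real model outputs (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
region_feat = rng.normal(size=256)  # e.g. pooled Mask R-CNN region features
gan_feat = rng.normal(size=128)     # e.g. DCGAN-derived image features
text_feat = rng.normal(size=768)    # e.g. an ALBERT sentence embedding

fused = fuse_features(region_feat, gan_feat, text_feat)
print(fused.shape)  # (1152,)
```

In a real system the fused vector would feed a downstream classifier or decoder. Concatenation is only one option, and attention-based or gated fusion are common alternatives.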

Keywords