Graphic association learning: Multimodal feature extraction and fusion of image and text using artificial intelligence techniques

Guangyun Lu; Zhiping Ni; Ling Wei; Junwei Cheng; Wei Huang

Heliyon (Sep 2024)

Affiliations

Guangyun Lu: College of Information Science and Engineering, Liuzhou Institute of Technology, 545616, Liuzhou, Guangxi, China
Zhiping Ni: College of Information Science and Engineering, Liuzhou Institute of Technology, 545616, Liuzhou, Guangxi, China; Corresponding authors.
Ling Wei: College of Information Science and Engineering, Liuzhou Institute of Technology, 545616, Liuzhou, Guangxi, China; Corresponding authors.
Junwei Cheng: College of Information Science and Engineering, Liuzhou Institute of Technology, 545616, Liuzhou, Guangxi, China
Wei Huang: College of Automotive Engineering, Liuzhou Institute of Technology, 545616, Liuzhou, Guangxi, China

Journal volume & issue: Vol. 10, no. 18
p. e37167

Abstract

With advances in technology in recent years, artificial intelligence has been applied more widely in everyday life. Image-text (graphic) recognition is an active research topic in this area. It involves machines extracting key information from images and combining it with natural language processing for in-depth understanding. Existing methods still show clear deficiencies in fine-grained recognition and in deep contextual understanding. Addressing these issues to achieve high-quality image-text recognition is crucial for application scenarios such as accessibility technologies, content creation, and virtual assistants. To tackle this challenge, a novel approach is proposed that combines the Mask R-CNN, DCGAN, and ALBERT models. Specifically, Mask R-CNN performs high-precision image recognition and segmentation, DCGAN captures and generates nuanced features from images, and ALBERT is responsible for deep natural language processing and semantic understanding of this visual information. Experimental results validate the superiority of this method: compared with traditional image-text recognition techniques, recognition accuracy improves from 85.3% to 92.5%, and performance in contextual and situational understanding is also enhanced. This advance has far-reaching implications for research in machine vision and natural language processing and opens new possibilities for practical applications.
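To put the reported accuracy figures in perspective, the short sketch below works out the arithmetic behind the change from 85.3% to 92.5%. The function name `accuracy_gain` and the three derived quantities are illustrative assumptions and do not come from the paper; only the two accuracy values are taken from the abstract.

```python
def accuracy_gain(baseline_pct, improved_pct):
    """Return (absolute gain in percentage points,
               relative gain in % of the baseline accuracy,
               relative reduction in error rate in %),
    each rounded to one decimal place."""
    abs_gain = improved_pct - baseline_pct
    rel_gain = abs_gain / baseline_pct * 100
    # Error rate is taken as 100 - accuracy (an assumption about the metric).
    error_reduction = abs_gain / (100 - baseline_pct) * 100
    return round(abs_gain, 1), round(rel_gain, 1), round(error_reduction, 1)


# Values reported in the abstract.
points, relative, error_cut = accuracy_gain(85.3, 92.5)
print(points, relative, error_cut)
```

Under these assumptions, the improvement is 7.2 percentage points, which is about 8.4% relative to the baseline accuracy and roughly halves the error rate (14.7% to 7.5%).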

Published in Heliyon

ISSN: 2405-8440 (Online)
Publisher: Elsevier
Country of publisher: United Kingdom
LCC subjects: Science: Science (General); Social Sciences: Social sciences (General)
Website: https://www.cell.com/heliyon/home
