IET Image Processing (Mar 2024)

Cross‐modal fusion encoder via graph neural network for referring image segmentation

  • Yuqing Zhang,
  • Yong Zhang,
  • Xinglin Piao,
  • Peng Yuan,
  • Yongli Hu,
  • Baocai Yin

DOI
https://doi.org/10.1049/ipr2.13008
Journal volume & issue
Vol. 18, no. 4
pp. 1083 – 1095

Abstract

Referring image segmentation identifies object masks in images under the guidance of input natural language expressions. Many remarkable cross-modal decoders have been devoted to this task, but these models face two key challenges. First, they usually fail to extract fine-grained boundary and gradient information from images. Second, they usually fail to explore language associations among image pixels. In this work, a Multi-scale Gradient-balanced Central Difference Convolution (MG-CDC) and a Graph convolutional network-based Language and Image Fusion (GLIF) are designed for a cross-modal encoder, called Graph-RefSeg. Specifically, in the shallow layers of the encoder, MG-CDC captures comprehensive fine-grained image features; it enhances the perception of target boundaries and provides effective guidance for the deeper encoding layers. In each encoder layer, GLIF performs cross-modal fusion, exploring the correlation between every pixel and its corresponding language vectors via a graph neural network. Since the encoder achieves robust cross-modal alignment and context mining, a lightweight decoder suffices for segmentation prediction. Extensive experiments show that the proposed Graph-RefSeg outperforms state-of-the-art methods on three public datasets. Code and models will be made publicly available at https://github.com/ZYQ111/Graph_refseg.
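To make the MG-CDC idea concrete, the sketch below shows a generic central difference convolution in PyTorch: a vanilla convolution blended with a central-difference term that emphasizes local gradients, which is the mechanism the abstract credits with sharpening boundary perception. This is only a minimal sketch of the underlying CDC operator; the multi-scale and gradient-balancing aspects of the paper's MG-CDC module, and the `theta` value, are assumptions not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralDifferenceConv2d(nn.Module):
    """Generic central difference convolution (CDC) sketch.

    Blends a vanilla convolution with a central-difference term so the
    response also reflects local intensity gradients. NOT the paper's
    exact MG-CDC; theta and the single-scale design are assumptions.
    """
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=padding, bias=False)
        self.theta = theta  # 0 = vanilla conv, 1 = pure difference term

    def forward(self, x):
        out = self.conv(x)
        if self.theta == 0:
            return out
        # Equivalent difference term: subtract a 1x1 conv whose weights are
        # the spatial sums of the kernel, i.e. the response to the centre pixel.
        kernel_diff = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        out_diff = F.conv2d(x, kernel_diff)
        return out - self.theta * out_diff

# Usage: drop-in replacement for a shallow-layer nn.Conv2d.
# y = CentralDifferenceConv2d(3, 64)(torch.randn(1, 3, 224, 224))
```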
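Similarly, the sketch below illustrates the kind of per-pixel, per-word graph fusion the abstract attributes to GLIF: each pixel is treated as a node connected to every word node, edge weights come from a learned affinity, and word information is propagated back to the pixels. The layer sizes, the softmax-normalized adjacency, and the single propagation step are all assumptions for illustration, not the paper's actual GLIF design.

```python
import torch
import torch.nn as nn

class CrossModalGraphFusion(nn.Module):
    """Minimal pixel-word graph fusion sketch (not the paper's GLIF).

    Builds a bipartite graph between flattened pixel features and word
    vectors, normalizes the pixel-to-word edges with a softmax, and
    aggregates word messages back into each pixel node.
    """
    def __init__(self, vis_dim, lang_dim, hid_dim=256):
        super().__init__()
        self.q = nn.Linear(vis_dim, hid_dim)   # pixel-node queries
        self.k = nn.Linear(lang_dim, hid_dim)  # word-node keys
        self.v = nn.Linear(lang_dim, hid_dim)  # word-node messages
        self.out = nn.Linear(hid_dim, vis_dim)

    def forward(self, vis, words):
        # vis: (B, HW, vis_dim) flattened pixel features
        # words: (B, L, lang_dim) per-word language vectors
        scores = self.q(vis) @ self.k(words).transpose(1, 2)
        adj = torch.softmax(scores / self.q.out_features ** 0.5, dim=-1)
        msg = adj @ self.v(words)      # aggregate word messages per pixel
        return vis + self.out(msg)     # residual update of the pixel nodes

# Usage: fused = CrossModalGraphFusion(512, 768)(pixel_feats, word_feats)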
