FB-Net: Dual-Branch Foreground-Background Fusion Network With Multi-Scale Semantic Scanning for Image-Text Retrieval

Junhao Xu; Zheng Liu; Xinlei Pei; Shuhuai Wang; Shanshan Gao

doi:10.1109/ACCESS.2023.3263512

IEEE Access (Jan 2023)

Affiliations

Junhao Xu: School of Computer Science and Technology, Shandong University of Finance and Economics, Ji’nan, China
Zheng Liu: School of Computer Science and Technology, Shandong University of Finance and Economics, Ji’nan, China
Xinlei Pei: School of Computer Science and Technology, Shandong University of Finance and Economics, Ji’nan, China
Shuhuai Wang: School of Computer Science and Technology, Shandong University of Finance and Economics, Ji’nan, China
Shanshan Gao: School of Computer Science and Technology, Shandong University of Finance and Economics, Ji’nan, China

DOI: https://doi.org/10.1109/ACCESS.2023.3263512
Journal volume & issue: Vol. 11
pp. 36516 – 36537

Abstract

As a fundamental branch of cross-modal retrieval, image-text retrieval remains challenging, largely because of the complementary and imbalanced relationship between modalities. Moreover, existing works have not effectively scanned and aligned the semantic units distributed across different granularities of images and texts. To address these issues, we propose a dual-branch foreground-background fusion network (FB-Net), which fully explores and fuses the complementarity of semantic units collected from the foreground and background areas of instances (e.g., images and texts). First, to generate multi-granularity semantic units from images and texts, multi-scale semantic scanning is conducted on both foreground and background areas through multi-level overlapped sliding windows. Second, to align semantic units between images and texts, a stacked cross-attention mechanism is used to calculate the initial image-text similarity. Third, to further optimize the image-text similarity adaptively, a dynamically self-adaptive weighted loss is designed. Finally, to perform image-text retrieval, the similarities between multi-granularity foreground and background semantic units are fused to obtain the final image-text similarity. Experimental results show that the proposed FB-Net outperforms representative state-of-the-art methods for image-text retrieval, and ablation studies further verify the effectiveness of each component of FB-Net.
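
The scanning and fusion steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the window sizes, the stride of 1, the mean pooling, the max-cosine alignment (a simple stand-in for stacked cross-attention), the fusion weight `alpha`, and all function names are assumptions made here for illustration only.

```python
import math


def sliding_windows(features, size, stride=1):
    """Mean-pool windows of `size` feature vectors taken every `stride` items.

    With stride < size, neighbouring windows overlap (assumed setup).
    """
    units = []
    last_start = max(len(features) - size, 0)
    for start in range(0, last_start + 1, stride):
        window = features[start:start + size]
        dim = len(window[0])
        units.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return units


def multi_scale_units(features, sizes=(1, 2, 3)):
    """Collect semantic units at several granularities (hypothetical sizes)."""
    units = []
    for size in sizes:
        units.extend(sliding_windows(features, size))
    return units


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def branch_similarity(image_units, text_units):
    """Average, over text units, of the best-matching image unit.

    This max-cosine alignment only stands in for stacked cross-attention.
    """
    best = [max(cosine(t, i) for i in image_units) for t in text_units]
    return sum(best) / len(best)


def fused_similarity(fg_image, bg_image, fg_text, bg_text, alpha=0.5):
    """Fuse foreground and background similarities with an assumed weight."""
    fg = branch_similarity(multi_scale_units(fg_image), multi_scale_units(fg_text))
    bg = branch_similarity(multi_scale_units(bg_image), multi_scale_units(bg_text))
    return alpha * fg + (1.0 - alpha) * bg
```

Calling `fused_similarity` with lists of equal-dimension feature vectors for the foreground and background of an image and a text returns a single score. In the paper the weighting is learned through the dynamically self-adaptive weighted loss, so the fixed `alpha` here is only a placeholder.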

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
