Sensors (Aug 2025)
InstructSee: Instruction-Aware and Feedback-Driven Multimodal Retrieval with Dynamic Query Generation
Abstract
Cross-modal retrieval has garnered significant attention in recent years for its potential to bridge heterogeneous data modalities, particularly in aligning visual content with natural language. Despite notable progress, existing methods often struggle to capture user intent accurately when queries are expressed through complex or evolving instructions. To address this challenge, we propose a novel cross-modal representation learning framework that incorporates an instruction-aware dynamic query generation mechanism, augmented by the semantic reasoning capabilities of large language models (LLMs). The framework dynamically constructs and iteratively refines query representations, conditioned on natural language instructions and guided by user feedback, enabling the system to infer and adapt to implicit retrieval intent. Extensive experiments on standard multimodal retrieval benchmarks demonstrate that our method significantly improves retrieval accuracy and adaptability, outperforming fixed-query baselines while exhibiting stronger cross-modal alignment and generalization across diverse retrieval tasks.
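To make the core mechanism concrete, the sketch below shows one way an instruction-conditioned query generator with iterative, feedback-driven refinement could be rendered in PyTorch. It is a minimal illustrative sketch of the idea summarized above, not the paper's actual implementation; all module names, dimensions, and parameters (e.g., DynamicQueryGenerator, num_refinement_steps) are assumptions introduced for exposition.

```python
# Minimal sketch (PyTorch) of instruction-aware dynamic query generation
# with iterative, feedback-driven refinement. All names and dimensions are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class DynamicQueryGenerator(nn.Module):
    def __init__(self, dim: int = 512, num_refinement_steps: int = 3):
        super().__init__()
        self.num_refinement_steps = num_refinement_steps
        # Fuse the initial query embedding with the instruction embedding
        # (the latter could come from an LLM-derived representation).
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # One refinement step: update the query state given a feedback
        # embedding (e.g., an encoding of items the user accepted/rejected).
        self.refine = nn.GRUCell(input_size=dim, hidden_size=dim)

    def forward(self, base_query, instruction_emb, feedback_embs=None):
        # base_query, instruction_emb: (batch, dim)
        # feedback_embs: optional list of (batch, dim) tensors, one per round
        query = self.fuse(torch.cat([base_query, instruction_emb], dim=-1))
        if feedback_embs is not None:
            for fb in feedback_embs[: self.num_refinement_steps]:
                query = self.refine(fb, query)  # feedback updates the query
        # Normalize so the query can be matched to candidates by cosine similarity.
        return nn.functional.normalize(query, dim=-1)


# Toy usage with random embeddings; in the framework described above these
# would come from frozen image/text encoders and an instruction encoder.
if __name__ == "__main__":
    gen = DynamicQueryGenerator(dim=512)
    base = torch.randn(4, 512)   # initial query embedding
    instr = torch.randn(4, 512)  # instruction embedding
    fb = [torch.randn(4, 512)]   # one round of user-feedback embedding
    q = gen(base, instr, feedback_embs=fb)
    print(q.shape)               # torch.Size([4, 512])
```

Under these assumptions, the refined query would then be scored against candidate image or text embeddings (e.g., by cosine similarity) to produce the ranked retrieval results evaluated in the experiments.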
Keywords