Sensors (Aug 2025)
InstructSee: Instruction-Aware and Feedback-Driven Multimodal Retrieval with Dynamic Query Generation
Abstract
Cross-modal retrieval has garnered significant attention in recent years for its potential to bridge heterogeneous data modalities, particularly in aligning visual content with natural language. Despite notable progress, existing methods often struggle to capture user intent accurately when queries are expressed through complex or evolving instructions. To address this challenge, we propose a novel cross-modal representation learning framework that incorporates an instruction-aware dynamic query generation mechanism, augmented by the semantic reasoning capabilities of large language models (LLMs). The framework dynamically constructs and iteratively refines query representations, conditioned on natural language instructions and guided by user feedback, enabling the system to infer and adapt to implicit retrieval intent. Extensive experiments on standard multimodal retrieval benchmarks demonstrate that our method significantly improves retrieval accuracy and adaptability, outperforming fixed-query baselines while exhibiting stronger cross-modal alignment and generalization across diverse retrieval tasks.
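To make the core mechanism concrete, the sketch below shows one way an instruction-conditioned query generator with iterative, feedback-driven refinement could be rendered in PyTorch. It is a minimal illustrative sketch of the idea summarized above, not the paper's actual implementation; all module names, dimensions, and parameters (e.g., DynamicQueryGenerator, num_refinement_steps) are assumptions introduced for exposition.

```python
# Minimal sketch (PyTorch) of instruction-aware dynamic query generation
# with iterative, feedback-driven refinement. All names and dimensions are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class DynamicQueryGenerator(nn.Module):
    def __init__(self, dim: int = 512, num_refinement_steps: int = 3):
        super().__init__()
        self.num_refinement_steps = num_refinement_steps
        # Fuse the initial query embedding with the instruction embedding
        # (the latter could come from an LLM-derived representation).
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # One refinement step: update the query state given a feedback
        # embedding (e.g., an encoding of items the user accepted/rejected).
        self.refine = nn.GRUCell(input_size=dim, hidden_size=dim)

    def forward(self, base_query, instruction_emb, feedback_embs=None):
        # base_query, instruction_emb: (batch, dim)
        # feedback_embs: optional list of (batch, dim) tensors, one per round
        query = self.fuse(torch.cat([base_query, instruction_emb], dim=-1))
        if feedback_embs is not None:
            for fb in feedback_embs[: self.num_refinement_steps]:
                query = self.refine(fb, query)  # feedback updates the query
        # Normalize so the query can be matched to candidates by cosine similarity.
        return nn.functional.normalize(query, dim=-1)


# Toy usage with random embeddings; in the framework described above these
# would come from frozen image/text encoders and an instruction encoder.
if __name__ == "__main__":
    gen = DynamicQueryGenerator(dim=512)
    base = torch.randn(4, 512)   # initial query embedding
    instr = torch.randn(4, 512)  # instruction embedding
    fb = [torch.randn(4, 512)]   # one round of user-feedback embedding
    q = gen(base, instr, feedback_embs=fb)
    print(q.shape)               # torch.Size([4, 512])
```

Under these assumptions, the refined query would then be scored against candidate image or text embeddings (e.g., by cosine similarity) to produce the ranked retrieval results evaluated in the experiments.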
Keywords