Multimodal Attention-Based Instruction-Following Part-Level Affordance Grounding

Wen Qu; Lulu Guo; Jian Cui; Xiao Jin

doi:10.3390/app14114696

Applied Sciences (May 2024)

Affiliations

Wen Qu: Computer Science and Technology, Dalian Maritime University, Gaoxin District, Dalian 116026, China
Lulu Guo: Computer Science and Technology, Dalian Maritime University, Gaoxin District, Dalian 116026, China
Jian Cui: Computer Science and Technology, Dalian Maritime University, Gaoxin District, Dalian 116026, China
Xiao Jin: Computer Science and Technology, Dalian Maritime University, Gaoxin District, Dalian 116026, China

Journal volume & issue: Vol. 14, no. 11, p. 4696

Abstract

The integration of language and vision for object affordance understanding is pivotal for the advancement of embodied agents. Current approaches are often limited by reliance on segregated pre-processing stages for language interpretation and object localization, leading to inefficiencies and error propagation in affordance segmentation. To overcome these limitations, this study introduces a new task: part-level affordance grounding in direct response to natural language instructions. We present the Instruction-based Affordance Grounding Network (IAG-Net), a novel architecture that unifies language–vision interactions through a varied-scale multimodal attention mechanism. Unlike existing models, IAG-Net employs two textual–visual feature fusion strategies, capturing both sentence-level and task-specific textual features alongside multiscale visual features for precise and efficient affordance prediction. Our evaluation on two newly constructed vision–language affordance datasets, IIT-AFF VL and UMD VL, shows improvements of 11.78% and 0.42%, respectively, in mean Intersection over Union (mIoU) over cascaded models, along with faster processing. We contribute to the research community by releasing our source code and datasets, fostering further innovation and replication of our findings.
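
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how mean Intersection over Union can be computed for a segmentation result. It is illustrative only and is not taken from the authors' released code: the function name `mean_iou`, the flat label-list input, and the choice to skip classes absent from both prediction and ground truth are assumptions made for this example.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes present in pred or target.

    pred and target are equal-length flat sequences of integer class
    labels, one per pixel. Skipping classes that appear in neither
    sequence is an assumed convention, not necessarily the paper's.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both prediction and ground truth.
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels labelled c in either prediction or ground truth.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    predicted = [0, 0, 1, 1]     # hypothetical per-pixel predictions
    ground_truth = [0, 1, 1, 1]  # hypothetical per-pixel labels
    print(mean_iou(predicted, ground_truth, num_classes=2))
```

In this toy case the per-class IoU values are 1/2 for class 0 and 2/3 for class 1, so the mean is 7/12.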

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci