Self-Organizing Memory Based on Adaptive Resonance Theory for Vision and Language Navigation

Wansen Wu; Yue Hu; Kai Xu; Long Qin; Quanjun Yin

doi:10.3390/math11194192

Mathematics (Oct 2023)

Affiliations

Wansen Wu: College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
Yue Hu: College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
Kai Xu: College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
Long Qin: College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
Quanjun Yin: College of Systems Engineering, National University of Defense Technology, Changsha 410073, China

DOI: https://doi.org/10.3390/math11194192
Journal volume & issue: Vol. 11, no. 19, p. 4192

Abstract

Vision and Language Navigation (VLN) is a task in which an agent must understand natural language instructions to reach a target location in a real-scene environment. To improve a model's long-horizon planning ability, emerging research extends models with different types of memory structures, mainly topological maps or a hidden state vector. However, a fixed-length hidden state vector is often insufficient to capture long-term temporal context, whereas topological maps have been shown to benefit many robotic navigation tasks. We therefore focus on building a feasible and effective topological map representation and using it to improve navigation performance and generalization across seen and unseen environments. This paper presents a Self-organizing Memory based on Adaptive Resonance Theory (SMART) module for incremental topological mapping, together with a framework that uses the SMART module to guide navigation. Based on fusion adaptive resonance theory networks, the SMART module extracts salient scenes from historical observations and builds a topological map of the environmental layout. The map provides a compact spatial representation and supports the discovery of novel shortcuts through inference, while remaining explainable in terms of cognitive science. Furthermore, given a language instruction and the topological map, we propose a vision–language alignment framework for navigational decision-making. Notably, the framework uses three off-the-shelf pre-trained models to perform landmark extraction, node–landmark matching, and low-level control, without any fine-tuning on human-annotated datasets. We validate our approach on VLN-CE tasks using the Habitat simulator, which provides a photo-realistic environment for an embodied agent acting in a continuous action space. The experimental results demonstrate that our approach achieves performance comparable to the supervised baseline.
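To make the idea of incremental topological mapping with adaptive resonance more concrete, the following is a minimal sketch of a single-channel fuzzy ART learner that creates a new node whenever no existing node passes the vigilance test, and links nodes that are visited consecutively. It is not the authors' SMART implementation: the paper builds on fusion ART (multiple input channels), and the class name `FuzzyARTMap`, the method `observe`, the parameter defaults, and the toy two-dimensional features are all assumptions chosen for illustration.

```python
from typing import List, Optional, Set, Tuple


def complement_code(a: List[float]) -> List[float]:
    """Complement coding [a, 1 - a]; assumes features are scaled to [0, 1]."""
    return list(a) + [1.0 - x for x in a]


class FuzzyARTMap:
    """Hypothetical toy: fuzzy ART categories used as topological map nodes."""

    def __init__(self, rho: float = 0.8, alpha: float = 0.01, beta: float = 1.0):
        self.rho = rho      # vigilance: higher values create more, finer nodes
        self.alpha = alpha  # choice parameter (small positive constant)
        self.beta = beta    # learning rate (1.0 = fast learning)
        self.weights: List[List[float]] = []      # one weight vector per node
        self.edges: Set[Tuple[int, int]] = set()  # undirected node links
        self.prev: Optional[int] = None           # node matched at last step

    def observe(self, features: List[float]) -> int:
        """Assign an observation to a node, creating one if none resonates."""
        i_vec = complement_code(features)
        norm_i = sum(i_vec)

        # Rank existing nodes by the choice function |I ^ w| / (alpha + |w|),
        # where ^ is the element-wise minimum (fuzzy AND).
        scored = []
        for j, w in enumerate(self.weights):
            overlap = sum(min(x, y) for x, y in zip(i_vec, w))
            scored.append((overlap / (self.alpha + sum(w)), overlap, j))
        scored.sort(key=lambda t: t[0], reverse=True)

        winner = None
        for _, overlap, j in scored:
            # Vigilance (match) test: |I ^ w| / |I| >= rho.
            if overlap / norm_i >= self.rho:
                w = self.weights[j]
                # Resonance: move the weight toward I ^ w.
                self.weights[j] = [
                    self.beta * min(x, y) + (1.0 - self.beta) * y
                    for x, y in zip(i_vec, w)
                ]
                winner = j
                break

        if winner is None:
            # No node resonates: commit a new node for this novel scene.
            self.weights.append(i_vec)
            winner = len(self.weights) - 1

        # Link consecutively visited nodes to record traversability.
        if self.prev is not None and self.prev != winner:
            self.edges.add((min(self.prev, winner), max(self.prev, winner)))
        self.prev = winner
        return winner


if __name__ == "__main__":
    art_map = FuzzyARTMap(rho=0.8)
    for obs in [[0.1, 0.1], [0.12, 0.1], [0.9, 0.9], [0.1, 0.1]]:
        print(art_map.observe(obs))
    print(sorted(art_map.edges))
```

In this sketch the vigilance parameter `rho` controls how coarse the resulting map is: similar observations are absorbed into an existing node, while a sufficiently different observation creates a new one. A real system would feed learned visual features and pose information rather than hand-written vectors.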

Published in Mathematics

ISSN: 2227-7390 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Mathematics
Website: http://www.mdpi.com/journal/mathematics
