Applied Sciences (Jul 2022)
Incorporating External Knowledge Reasoning for Vision-and-Language Navigation with Assistant’s Help
Abstract
Vision-and-Language Navigation (VLN) is a task designed to enable embodied agents to carry out natural language instructions in realistic environments. Most VLN tasks, however, are guided by elaborate instructions that describe the route step by step. This setting deviates from real-world problems, in which humans only describe the target object and its surroundings and allow the robot to ask for help when required. Vision-based Navigation with Language-based Assistance (VNLA) is a recently proposed task that requires an agent to navigate to and find a target object according to a high-level language instruction. Because no step-by-step navigation guidance is available, the key to VNLA is to conduct goal-oriented exploration. In this paper, we design an Attention-based Knowledge-enabled Cross-modality Reasoning with Assistant’s Help (AKCR-AH) model to address the unique challenges of this task. AKCR-AH learns a generalized navigation strategy from three new perspectives: (1) external commonsense knowledge is incorporated into visual relational reasoning, so that the agent takes the proper action at each viewpoint by learning the internal–external correlations among object and room entities; (2) a simulated human assistant is introduced into the environment, who provides direct intervention assistance when required; (3) a memory-based Transformer architecture is adopted as the policy framework, making full use of the history clues stored in memory tokens for exploration. Extensive experiments demonstrate the effectiveness of our method compared with other baselines.
Keywords