Cyborg and Bionic Systems (Jan 2024)
Leave It to Large Language Models! Correction and Planning with Memory Integration
Abstract
As humans, we naturally break down a task into individual steps in our daily lives, and we can provide feedback or dynamically adjust the plan when encountering obstacles. Similarly, our aim is to enable agents to comprehend and carry out natural language instructions in a more efficient and cost-effective manner. For example, in Vision–Language Navigation (VLN) tasks, the agent needs to understand instructions such as “go to the table by the fridge”. This understanding allows the agent to navigate to the table and infer that the destination is likely to be in the kitchen. Traditional VLN approaches mainly train models on large labeled datasets for task planning in unseen environments, but manual labeling makes this approach costly. Considering that large language models (LLMs) already acquire extensive commonsense knowledge during pre-training, some researchers have started using LLMs as decision modules in embodied tasks, and this line of work demonstrates the LLMs’ reasoning ability to plan a logical sequence of subtasks from global information. However, executing subtasks often runs into issues, such as obstacles that hinder progress and changes in the state of the target object. Even one mistake can cause subsequent subtasks to fail, which makes it challenging to complete the instructions with a single plan. Therefore, we propose a new approach, Correction and Planning with Memory Integration (CPMI), centered on an LLM for embodied tasks. In more detail, the auxiliary modules of CPMI support dynamic planning by the LLM-centric planner: they provide the agent with memory and generalized-experience mechanisms so that it can fully exploit the LLM’s capabilities and improve its performance during execution. Finally, experimental results on public datasets demonstrate that our approach achieves the best performance in the few-shot setting, improving the efficiency of successive task execution while increasing the success rate.
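To make the planning-and-correction loop described above concrete, the following is a minimal Python sketch of how an LLM-centric planner could retry failed subtasks while consulting a memory of past outcomes. It is an illustration only, not the CPMI implementation; all names (Memory, llm_propose_plan, llm_correct, execute) are hypothetical placeholders standing in for the paper's modules.

```python
# Minimal sketch (not the authors' implementation) of an LLM-centric
# plan-execute-correct loop with a memory buffer. All names are hypothetical.

from dataclasses import dataclass, field


@dataclass
class Memory:
    """Stores past subtask outcomes that are fed back to the planner."""
    records: list = field(default_factory=list)

    def add(self, subtask: str, outcome: str) -> None:
        self.records.append((subtask, outcome))

    def summary(self) -> str:
        return "; ".join(f"{s} -> {o}" for s, o in self.records) or "none"


def llm_propose_plan(instruction: str, memory: Memory) -> list:
    """Placeholder for an LLM call that decomposes the instruction into
    subtasks, conditioned on accumulated memory (here: a fixed toy plan)."""
    return ["go to the kitchen", "walk to the fridge", "stop at the table"]


def llm_correct(subtask: str, error: str, memory: Memory) -> str:
    """Placeholder for an LLM call that proposes a corrected subtask
    after a failure, again conditioned on memory."""
    return f"{subtask} (avoiding: {error})"


def execute(subtask: str) -> tuple:
    """Placeholder environment step; returns (success, error message)."""
    return True, ""


def run(instruction: str, max_corrections: int = 2) -> bool:
    memory = Memory()
    plan = llm_propose_plan(instruction, memory)
    for subtask in plan:
        for _ in range(max_corrections + 1):
            ok, error = execute(subtask)
            memory.add(subtask, "ok" if ok else error)
            if ok:
                break
            # On failure, ask the planner for a corrected subtask instead of
            # committing to the original single plan.
            subtask = llm_correct(subtask, error, memory)
        else:
            return False  # correction budget exhausted for this subtask
    return True


if __name__ == "__main__":
    print(run("go to the table by the fridge"))
```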