Frontiers in Robotics and AI (Sep 2024)
Bridging vision and touch: advancing robotic interaction prediction with self-supervised multimodal learning
Abstract
Predicting the consequences of an agent’s actions on its environment is a pivotal challenge in robotic learning and plays a key role in developing higher cognitive skills for intelligent robots. While current methods have predominantly relied on vision and motion data to generate predicted videos, more comprehensive sensory perception is required for complex physical interactions such as contact-rich manipulation or highly dynamic tasks. In this work, we investigate the interdependence between vision and tactile sensation in dynamic robotic interaction scenarios. A multi-modal fusion mechanism is introduced into an action-conditioned video prediction model to forecast future scenes, enriching the single-modality prototype with a compressed latent representation of multiple sensory inputs. Additionally, to realize the interactive setting, we build a robotic interaction system equipped with both web cameras and vision-based tactile sensors to collect a dataset of vision-tactile sequences together with the corresponding robot action data. Finally, through a series of qualitative and quantitative comparative studies of different prediction architectures and tasks, we present an analysis of the cross-modality influence among vision, touch, and action, revealing the asymmetrical impact these modalities have on one another when contributing to the interpretation of environmental information. This opens possibilities for more adaptive and efficient robotic control in complex environments, with implications for dexterous manipulation and human-robot interaction.
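To make the described fusion mechanism concrete, the sketch below shows one plausible way to compress vision and tactile frames into latent vectors and concatenate them with the robot action to condition a downstream video predictor. This is a minimal illustration only, not the paper's implementation: the module names, latent and action dimensions, encoder depths, and the concatenation-based fusion are all assumptions.

```python
# Minimal sketch (not the paper's code): compress each image-like modality
# (RGB camera, vision-based tactile image) into a latent vector and fuse it
# with the robot action into a joint conditioning vector for a video
# prediction model. Dimensions and architecture choices are illustrative.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Compress a single 64x64 image-like modality into a latent vector."""

    def __init__(self, in_channels: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=4, stride=2, padding=1),  # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),           # 32 -> 16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class VisionTactileActionFusion(nn.Module):
    """Fuse vision, tactile, and action inputs into one compressed latent."""

    def __init__(self, latent_dim: int = 128, action_dim: int = 7):
        super().__init__()
        self.vision_enc = ModalityEncoder(in_channels=3, latent_dim=latent_dim)
        self.tactile_enc = ModalityEncoder(in_channels=3, latent_dim=latent_dim)
        self.fuse = nn.Sequential(
            nn.Linear(2 * latent_dim + action_dim, latent_dim),
            nn.ReLU(),
        )

    def forward(self, rgb, tactile, action):
        # Concatenate per-modality latents with the action command, then
        # project to the joint latent that conditions the video predictor.
        z = torch.cat([self.vision_enc(rgb), self.tactile_enc(tactile), action], dim=-1)
        return self.fuse(z)


if __name__ == "__main__":
    # Toy batch: 64x64 RGB frame, 64x64 tactile image, 7-DoF action command.
    fusion = VisionTactileActionFusion()
    rgb = torch.randn(2, 3, 64, 64)
    tactile = torch.randn(2, 3, 64, 64)
    action = torch.randn(2, 7)
    print(fusion(rgb, tactile, action).shape)  # torch.Size([2, 128])
```

In this sketch, dropping either encoder branch recovers a single-modality baseline, which is one simple way to probe the asymmetrical cross-modality influence the abstract refers to; the paper's actual ablation protocol may differ.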
Keywords