Robot Concept Acquisition Based on Interaction Between Probabilistic and Deep Generative Models

Ryo Kuniyasu; Tomoaki Nakamura; Tadahiro Taniguchi; Takayuki Nagai

doi:10.3389/fcomp.2021.618069

Frontiers in Computer Science (Sep 2021)

Affiliations

Ryo Kuniyasu: Department of Mechanical Engineering and Intelligent Systems, The University of Electro-Communications, Tokyo, Japan
Tomoaki Nakamura: Department of Mechanical Engineering and Intelligent Systems, The University of Electro-Communications, Tokyo, Japan
Tadahiro Taniguchi: College of Information Science and Engineering, Ritsumeikan University, Shiga, Japan
Takayuki Nagai: Department of Systems Innovation, Graduate School of Engineering Science, Osaka University, Osaka, Japan
Takayuki Nagai: Artificial Intelligence EXploration Research Center, The University of Electro-Communications, Tokyo, Japan

Journal volume & issue: Vol. 3

Abstract

We propose a method for multimodal concept formation. The method integrates concept formation based on multimodal latent Dirichlet allocation (MLDA) with feature extraction based on a variational autoencoder (VAE), so that unsupervised multimodal clustering, cross-modal inference, and unsupervised representation learning are performed together. All three capabilities are critical for robots to form multimodal concepts from sensory data. Various models have been proposed for concept formation. However, in previous studies, features were extracted using manually designed or pre-trained feature extractors, and representation learning was not performed simultaneously. Moreover, in cross-modal inference, these models could predict the generative probabilities of the features extracted from the sensory data but not the sensory data themselves. Therefore, concept formation requires a method that can perform clustering, feature learning, and cross-modal inference among multimodal sensory data. To realize such a method, we extend the VAE to the multinomial VAE (MNVAE), whose latent variables follow a multinomial distribution, and construct a model that integrates the MNVAE and MLDA. In the experiments, the integrated model was used to classify multimodal information consisting of images and words acquired by a robot. The results demonstrated that (1) the integrated model classifies the multimodal information as accurately as the previous model even though its feature extractor is learned in an unsupervised manner, (2) image features suitable for clustering can be learned, and (3) cross-modal inference from words to images is possible.
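
As a rough illustration of what it means for a latent variable to follow a multinomial distribution, the sketch below draws a count vector from a multinomial whose category probabilities come from a hypothetical encoder output. A count vector of this kind has the same bag-of-features form that LDA-style models take as input. The function name, the logits, and the number of draws are assumptions made for illustration and are not taken from the paper; the sketch does not cover how the MNVAE is trained or how it is coupled to the MLDA.

```python
import numpy as np


def sample_multinomial_latent(logits, n_draws, rng):
    """Draw a count vector z ~ Multinomial(n_draws, softmax(logits)).

    `logits` stands in for an encoder output (hypothetical); the result is
    a vector of non-negative integer counts that sums to `n_draws`.
    """
    shifted = logits - logits.max()          # subtract max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    counts = rng.multinomial(n_draws, probs)
    return counts, probs


rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, -1.0, 0.0])     # hypothetical encoder output
z, probs = sample_multinomial_latent(logits, n_draws=100, rng=rng)
print("category probabilities:", np.round(probs, 3))
print("latent count vector:", z)
```

The count vector `z` plays the role of a histogram over latent categories, which is why a latent of this form can be handed to a topic model without further quantization.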

Published in Frontiers in Computer Science

ISSN: 2624-9898 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.frontiersin.org/journals/computer-science#
