IEEE Access (Jan 2021)

Transfer Learning for Humanoid Robot Appearance-Based Localization in a Visual Map

  • Emmanuel Ovalle-Magallanes,
  • Noe G. Aldana-Murillo,
  • Juan Gabriel Avina-Cervantes,
  • Jose Ruiz-Pinales,
  • Jonathan Cepeda-Negrete,
  • Sergio Ledesma

DOI
https://doi.org/10.1109/ACCESS.2020.3048936
Journal volume & issue
Vol. 9
pp. 6868–6877

Abstract


Autonomous robot visual navigation is a fundamental locomotion task that extracts relevant features from images of the surrounding environment to control the robot's displacement. During navigation, a known visual map helps obtain an accurate localization; in the absence of this map, a guided or free exploration path must be executed to acquire the image sequence that represents the visual map. This paper presents an appearance-based localization method built on a visual map and an end-to-end Convolutional Neural Network (CNN). The CNN is initialized via transfer learning (pre-trained on the ImageNet dataset), and four state-of-the-art architectures are evaluated: VGG16, ResNet50, InceptionV3, and Xception. A typical transfer-learning pipeline replaces the last layer so that the number of output neurons matches the number of custom classes. In this work, the dense layers after the convolutional and pooling layers were substituted by a Global Average Pooling (GAP) layer, which is parameter-free. Additionally, an L2-norm constraint was added to the GAP feature descriptors, constraining the features to lie on a fixed-radius hypersphere. These pre-trained configurations were analyzed and compared using two visual maps from the CIMAT-NAO datasets, consisting of 187 and 94 images, respectively. For evaluating the localization task, sets of 278 and 94 images were available for each visual map, respectively. The numerical results showed that integrating the L2-norm constraint into the training pipeline boosts appearance-based localization performance. Specifically, the pre-trained VGG16 and Xception networks achieved the best localization results, reaching top-3 accuracies of 90.70% and 93.62% on each dataset, respectively, outperforming the reference approaches based on hand-crafted feature extractors.
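The head described in the abstract (pre-trained backbone, parameter-free GAP layer, L2-normalized descriptor on a fixed-radius hypersphere, softmax classifier) can be sketched as follows. This is a minimal sketch assuming a TensorFlow/Keras implementation; the framework, the hypersphere radius `alpha`, the input size, and the use of one class per visual-map image are illustrative assumptions, not details confirmed by the abstract.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 187          # assumption: one class per image in the first visual map
IMG_SHAPE = (224, 224, 3)  # assumption: VGG16 default input size

# Backbone pre-trained on ImageNet; the original dense head is removed.
backbone = VGG16(weights="imagenet", include_top=False, input_shape=IMG_SHAPE)

inputs = layers.Input(shape=IMG_SHAPE)
x = backbone(inputs)
# Parameter-free Global Average Pooling replaces the original dense layers.
x = layers.GlobalAveragePooling2D()(x)
# L2-norm constraint: project the GAP descriptor onto a hypersphere of radius alpha.
alpha = 10.0  # hypothetical radius hyperparameter
x = layers.Lambda(lambda t: alpha * tf.math.l2_normalize(t, axis=1))(x)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    # Top-3 accuracy mirrors the metric reported in the abstract.
    metrics=[tf.keras.metrics.TopKCategoricalAccuracy(k=3)],
)
```

The same head can be attached to ResNet50, InceptionV3, or Xception by swapping the backbone import; only the expected input size changes.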

Keywords