Interpretable Global-Local Dynamics for the Prediction of Eye Fixations in Autonomous Driving Scenarios

Javier Martinez-Cebrian; Miguel-Angel Fernandez-Torres; Fernando Diaz-De-Maria

doi:10.1109/ACCESS.2020.3041606

IEEE Access (Jan 2020)

Affiliations

Javier Martinez-Cebrian: Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Leganés, Spain
Miguel-Angel Fernandez-Torres: Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Leganés, Spain
Fernando Diaz-De-Maria: Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Leganés, Spain

DOI: https://doi.org/10.1109/ACCESS.2020.3041606
Journal volume & issue: Vol. 8, pp. 217068–217085

Abstract

Human eye movements while driving reveal that visual attention largely depends on the context in which it occurs. Furthermore, an autonomous vehicle that predicts visual attention would be more reliable if its outputs were understandable. Capsule Networks have been presented as an opportunity to explore new directions in Computer Vision, owing to their ability to structure and relate latent information. In this article, we present a hierarchical approach for the prediction of eye fixations in autonomous driving scenarios. Context-driven visual attention can be modeled by considering different conditions which, in turn, are represented as combinations of several spatio-temporal features. To learn these conditions, we have built an encoder-decoder network which merges information from visual features using a global-local definition of capsules. Two types of capsules are distinguished: representational capsules for features and discriminative capsules for conditions. The latter, together with eye fixations recorded with wearable eye-tracking glasses, allow the model to learn both to predict contextual conditions and to estimate visual attention by means of a multi-task loss function. Experiments show that our approach can express either frame-level (global) or pixel-wise (local) relationships between features and contextual conditions, allowing for interpretability while maintaining or improving on the performance of related black-box systems in the literature. Indeed, our proposal offers an improvement of 29% in terms of information gain with respect to the best performance reported in the literature.
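
The abstract reports its headline result in terms of information gain but does not define the metric. As a minimal sketch, assuming the definition commonly used in saliency benchmarks (the mean log-likelihood advantage, in bits per fixation, of a predicted map over a baseline map at the fixated pixels), it could be computed as follows. The function name `information_gain`, its parameters, and the `eps` constant are hypothetical and are not taken from the article; the baseline the authors used is not stated here.

```python
import numpy as np


def information_gain(pred, baseline, fixations, eps=1e-12):
    """Mean log2 advantage of `pred` over `baseline` at fixated pixels.

    Assumed definition (not taken from the article):
        IG = mean_i [ log2(P_i + eps) - log2(B_i + eps) ]
    where i ranges over fixated pixels and P, B are maps normalised
    to sum to 1.
    """
    pred = np.asarray(pred, dtype=np.float64)
    baseline = np.asarray(baseline, dtype=np.float64)
    fix = np.asarray(fixations, dtype=bool)
    if not fix.any():
        raise ValueError("fixation map contains no fixations")

    # Normalise both maps so that each one is a probability distribution.
    pred = pred / pred.sum()
    baseline = baseline / baseline.sum()

    # Compare log-probabilities only where the observer actually looked.
    gain = np.log2(pred[fix] + eps) - np.log2(baseline[fix] + eps)
    return float(gain.mean())


if __name__ == "__main__":
    base = np.ones((4, 4))                # uniform baseline
    fix = np.zeros((4, 4), dtype=bool)
    fix[1, 2] = True                      # a single fixation
    pred = np.full((4, 4), 0.5 / 15)
    pred[1, 2] = 0.5                      # half the mass on the fixated pixel
    print(information_gain(pred, base, fix))
```

Under this definition a prediction identical to the baseline scores zero, and positive values mean the model assigns more probability to fixated locations than the baseline does.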

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
