Gaze-Assisted Multi-Stream Deep Neural Network for Action Recognition

Yinan Liu; Qingbo Wu; Liangzhi Tang; Hengcan Shi

doi:10.1109/ACCESS.2017.2753830

IEEE Access (Jan 2017)

Affiliations

Yinan Liu: School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, China
Qingbo Wu: School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, China
Liangzhi Tang: School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, China
Hengcan Shi: School of Electronic Engineering, University of Electronic Science and Technology of China, Chengdu, China

DOI: https://doi.org/10.1109/ACCESS.2017.2753830
Journal volume & issue: Vol. 5
pp. 19432 – 19441

Abstract

Human action recognition has two important aspects. The first is how to locate the region that best indicates what the subjects in a video are doing. The second is how to exploit the appearance and motion information in the video data. In this paper, we propose a gaze-assisted deep neural network that performs action recognition with the help of human visual attention. Motivated by these two considerations, we first collect a large amount of human gaze data by recording the eye movements of human subjects as they watch the videos. We then train a fully convolutional network to predict human gaze. To use the human gaze efficiently, and inspired by rank pooling, which encodes a video into a single image, we design a novel video representation named dynamic gaze. The proposed dynamic gaze captures both appearance and motion information from the video, while the human gaze data help locate the area of interest. Based on the dynamic gaze, we build a dynamic gaze stream and combine it with the two-stream architecture to form our final multi-stream architecture. We have collected over 300k human gaze maps for the J-HMDB data set, and experiments show that the proposed multi-stream architecture achieves results comparable with the state of the art in action recognition, using both collected and predicted human gaze data.
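
The abstract does not give the exact formulation of dynamic gaze, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes that each frame is weighted pixel-wise by its gaze map and that the weighted frames are then collapsed into one image with the simplified linear rank pooling weights alpha_t = 2t - T - 1. The function names `rank_pooling_weights` and `dynamic_gaze`, the grayscale nested-list image format, and the multiplicative gaze weighting are all hypothetical choices made for illustration, not the authors' implementation.

```python
def rank_pooling_weights(num_frames):
    """Simplified linear rank pooling weights: alpha_t = 2t - T - 1, t = 1..T.

    Later frames get larger weights, so the pooled image encodes temporal order.
    (Assumption: this linear form stands in for the paper's pooling step.)
    """
    return [2 * t - num_frames - 1 for t in range(1, num_frames + 1)]


def dynamic_gaze(frames, gaze_maps):
    """Collapse a gaze-weighted video into a single image.

    frames, gaze_maps: lists of T grayscale images, each an H x W nested list
    of floats. Gaze values are assumed to lie in [0, 1].
    """
    if not frames or len(frames) != len(gaze_maps):
        raise ValueError("frames and gaze_maps must be non-empty and equal in length")

    weights = rank_pooling_weights(len(frames))
    height, width = len(frames[0]), len(frames[0][0])
    pooled = [[0.0] * width for _ in range(height)]

    for alpha, frame, gaze in zip(weights, frames, gaze_maps):
        for y in range(height):
            for x in range(width):
                # Hypothetical weighting: attended pixels contribute more.
                pooled[y][x] += alpha * frame[y][x] * gaze[y][x]
    return pooled
```

In a sketch like this, the resulting single image could be fed to an ordinary image classification network, which is what makes a rank-pooling-style representation convenient as an extra stream alongside the appearance and motion streams.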

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
