IEEE Access (Jan 2020)
Cross-Modal Feature Integration Network for Human Eye-Fixation Prediction in RGB-D Images
Abstract
With the advent of convolutional neural networks, visual saliency prediction has progressed rapidly. While integrating features from different stages of the backbone network is important, feature extraction itself is equally important. A network may lose representative information during feature extraction. We address the loss of spatial information and fuse features extracted from RGB and depth data for eye-fixation prediction. Specifically, we propose an asymmetric feature extraction network for RGB-D images that comprises an edge guidance module (EGM) and a feature integration module (FIM). The EGM supports the extraction of spatial information, while the FIM merges features from RGB images and the corresponding depth maps. We obtain the eye-fixation prediction maps by linearly fusing the features from the backbone network with those optimized by the two modules. Experimental results on NCTU and NUS, two benchmark datasets for RGB-D saliency prediction, verify the effectiveness and high performance of the proposed network compared with similar methods.
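As an illustration of the linear fusion step mentioned in the abstract, the following minimal sketch combines several same-sized feature maps by a weighted sum and squashes the result into a [0, 1] saliency map. It is not the authors' implementation: the function name `linear_fuse`, the fixed scalar weights, the bias, and the final sigmoid are all assumptions made for this example, and in a trained network the weights would typically be learned.

```python
import math


def linear_fuse(maps, weights, bias=0.0):
    """Weighted sum of same-sized 2D maps, followed by a sigmoid.

    `maps` is a list of H x W nested lists, e.g. backbone features and
    module-refined features (hypothetical stand-ins for this sketch).
    `weights` holds one scalar per map (assumed fixed here).
    """
    if len(maps) != len(weights):
        raise ValueError("need exactly one weight per feature map")
    height, width = len(maps[0]), len(maps[0][0])
    # Start from the bias, then accumulate each weighted map.
    fused = [[bias] * width for _ in range(height)]
    for feature_map, weight in zip(maps, weights):
        for i in range(height):
            for j in range(width):
                fused[i][j] += weight * feature_map[i][j]
    # Sigmoid maps the fused response to a value in (0, 1).
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in fused]


# Toy 1 x 2 maps standing in for backbone, edge-guided, and integrated features.
backbone = [[0.2, -0.4]]
edge_guided = [[0.1, 0.3]]
integrated = [[-0.3, 0.1]]
saliency = linear_fuse([backbone, edge_guided, integrated], [1.0, 1.0, 1.0])
```

With equal weights the sketch reduces to a plain sum of the maps; unequal weights would let one feature source dominate the prediction.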
Keywords