IEEE Access (Jan 2020)

An Interpretable Visual Attention Plug-in for Convolutions

  • Chih-Yang Lin,
  • Chia-Lin Wu,
  • Hui-Fuang Ng,
  • Timothy K. Shih

DOI: https://doi.org/10.1109/ACCESS.2020.3011696
Journal volume & issue: Vol. 8, pp. 136992–137003

Abstract


Raw images, which may contain many noisy background pixels, are typically used in convolutional neural network (CNN) training. This paper proposes a novel variance loss function based on a ground-truth mask of the target object to enhance the visual attention of a CNN. The loss function regularizes the training process so that the feature maps in later convolutional layers focus more on target object areas and less on the background. The attention loss is computed directly from the feature maps, so no new parameters are added to the backbone network and no extra computational cost is incurred in the testing phase. The proposed attention model can be plugged into any pre-trained network architecture and can be used in conjunction with other attention models. Experimental results demonstrate that the proposed variance loss function improves classification accuracy by 2.22% over the baseline on the Stanford Dogs dataset, which is significantly higher than the improvements achieved by SENet (0.3%) and CBAM (1.14%). Our method also improves object detection accuracy by 2.5 mAP on the Pascal-VOC2007 dataset and store sign detection by 2.66 mAP over the respective baseline models. Furthermore, the proposed loss function enhances the visualization and interpretability of a CNN.
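
To illustrate how a mask-guided loss of this kind can be attached to an existing backbone without adding parameters, the PyTorch-style sketch below shows one plausible formulation. It is not the paper's exact variance loss: the function name mask_guided_attention_loss, the weight lambda_att, and the choice to penalize channel-averaged activation mass on the background are all illustrative assumptions made here for clarity.

    import torch
    import torch.nn.functional as F


    def mask_guided_attention_loss(feature_maps: torch.Tensor,
                                   object_mask: torch.Tensor) -> torch.Tensor:
        """Illustrative mask-guided attention regularizer (not the paper's exact formula).

        feature_maps: (N, C, h, w) activations from a later convolutional layer.
        object_mask:  (N, 1, H, W) binary ground-truth mask of the target object.
        """
        # Collapse channels into one spatial attention map per image.
        attention = feature_maps.abs().mean(dim=1, keepdim=True)          # (N, 1, h, w)

        # Resize the ground-truth mask to the feature-map resolution.
        mask = F.interpolate(object_mask.float(),
                             size=attention.shape[-2:], mode="nearest")   # (N, 1, h, w)

        # Normalize so each image's attention sums to 1 (scale invariance).
        attention = attention / (attention.sum(dim=(2, 3), keepdim=True) + 1e-6)

        # Attention mass that falls on the background; minimizing it pushes
        # activations toward the object region. No learnable parameters are
        # introduced, so inference cost is unchanged, consistent with the
        # paper's claim about the testing phase.
        background_mass = (attention * (1.0 - mask)).sum(dim=(2, 3))
        return background_mass.mean()


    # Example of combining it with a standard classification loss; lambda_att is a
    # hypothetical weighting hyper-parameter, not a value taken from the paper.
    # total_loss = F.cross_entropy(logits, labels) \
    #              + lambda_att * mask_guided_attention_loss(feats, masks)

Because the regularizer is computed only from existing feature maps and the ground-truth mask, it can be dropped at inference time, which is how a loss-based attention plug-in avoids any test-time overhead.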

Keywords