Jisuanji kexue (Oct 2022)

Cross-scale Feature Fusion Self-attention for Image Captioning

  • WANG Ming-zhan, JI Jun-zhong, JIA Ao-zhe, ZHANG Xiao-dan

DOI
https://doi.org/10.11896/jsjkx.220600009
Journal volume & issue
Vol. 49, no. 10
pp. 191 – 197

Abstract

In recent years, the encoder-decoder framework based on the self-attention mechanism has become the mainstream model in image captioning. However, self-attention in the encoder only models the visual relations of low-scale features, ignoring effective information in high-scale visual features and thus degrading the quality of the generated descriptions. To address this problem, this paper proposes a cross-scale feature fusion self-attention (CFFSA) method for image captioning. Specifically, CFFSA integrates low-scale and high-scale visual features within self-attention to widen the range of attention from a visual perspective, which increases the effective visual information and reduces noise, thereby learning more accurate visual and semantic relationships. Experiments on the MS COCO dataset show that the proposed method captures the relationships between cross-scale visual features more accurately and generates more accurate descriptions. In addition, CFFSA is a general method that can further improve model performance when combined with other self-attention based image captioning methods.
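
The abstract describes attention that draws on both low-scale and high-scale visual features. The snippet below is a minimal illustrative sketch of that general idea, not the paper's exact formulation: it assumes queries come from low-scale features while keys and values span both scales, and all module names, dimensions, and the concatenation-based fusion are assumptions introduced here for illustration.

```python
# A minimal PyTorch sketch of cross-scale attention: queries come from
# low-scale features; keys/values are drawn from both low- and high-scale
# features so attention can range across scales. All names, shapes, and
# the fusion-by-concatenation choice are illustrative assumptions.
import torch
import torch.nn as nn


class CrossScaleFusionAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # low:  (B, N_low, d_model)  low-scale visual features (queries)
        # high: (B, N_high, d_model) high-scale visual features
        # Keys/values span both scales, widening the attention range.
        kv = torch.cat([low, high], dim=1)
        fused, _ = self.attn(query=low, key=kv, value=kv)
        # Residual connection plus layer normalization, as is standard
        # in Transformer-style encoders.
        return self.norm(low + fused)


# Usage: fuse 49 low-scale and 196 high-scale feature vectors (sizes assumed).
layer = CrossScaleFusionAttention()
low = torch.randn(2, 49, 512)
high = torch.randn(2, 196, 512)
out = layer(low, high)
print(out.shape)  # torch.Size([2, 49, 512])
```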

Keywords