IEEE Access (Jan 2020)

A Sparse Transformer-Based Approach for Image Captioning

  • Zhou Lei,
  • Congcong Zhou,
  • Shengbo Chen,
  • Yiyong Huang,
  • Xianrui Liu

DOI
https://doi.org/10.1109/ACCESS.2020.3024639
Journal volume & issue
Vol. 8
pp. 213437 – 213446

Abstract


Image captioning is the task of providing a natural language description for an image. It has attracted significant attention from both the computer vision and natural language processing communities. Most image captioning models adopt deep encoder-decoder architectures to achieve state-of-the-art performance. However, it is difficult to model knowledge about the relationships between input image region pairs in the encoder. Furthermore, a word in the decoder hardly knows its correlation to specific image regions. In this article, a novel deep encoder-decoder model for image captioning is proposed, built on the sparse Transformer framework. The encoder adopts a multi-level representation of image features based on self-attention to exploit both low-level and high-level features; the correlations between image region pairs are thus adequately modeled, since the self-attention operation can be seen as a way of encoding pairwise relationships. The decoder improves the concentration of multi-head self-attention on the global context by explicitly selecting the most relevant segments in each row of the attention matrix. This helps the model focus on the more contributing image regions and generate more accurate words in context. Experiments demonstrate that our model outperforms previous methods and achieves higher performance on the MSCOCO and Flickr30k datasets. Our code is available at https://github.com/2014gaokao/ImageCaptioning.
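To illustrate the row-wise sparsification the abstract describes for the decoder, the following is a minimal PyTorch-style sketch of explicit top-k selection applied to scaled dot-product attention: only the k largest scores in each row survive the softmax, and all other positions receive exactly zero weight. The function name sparse_attention and the hyperparameter top_k are illustrative assumptions, not the authors' actual implementation or configuration.

    # Minimal sketch (not the authors' code): explicit top-k sparse attention.
    # For each row of the score matrix, only the top_k largest entries are kept;
    # all other positions are masked to -inf before the softmax, so their
    # attention weights become exactly zero.
    import torch
    import torch.nn.functional as F

    def sparse_attention(q, k, v, top_k=8):
        """q, k, v: (batch, heads, seq_len, d_head). top_k is an assumed hyperparameter."""
        d = q.size(-1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5   # (b, h, Lq, Lk)

        # Keep only the top_k scores per row; mask everything below the
        # k-th largest value to -inf.
        top_k = min(top_k, scores.size(-1))
        kth_vals = scores.topk(top_k, dim=-1).values[..., -1:]      # k-th largest per row
        scores = scores.masked_fill(scores < kth_vals, float('-inf'))

        weights = F.softmax(scores, dim=-1)                          # sparse row-wise weights
        return torch.matmul(weights, v)

In this sketch the masked softmax concentrates each query's attention on its few most relevant keys (image regions), which is the intuition behind the decoder improvement claimed in the abstract.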

Keywords