IET Computer Vision (Feb 2017)

Generating image descriptions with multidirectional 2D long short‐term memory

  • Shuohao Li,
  • Jun Zhang,
  • Qiang Guo,
  • Jun Lei,
  • Dan Tu

DOI
https://doi.org/10.1049/iet-cvi.2015.0473
Journal volume & issue
Vol. 11, no. 1
pp. 104–111

Abstract

Connecting visual imagery with descriptive language is a challenge for computer vision and machine translation. To approach this problem, the authors propose a novel end-to-end model that generates descriptions for images. Early works used a convolutional neural network–long short-term memory (CNN-LSTM) model to describe images, in which a CNN encodes the input image into a feature vector and an LSTM decodes that vector into a description. Since a two-dimensional LSTM (2DLSTM) is translation invariant and can encode the relationships between regions in an image, the authors not only apply a CNN to extract global features of an image, but also use a multidirectional 2DLSTM to encode the feature maps extracted by the CNN into structural local features. The model is trained by maximising the likelihood of the target description sentences in the training dataset. Experiments on two challenging datasets demonstrate the accuracy of the model and the fluency of the language it learns. Comparing the bilingual evaluation understudy (BLEU) scores and retrieval metrics of their results with current state-of-the-art scores shows improvements on both Flickr30k and MS COCO.
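The abstract does not give the cell equations of the encoder, so the following is only a minimal sketch of a multidirectional 2DLSTM over CNN feature maps, written in PyTorch and assuming a Graves-style multidimensional LSTM update. The class names (`MD2DLSTMCell`, `Multidirectional2DLSTM`), the gate layout, and the feature-map sizes are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class MD2DLSTMCell(nn.Module):
    """A single 2D-LSTM step (assumed Graves-style update): position
    (i, j) sees its input plus the hidden and cell states of the top
    (i-1, j) and left (i, j-1) neighbours."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        # Five gate blocks: input, forget-top, forget-left, candidate, output.
        self.proj = nn.Linear(in_dim + 2 * hid_dim, 5 * hid_dim)

    def forward(self, x, h_top, c_top, h_left, c_left):
        z = self.proj(torch.cat([x, h_top, h_left], dim=-1))
        i, f_t, f_l, g, o = z.chunk(5, dim=-1)
        c = (torch.sigmoid(f_t) * c_top
             + torch.sigmoid(f_l) * c_left
             + torch.sigmoid(i) * torch.tanh(g))
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class Multidirectional2DLSTM(nn.Module):
    """Scans an H x W feature map from each of the four corners and
    concatenates the four hidden-state maps, so every position
    receives context from all directions."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.hid_dim = hid_dim
        self.cells = nn.ModuleList(MD2DLSTMCell(in_dim, hid_dim) for _ in range(4))

    def _scan(self, cell, fmap):
        # fmap: (B, H, W, D); returns hidden states of shape (B, H, W, hid_dim).
        B, H, W, _ = fmap.shape
        zero = fmap.new_zeros(B, self.hid_dim)
        h = [[None] * W for _ in range(H)]
        c = [[None] * W for _ in range(H)]
        for i in range(H):
            for j in range(W):
                h_t, c_t = (h[i - 1][j], c[i - 1][j]) if i > 0 else (zero, zero)
                h_l, c_l = (h[i][j - 1], c[i][j - 1]) if j > 0 else (zero, zero)
                h[i][j], c[i][j] = cell(fmap[:, i, j], h_t, c_t, h_l, c_l)
        return torch.stack([torch.stack(row, dim=1) for row in h], dim=1)

    def forward(self, fmap):
        outs = []
        for k, cell in enumerate(self.cells):
            # Flipping the map before and after a top-left scan is
            # equivalent to scanning from a different corner.
            dims = [d for d, flip in ((1, k // 2 == 1), (2, k % 2 == 1)) if flip]
            x = torch.flip(fmap, dims) if dims else fmap
            y = self._scan(cell, x)
            outs.append(torch.flip(y, dims) if dims else y)
        return torch.cat(outs, dim=-1)  # (B, H, W, 4 * hid_dim)

# Example with hypothetical sizes: 14x14 CNN feature maps, 512 channels.
enc = Multidirectional2DLSTM(in_dim=512, hid_dim=128)
local_feats = enc(torch.randn(2, 14, 14, 512))
print(local_feats.shape)  # torch.Size([2, 14, 14, 512])
```

In this sketch the concatenated four-direction output plays the role of the "structural local features" the abstract mentions; in the full model these would be combined with the CNN's global feature vector and fed to an LSTM decoder trained by maximising sentence likelihood.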

Keywords