IET Computer Vision (Feb 2017)

Generating image descriptions with multidirectional 2D long short‐term memory

  • Shuohao Li,
  • Jun Zhang,
  • Qiang Guo,
  • Jun Lei,
  • Dan Tu

DOI
https://doi.org/10.1049/iet-cvi.2015.0473
Journal volume & issue
Vol. 11, no. 1
pp. 104–111

Abstract

Connecting visual imagery with descriptive language is a challenge for computer vision and machine translation. To approach this problem, the authors propose a novel end-to-end model that generates descriptions for images. Early works used a convolutional neural network–long short-term memory (CNN-LSTM) model to describe images, in which a CNN encodes the input image into a feature vector and an LSTM decodes that vector into a description. Since a two-dimensional LSTM (2DLSTM) is translation invariant and can encode the relationships between regions in an image, the authors not only apply a CNN to extract global features of an image, but also use a multidirectional 2DLSTM to encode the feature maps extracted by the CNN into structural local features. The model is trained by maximising the likelihood of the target description sentences in the training dataset. Experiments on two challenging datasets demonstrate the accuracy of the model and the fluency of the language it learns. Comparing the bilingual evaluation understudy (BLEU) scores and retrieval metrics of their results with current state-of-the-art scores shows improvements on both Flickr30k and MS COCO.
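The abstract does not give the cell equations of the encoder, so the following is only a minimal sketch of a multidirectional 2DLSTM over CNN feature maps, written in PyTorch and assuming a Graves-style multidimensional LSTM update. The class names (`MD2DLSTMCell`, `Multidirectional2DLSTM`), the gate layout, and the feature-map sizes are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class MD2DLSTMCell(nn.Module):
    """A single 2D-LSTM step (assumed Graves-style update): position
    (i, j) sees its input plus the hidden and cell states of the top
    (i-1, j) and left (i, j-1) neighbours."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        # Five gate blocks: input, forget-top, forget-left, candidate, output.
        self.proj = nn.Linear(in_dim + 2 * hid_dim, 5 * hid_dim)

    def forward(self, x, h_top, c_top, h_left, c_left):
        z = self.proj(torch.cat([x, h_top, h_left], dim=-1))
        i, f_t, f_l, g, o = z.chunk(5, dim=-1)
        c = (torch.sigmoid(f_t) * c_top
             + torch.sigmoid(f_l) * c_left
             + torch.sigmoid(i) * torch.tanh(g))
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class Multidirectional2DLSTM(nn.Module):
    """Scans an H x W feature map from each of the four corners and
    concatenates the four hidden-state maps, so every position
    receives context from all directions."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.hid_dim = hid_dim
        self.cells = nn.ModuleList(MD2DLSTMCell(in_dim, hid_dim) for _ in range(4))

    def _scan(self, cell, fmap):
        # fmap: (B, H, W, D); returns hidden states of shape (B, H, W, hid_dim).
        B, H, W, _ = fmap.shape
        zero = fmap.new_zeros(B, self.hid_dim)
        h = [[None] * W for _ in range(H)]
        c = [[None] * W for _ in range(H)]
        for i in range(H):
            for j in range(W):
                h_t, c_t = (h[i - 1][j], c[i - 1][j]) if i > 0 else (zero, zero)
                h_l, c_l = (h[i][j - 1], c[i][j - 1]) if j > 0 else (zero, zero)
                h[i][j], c[i][j] = cell(fmap[:, i, j], h_t, c_t, h_l, c_l)
        return torch.stack([torch.stack(row, dim=1) for row in h], dim=1)

    def forward(self, fmap):
        outs = []
        for k, cell in enumerate(self.cells):
            # Flipping the map before and after a top-left scan is
            # equivalent to scanning from a different corner.
            dims = [d for d, flip in ((1, k // 2 == 1), (2, k % 2 == 1)) if flip]
            x = torch.flip(fmap, dims) if dims else fmap
            y = self._scan(cell, x)
            outs.append(torch.flip(y, dims) if dims else y)
        return torch.cat(outs, dim=-1)  # (B, H, W, 4 * hid_dim)

# Example with hypothetical sizes: 14x14 CNN feature maps, 512 channels.
enc = Multidirectional2DLSTM(in_dim=512, hid_dim=128)
local_feats = enc(torch.randn(2, 14, 14, 512))
print(local_feats.shape)  # torch.Size([2, 14, 14, 512])
```

In this sketch the concatenated four-direction output plays the role of the "structural local features" the abstract mentions; in the full model these would be combined with the CNN's global feature vector and fed to an LSTM decoder trained by maximising sentence likelihood.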

Keywords