IEEE Access (Jan 2022)

Image Captioning Model Using Part-of-Speech Guidance Module for Description With Diverse Vocabulary

  • Ju-Won Bae,
  • Soo-Hwan Lee,
  • Won-Yeol Kim,
  • Ju-Hyeon Seong,
  • Dong-Hoan Seo

DOI
https://doi.org/10.1109/ACCESS.2022.3169781
Journal volume & issue
Vol. 10
pp. 45219 – 45229

Abstract

Image captioning aims to generate human-like sentences that describe an image's content. Recent developments in deep learning (DL) have made it possible to caption images with accurate descriptions and detailed expressions. However, because DL learns the relationship between images and captions, it constructs sentences from the words that occur most frequently in the dataset. Although the generated sentences are highly accurate, their limited vocabulary gives them low lexical diversity compared with human descriptions. Therefore, in this paper, we propose a Part-Of-Speech (POS) guidance module and a multimodal-based image captioning model that weights image and word-sequence information and generates sentences through POS to enhance the lexical diversity of DL. The proposed POS guidance module enables rich expression by controlling image and sequence information according to the predicted POS when predicting each word. The POS multimodal layer then combines the POS vector with the output vector of the Bi-LSTM through a multimodal layer to predict the next word while considering grammatical structure. We trained and tested the proposed model on the Flickr30K and MS COCO datasets and compared it with current state-of-the-art studies. We also analyzed the lexical diversity of the captioning model using the Type-Token Ratio (TTR) and confirmed that the proposed model generates sentences with a wider variety of words.
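The Type-Token Ratio used to evaluate lexical diversity is the number of unique word types divided by the total number of tokens in the generated captions. A minimal sketch (the tokenization here is a simple whitespace split, an assumption for illustration; the paper does not specify its preprocessing):

```python
def type_token_ratio(tokens):
    """Type-Token Ratio: unique word types divided by total tokens.

    Higher values indicate more diverse vocabulary; repeated words
    lower the ratio. Returns 0.0 for an empty token list.
    """
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Example caption (hypothetical): 8 tokens, 6 unique types -> 0.75
caption = "a man riding a horse on a beach"
tokens = caption.lower().split()
print(type_token_ratio(tokens))  # 0.75
```

In practice the ratio would be computed over all captions a model produces on the test split, so that vocabulary reuse across sentences is also penalized.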

Keywords