IEEE Access (Jan 2022)

Image Captioning Model Using Part-of-Speech Guidance Module for Description With Diverse Vocabulary

  • Ju-Won Bae,
  • Soo-Hwan Lee,
  • Won-Yeol Kim,
  • Ju-Hyeon Seong,
  • Dong-Hoan Seo

DOI
https://doi.org/10.1109/ACCESS.2022.3169781
Journal volume & issue
Vol. 10
pp. 45219 – 45229

Abstract

Image captioning aims to generate human-like sentences that describe an image's content. Recent developments in deep learning (DL) have made it possible to caption images with accurate descriptions and detailed expressions. However, because DL learns the relationship between images and captions, it constructs sentences from the words that occur most frequently in the dataset. Although the generated sentences are highly accurate, their limited vocabulary gives them low lexical diversity compared with human descriptions. Therefore, in this paper, we propose a Part-Of-Speech (POS) guidance module and a multimodal-based image captioning model that weights image and word-sequence information and generates sentences through POS to enhance the lexical diversity of DL. The proposed POS guidance module enables rich expression by controlling image and sequence information according to the predicted POS when predicting each word. The POS multimodal layer then combines the POS vector with the output vector of the Bi-LSTM through a multimodal layer to predict the next word while considering grammatical structure. We trained and tested the proposed model on the Flickr30K and MS COCO datasets and compared it with current state-of-the-art studies. We also analyzed the lexical diversity of the captioning model using the Type-Token Ratio (TTR) and confirmed that the proposed model generates sentences with a wider variety of words.
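The Type-Token Ratio used to evaluate lexical diversity is the number of unique word types divided by the total number of tokens in the generated captions. A minimal sketch (the tokenization here is a simple whitespace split, an assumption for illustration; the paper does not specify its preprocessing):

```python
def type_token_ratio(tokens):
    """Type-Token Ratio: unique word types divided by total tokens.

    Higher values indicate more diverse vocabulary; repeated words
    lower the ratio. Returns 0.0 for an empty token list.
    """
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Example caption (hypothetical): 8 tokens, 6 unique types -> 0.75
caption = "a man riding a horse on a beach"
tokens = caption.lower().split()
print(type_token_ratio(tokens))  # 0.75
```

In practice the ratio would be computed over all captions a model produces on the test split, so that vocabulary reuse across sentences is also penalized.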

Keywords