IEEE Access (Jan 2022)

EAES: Effective Augmented Embedding Spaces for Text-Based Image Captioning

  • Khang Nguyen,
  • Doanh C. Bui,
  • Truc Trinh,
  • Nguyen D. Vo

DOI
https://doi.org/10.1109/ACCESS.2022.3158763
Journal volume & issue
Vol. 10
pp. 32443 – 32452

Abstract

Text-based Image Captioning, introduced in 2020, remains a challenging problem because it requires a model to comprehend not only the visual context but also the scene texts that appear in an image. Therefore, the way images and scene texts are embedded into the main model for training is crucial. Building on the M4C-Captioner model, this paper proposes the simple yet effective EAES embedding module for embedding images and scene texts into the multimodal Transformer layers. In detail, our EAES module contains two main sub-modules: Objects-augmented and Grid features augmentation. With the Objects-augmented module, we provide relative geometry features representing the relations among objects and among OCR tokens. Furthermore, with the Grid features augmentation module, we extract grid features for an image and combine them with visual object features, which helps the model attend to both salient objects and the general context of an image, leading to better performance. We use the TextCaps dataset as the benchmark to evaluate our approach on five standard metrics: BLEU4, METEOR, ROUGE-L, SPICE and CIDEr. Without bells and whistles, our method achieves 20.21% on the BLEU4 metric and 85.78% on the CIDEr metric, which is 1.31% and 4.78% higher, respectively, than the baseline M4C-Captioner method. Furthermore, the results are highly competitive with other methods on the METEOR, ROUGE-L and SPICE metrics. Source code is available at https://github.com/UIT-Together/EAES_m4c.
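The relative geometry features mentioned in the abstract are commonly computed from pairs of bounding boxes. Below is a minimal sketch using the standard 4-dimensional log-space encoding popularized by Relation Networks (Hu et al.); the exact formulation used by EAES may differ, and the function name and box format here are illustrative assumptions.

```python
import math

def relative_geometry(box_i, box_j):
    """Relative geometry feature between two boxes given as (cx, cy, w, h).

    Returns the common 4-d log-space encoding: normalized center offsets
    and log width/height ratios. This is a generic sketch, not necessarily
    the exact encoding in the EAES paper.
    """
    cxi, cyi, wi, hi = box_i
    cxj, cyj, wj, hj = box_j
    eps = 1e-6  # avoid log(0) when centers coincide
    return (
        math.log(abs(cxi - cxj) / wi + eps),  # horizontal offset, scale-normalized
        math.log(abs(cyi - cyj) / hi + eps),  # vertical offset, scale-normalized
        math.log(wj / wi),                    # log width ratio
        math.log(hj / hi),                    # log height ratio
    )

# Example: two boxes in normalized image coordinates
feat = relative_geometry((0.5, 0.5, 0.2, 0.2), (0.7, 0.5, 0.1, 0.1))
```

In models like M4C-Captioner, such pairwise features are typically projected and added as a bias to the attention scores between object (or OCR-token) pairs, letting the Transformer exploit spatial layout.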

Keywords