IEEE Access (Jan 2021)

Medical Image Captioning Model to Convey More Details: Methodological Comparison of Feature Difference Generation

  • Hyeryun Park,
  • Kyungmo Kim,
  • Seongkeun Park,
  • Jinwook Choi

DOI: https://doi.org/10.1109/ACCESS.2021.3124564
Journal volume & issue: Vol. 9, pp. 150560–150568

Abstract

The steadily increasing number of medical images places a tremendous burden on doctors, who need to read them and write reports. If an image captioning model could generate report drafts from the corresponding images, the workload of doctors would be reduced, saving time and expense. The aim of this study was to develop a chest X-ray image captioning model that considers the differences between patient images and normal images and uses hierarchical long short-term memory (LSTM) or a transformer as a decoder to generate reports. We investigated which feature representation method was most appropriate for capturing these differences. The feature representations differed in whether global average pooling was applied to the visual feature vectors and in how the feature difference vectors were generated. Experiments were conducted on two datasets using the proposed models and recent captioning models (X-LAN and X-Transformer), with BLEU, METEOR, ROUGE-L, and CIDEr as evaluation metrics. The model with the best scores on most metrics was the multi-difference non-average-pooling transformer model, which uses the transformer decoder, does not apply global average pooling to the visual feature vectors, and generates the feature differences with the element-wise product. The transformer decoder was found to be more suitable than hierarchical LSTM. Furthermore, for models that do not condense features with global average pooling, the element-wise product was more effective than subtraction in expressing the feature differences.
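For a concrete picture of the compared variants, the following minimal PyTorch sketch contrasts the feature-difference operations the abstract describes: with or without global average pooling of the visual features, and subtraction versus the element-wise product. The function name, tensor shapes, and the exact form of the product are illustrative assumptions, not the authors' released implementation.

    import torch

    def feature_difference(patient_feats, normal_feats, pool=False, mode="product"):
        # patient_feats, normal_feats: visual features of shape (regions, dim),
        # e.g. a CNN grid of 49 regions with 2048-dim vectors (assumed shapes).
        if pool:
            # Global average pooling condenses each image to one (dim,) vector.
            patient_feats = patient_feats.mean(dim=0)
            normal_feats = normal_feats.mean(dim=0)
        if mode == "subtract":
            return patient_feats - normal_feats   # difference by subtraction
        if mode == "product":
            return patient_feats * normal_feats   # difference by element-wise product
        raise ValueError(f"unknown mode: {mode}")

    # Example: the non-average-pooling, element-wise-product configuration,
    # which the abstract reports as best on most metrics.
    patient = torch.randn(49, 2048)
    normal = torch.randn(49, 2048)
    diff = feature_difference(patient, normal, pool=False, mode="product")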
