IEEE Access (Jan 2021)

Medical Image Captioning Model to Convey More Details: Methodological Comparison of Feature Difference Generation

  • Hyeryun Park,
  • Kyungmo Kim,
  • Seongkeun Park,
  • Jinwook Choi

DOI: https://doi.org/10.1109/ACCESS.2021.3124564
Journal volume & issue: Vol. 9, pp. 150560–150568

Abstract

The steadily increasing number of medical images places a tremendous burden on doctors, who need to read them and write reports. If an image captioning model could generate report drafts from the corresponding images, the workload of doctors would be reduced, saving time and expense. The aim of this study was to develop a chest X-ray image captioning model that considers the differences between patient images and normal images and uses hierarchical long short-term memory (LSTM) or a transformer as a decoder to generate reports. We investigated which feature representation method was most appropriate for capturing these differences. The feature representations differed in whether global average pooling was applied to the visual feature vectors and in how the feature difference vectors were generated. Experiments were conducted on two datasets using the proposed models and recent captioning models (X-LAN and X-Transformer), with BLEU, METEOR, ROUGE-L, and CIDEr as evaluation metrics. The model with the best scores on most metrics was the multi-difference non-average-pooling transformer model, which uses the transformer decoder, does not apply global average pooling to the visual feature vectors, and generates the feature differences with the element-wise product. The transformer decoder was found to be more suitable than hierarchical LSTM. Furthermore, for models that do not condense features with global average pooling, the element-wise product was more effective than subtraction in expressing the feature differences.
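For a concrete picture of the compared variants, the following minimal PyTorch sketch contrasts the feature-difference operations the abstract describes: with or without global average pooling of the visual features, and subtraction versus the element-wise product. The function name, tensor shapes, and the exact form of the product are illustrative assumptions, not the authors' released implementation.

    import torch

    def feature_difference(patient_feats, normal_feats, pool=False, mode="product"):
        # patient_feats, normal_feats: visual features of shape (regions, dim),
        # e.g. a CNN grid of 49 regions with 2048-dim vectors (assumed shapes).
        if pool:
            # Global average pooling condenses each image to one (dim,) vector.
            patient_feats = patient_feats.mean(dim=0)
            normal_feats = normal_feats.mean(dim=0)
        if mode == "subtract":
            return patient_feats - normal_feats   # difference by subtraction
        if mode == "product":
            return patient_feats * normal_feats   # difference by element-wise product
        raise ValueError(f"unknown mode: {mode}")

    # Example: the non-average-pooling, element-wise-product configuration,
    # which the abstract reports as best on most metrics.
    patient = torch.randn(49, 2048)
    normal = torch.randn(49, 2048)
    diff = feature_difference(patient, normal, pool=False, mode="product")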
