Intelligent Systems with Applications (May 2023)
Contrastive training of a multimodal encoder for medical visual question answering
Abstract
Models for Visual Question Answering (VQA) on medical images aim to answer diagnostically relevant natural language questions based on the visual contents of the images. In this article, we propose a novel approach to this problem, which combines a strong image encoder based on EfficientNetV2 with a multimodal encoder based on the RealFormer architecture. Our model is pre-trained with a strategy that includes a contrastive objective, and the final fine-tuning on the VQA task uses a loss function that specifically addresses class imbalance. Experimental results confirm the effectiveness of our approach on the VQA-Med dataset from ImageCLEF 2019, showcasing the potential benefits of combining multimodal pre-training with recent advances in neural network architectures.