IEEE Access (Jan 2024)
Designing and Evaluating a Dual-Stream Transformer-Based Architecture for Visual Question Answering
Abstract
In Visual Question Answering (VQA), accurate answers often hinge on the effective fusion of textual and visual information. Although the complex architectures that perform this fusion are effective, they typically come at a high cost: a large number of parameters that demand significant processing power and lengthy training times. Moreover, traditional dual-stream approaches prioritize accuracy above all else, neglecting GPU memory requirements and training time. This paper presents a novel dual-stream architecture for VQA whose parameters have been rigorously tested and evaluated not only for performance but also for GPU memory consumption and training time. The results show that it is possible to achieve competitive performance while significantly reducing the computational burden typically associated with complex VQA models.
Keywords