Many heads but one brain: FusionBrain – a single multimodal multitask architecture and a competition

D.D. Bakshandaeva; D.V. Dimitrov; V.S. Arkhipkin; A.V. Shonenkov; M.S. Potanin; D.K. Karachev; A.V. Kuznetsov; A.D. Voronov; A.A. Petiushko; V.F. Davydova; E.V. Tutubalina

doi:10.18287/2412-6179-co-1220

Компьютерная оптика (Feb 2023)

Affiliations

D.D. Bakshandaeva: Sber AI; University of Helsinki
D.V. Dimitrov: Sber AI; Artificial Intelligence Research Institute; Moscow State University
V.S. Arkhipkin: Sber AI
A.V. Shonenkov: Artificial Intelligence Research Institute
M.S. Potanin: Artificial Intelligence Research Institute
D.K. Karachev: Artificial Intelligence Research Institute
A.V. Kuznetsov: Sber AI; Artificial Intelligence Research Institute; Samara National Research University
A.D. Voronov: Artificial Intelligence Research Institute
A.A. Petiushko: Artificial Intelligence Research Institute
V.F. Davydova: Sber AI
E.V. Tutubalina: Sber AI; Artificial Intelligence Research Institute; National Research University Higher School of Economics

DOI: https://doi.org/10.18287/2412-6179-co-1220
Journal volume & issue: Vol. 47, no. 1, pp. 185–195

Abstract

Supporting the current trend in the AI community, we present FusionBrain, the AI Journey 2021 Challenge and the first competition aimed at building a universal architecture that can process different modalities (in this case, images, texts, and code) and solve multiple vision and language tasks. The FusionBrain Challenge combines four specific tasks: Code2code Translation, Handwritten Text Recognition, Zero-shot Object Detection, and Visual Question Answering. We have created datasets for each task to test the participants' submissions. Moreover, we have collected and made publicly available a new handwritten dataset in both English and Russian, which consists of 94,128 pairs of images and texts. We also propose a multimodal and multitask architecture as a baseline solution: it is built around a frozen foundation model and has been trained in Fusion mode as well as in Single-task mode. The proposed Fusion approach proves to be competitive with, and more energy-efficient than, the task-specific one.

Published in Компьютерная оптика

ISSN: 0134-2452 (Print); 2412-6179 (Online)
Publisher: Samara National Research University
Country of publisher: Russian Federation
LCC subjects: Science: Science (General): Cybernetics: Information theory; Science: Physics: Optics. Light
Website: http://computeroptics.ru/eng/index.html
