Visual Intelligence (Dec 2024)

Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

  • Zhangwei Gao,
  • Zhe Chen,
  • Erfei Cui,
  • Yiming Ren,
  • Weiyun Wang,
  • Jinguo Zhu,
  • Hao Tian,
  • Shenglong Ye,
  • Junjun He,
  • Xizhou Zhu,
  • Lewei Lu,
  • Tong Lu,
  • Yu Qiao,
  • Jifeng Dai,
  • Wenhai Wang

DOI
https://doi.org/10.1007/s44267-024-00067-6
Journal volume & issue
Vol. 2, no. 1
pp. 1–17

Abstract

Multi-modal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a wide range of domains. However, their large model scale and the associated high computational cost pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, thereby hindering their widespread application. In this work, we introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1 billion to 4 billion, which achieves 90% of the performance with only 5% of the parameters. This significant improvement in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios. To further promote the adoption of our models, we develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer to and outperform specialized models on downstream tasks, including autonomous driving, medical image processing, and remote sensing. We believe that our models can provide valuable insights and resources to advance the development of efficient and effective MLLMs.
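
A rough illustration of the deployment scenario the abstract describes (running a 1-4 billion-parameter MLLM on a single consumer-grade GPU) is the minimal loading sketch below. It assumes a Hugging Face Transformers checkpoint; the model identifier and the use of trust_remote_code are assumptions for illustration, not details stated in this article.

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Assumed checkpoint name; substitute the released Mini-InternVL weights.
    model_id = "OpenGVLab/InternVL2-2B"

    model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,   # half precision keeps a ~2B-parameter model within consumer GPU memory
        low_cpu_mem_usage=True,
        trust_remote_code=True,       # the InternVL family ships custom modeling code
    ).eval().cuda()

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

In bfloat16, a 2-billion-parameter model occupies roughly 4 GB of weight memory, which is consistent with the consumer-GPU and edge-device setting the abstract targets.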

Keywords