Visual Intelligence (Dec 2024)

Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

  • Zhangwei Gao,
  • Zhe Chen,
  • Erfei Cui,
  • Yiming Ren,
  • Weiyun Wang,
  • Jinguo Zhu,
  • Hao Tian,
  • Shenglong Ye,
  • Junjun He,
  • Xizhou Zhu,
  • Lewei Lu,
  • Tong Lu,
  • Yu Qiao,
  • Jifeng Dai,
  • Wenhai Wang

DOI
https://doi.org/10.1007/s44267-024-00067-6
Journal volume & issue
Vol. 2, no. 1
pp. 1–17

Abstract

Multi-modal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a wide range of domains. However, their large model scale and the associated high computational cost pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, thereby hindering their widespread application. In this work, we introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1 billion to 4 billion, which achieves 90% of the performance with only 5% of the parameters. This significant improvement in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios. To further promote the adoption of our models, we develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer to and outperform specialized models on downstream tasks, including autonomous driving, medical image processing, and remote sensing. We believe that our models can provide valuable insights and resources to advance the development of efficient and effective MLLMs.
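
A rough illustration of the deployment scenario the abstract describes (running a 1-4 billion-parameter MLLM on a single consumer-grade GPU) is the minimal loading sketch below. It assumes a Hugging Face Transformers checkpoint; the model identifier and the use of trust_remote_code are assumptions for illustration, not details stated in this article.

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Assumed checkpoint name; substitute the released Mini-InternVL weights.
    model_id = "OpenGVLab/InternVL2-2B"

    model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,   # half precision keeps a ~2B-parameter model within consumer GPU memory
        low_cpu_mem_usage=True,
        trust_remote_code=True,       # the InternVL family ships custom modeling code
    ).eval().cuda()

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

In bfloat16, a 2-billion-parameter model occupies roughly 4 GB of weight memory, which is consistent with the consumer-GPU and edge-device setting the abstract targets.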

Keywords