Towards building multilingual language model for medicine

Pengcheng Qiu; Chaoyi Wu; Xiaoman Zhang; Weixiong Lin; Haicheng Wang; Ya Zhang; Yanfeng Wang; Weidi Xie

doi:10.1038/s41467-024-52417-z

Nature Communications (Sep 2024)

Affiliations

Pengcheng Qiu: Shanghai Jiao Tong University
Chaoyi Wu: Shanghai Jiao Tong University
Xiaoman Zhang: Shanghai Jiao Tong University
Weixiong Lin: Shanghai Jiao Tong University
Haicheng Wang: Shanghai Jiao Tong University
Ya Zhang: Shanghai Jiao Tong University
Yanfeng Wang: Shanghai Jiao Tong University
Weidi Xie: Shanghai Jiao Tong University

DOI: https://doi.org/10.1038/s41467-024-52417-z
Journal volume & issue: Vol. 15, no. 1
pp. 1 – 15

Abstract

The development of open-source, multilingual medical language models can benefit a wide, linguistically diverse audience across different regions. To advance this area, we make three contributions. First, we construct a multilingual medical corpus, termed MMedC, containing approximately 25.5B tokens across 6 main languages, which enables auto-regressive domain adaptation of general large language models (LLMs). Second, to monitor the development of multilingual medical LLMs, we propose a multilingual medical multiple-choice question-answering benchmark with rationales, termed MMedBench. Third, we assess a number of open-source LLMs on our benchmark, along with those further trained auto-regressively on MMedC. Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks, even rivaling GPT-4. In conclusion, we present a large-scale corpus, a benchmark, and a series of models to support the development of multilingual medical LLMs.

Published in Nature Communications

ISSN: 2041-1723 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/ncomms/
