IEEE Access (Jan 2023)

MediBioDeBERTa: Biomedical Language Model With Continuous Learning and Intermediate Fine-Tuning

  • Eunhui Kim,
  • Yuna Jeong,
  • Myung-Seok Choi

DOI
https://doi.org/10.1109/ACCESS.2023.3341612
Journal volume & issue
Vol. 11
pp. 141036 – 141044

Abstract

The emergence of large language models (LLMs) has marked a significant milestone in the evolution of natural language processing. With the expanded use of LLMs across multiple fields, the development of domain-specific pre-trained language models (PLMs) has become a natural progression and requirement. Developing domain-specific PLMs requires careful design, considering not only differences in training methods but also factors such as the type of training data and the choice of hyperparameters. This paper proposes MediBioDeBERTa, a specialized language model (LM) for biomedical applications. First, we present several practical analyses and methods for improving the performance of LMs in specialized domains. As an initial step, we developed SciDeBERTa v2, an LM specialized for the scientific domain; on the SciERC dataset, SciDeBERTa v2 achieves state-of-the-art performance in the named entity recognition (NER) task. We then provide an in-depth analysis of the datasets and training methods used in the biomedical field. Based on these analyses, MediBioDeBERTa was continually trained from SciDeBERTa v2 to specialize in the biomedical domain. Using the biomedical language understanding and reasoning benchmark (BLURB), we analyzed the factors that degrade task performance and propose additional improvement methods based on intermediate fine-tuning. The results demonstrate improved performance over existing state-of-the-art LMs in three BLURB categories, NER, semantic similarity (SS), and question answering (QnA), as well as in the ChemProt relation extraction (RE) task.
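For readers who want a concrete starting point, the sketch below shows one way to run continual masked-language-model pre-training of a DeBERTa-style checkpoint on a biomedical text corpus with Hugging Face Transformers, in the spirit of the continual-training step the abstract describes. The base checkpoint (`microsoft/deberta-v3-base`), the corpus file name, and all hyperparameters are illustrative assumptions, not the authors' released artifacts or settings.

```python
# Hedged sketch: continual MLM pre-training of a DeBERTa-style checkpoint on
# domain text. Checkpoint, file paths, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_ckpt = "microsoft/deberta-v3-base"  # stand-in for a SciDeBERTa-v2-like model
tokenizer = AutoTokenizer.from_pretrained(base_ckpt)
model = AutoModelForMaskedLM.from_pretrained(base_ckpt)

# Plain-text biomedical corpus, one passage per line (path is illustrative).
corpus = load_dataset("text", data_files={"train": "biomedical_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard masked-language-model objective with 15% token masking.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="medibio-deberta-continual",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    learning_rate=5e-5,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

The resulting domain-adapted checkpoint could then be fine-tuned on an intermediate task before the target task, which is the general pattern behind the intermediate fine-tuning the paper investigates; the specific task sequence and settings are those reported in the paper itself.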

Keywords