Applied Sciences (Aug 2024)
MédicoBERT: A Medical Language Model for Spanish Natural Language Processing Tasks with a Question-Answering Application Using Hyperparameter Optimization
Abstract
The increasing volume of medical information available in digital format presents a significant challenge for researchers seeking to extract relevant information. Manually analyzing voluminous data is time consuming and constrains researchers' productivity. In this context, innovative and intelligent computational approaches to information search, such as large language models (LLMs), offer a promising solution. LLMs understand natural language questions and respond accurately to complex queries, even in the specialized domain of medicine. This paper presents MédicoBERT, a Spanish medical language model developed by adapting a general-domain language model (BERT) to medical terminology and vocabulary related to diseases, treatments, symptoms, and medications. The model was pre-trained on 3 million medical texts containing 1.1 billion words. MédicoBERT was then adapted and evaluated for answering medical questions in Spanish, with promising results. The model was fine-tuned for the question-answering (QA) task using a Spanish corpus of over 34,000 medical question-answer pairs. A search was then conducted to identify the optimal hyperparameter configuration using heuristic methods and nonlinear regression models. MédicoBERT was evaluated using perplexity, which measures the language model's adaptation to Spanish medical vocabulary and yielded a value of 4.28, and the average F1 score on the medical question-answering task, where it obtained 62.35%. The objective of MédicoBERT is to support research in Spanish natural language processing (NLP), with particular emphasis on applications in the medical domain.
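As an illustration of the extractive QA setting the abstract describes, the following is a minimal sketch using the Hugging Face transformers question-answering pipeline with a BERT-style Spanish model. The checkpoint name is a hypothetical placeholder, not the authors' published artifact.

```python
# Minimal sketch of extractive QA with a BERT-style Spanish medical model.
# "some-org/medicobert-qa" is a hypothetical checkpoint name used only
# for illustration; substitute an actual fine-tuned QA model.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="some-org/medicobert-qa",  # hypothetical placeholder
)

# Spanish medical context and question, in the style of the QA corpus
# described in the abstract.
context = (
    "El paracetamol es un analgésico y antipirético utilizado para el "
    "alivio del dolor leve a moderado y la reducción de la fiebre."
)
result = qa(
    question="¿Para qué se utiliza el paracetamol?",
    context=context,
)

# The pipeline returns the extracted answer span and a confidence score.
print(result["answer"], result["score"])
```

In this formulation, the model predicts start and end positions of the answer span within the context; the F1 metric reported in the abstract compares the predicted span's tokens against the reference answer's tokens.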
Keywords