Applied Sciences (Jun 2022)

Iterative Annotation of Biomedical NER Corpora with Deep Neural Networks and Knowledge Bases

  • Stefano Silvestri,
  • Francesco Gargiulo,
  • Mario Ciampi

DOI
https://doi.org/10.3390/app12125775
Journal volume & issue
Vol. 12, no. 12
p. 5775

Abstract

Read online

The large availability of clinical natural language documents, such as clinical narratives or diagnoses, requires the definition of smart automatic systems for their processing and analysis, but the lack of annotated corpora in the biomedical domain, especially in languages different from English, makes it difficult to exploit the state-of-art machine-learning systems to extract information from such kinds of documents. For these reasons, healthcare professionals lose big opportunities that can arise from the analysis of this data. In this paper, we propose a methodology to reduce the manual efforts needed to annotate a biomedical named entity recognition (B-NER) corpus, exploiting both active learning and distant supervision, respectively based on deep learning models (e.g., Bi-LSTM, word2vec FastText, ELMo and BERT) and biomedical knowledge bases, in order to speed up the annotation task and limit class imbalance issues. We assessed this approach by creating an Italian-language electronic health record corpus annotated with biomedical domain entities in a small fraction of the time required for a fully manual annotation. The obtained corpus was used to train a B-NER deep neural network whose performances are comparable with the state of the art, with an F1-Score equal to 0.9661 and 0.8875 on two test sets.

Keywords