Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora

Nona Naderi; Julien Knafou; Jenny Copara; Patrick Ruch; Douglas Teodoro

doi:10.3389/frma.2021.689803

Frontiers in Research Metrics and Analytics (Nov 2021)

Affiliations

Nona Naderi: Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland
Nona Naderi: Swiss Institute of Bioinformatics, Geneva, Switzerland
Julien Knafou: Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland
Julien Knafou: Swiss Institute of Bioinformatics, Geneva, Switzerland
Julien Knafou: Computer Science Department, University of Geneva, Geneva, Switzerland
Jenny Copara: Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
Jenny Copara: Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland
Jenny Copara: Swiss Institute of Bioinformatics, Geneva, Switzerland
Patrick Ruch: Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland
Patrick Ruch: Swiss Institute of Bioinformatics, Geneva, Switzerland
Douglas Teodoro: Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
Douglas Teodoro: Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland
Douglas Teodoro: Swiss Institute of Bioinformatics, Geneva, Switzerland

Journal volume & issue: Vol. 6

Abstract

The health and life science domains are well known for their wealth of named entities found in large free-text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods are proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual deep masked language models and ensembles of them perform across corpora from different health and life science domains (biology, chemistry, and medicine) and in different languages (English and French). Individual deep masked language models, pretrained on external corpora, are fine-tuned on task-specific domain and language corpora and ensembled using classical majority voting strategies. Experiments show a statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.
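
The majority voting step mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract only states that "classical majority voting strategies" are used, so the details here are assumptions made for illustration: voting per token, requiring a strict majority, and falling back to the outside label `O` when no label reaches it. The function name `majority_vote` and the `B-CHEM`/`I-CHEM` tags are hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], fallback: str = "O") -> List[str]:
    """Combine per-token label sequences from several models by majority vote.

    predictions[m][t] is the label that model m assigns to token t.
    A label wins only if strictly more than half of the models agree;
    otherwise the token gets `fallback` (an assumption of this sketch,
    not a rule confirmed by the paper).
    """
    if not predictions:
        return []
    n_models = len(predictions)
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    voted = []
    # zip(*predictions) yields, for each token, the labels from every model.
    for token_labels in zip(*predictions):
        label, count = Counter(token_labels).most_common(1)[0]
        voted.append(label if count > n_models / 2 else fallback)
    return voted


# Hypothetical outputs of three fine-tuned models on a four-token sentence.
model_outputs = [
    ["B-CHEM", "I-CHEM", "O", "O"],
    ["B-CHEM", "O", "O", "B-CHEM"],
    ["B-CHEM", "I-CHEM", "O", "O"],
]
ensemble_labels = majority_vote(model_outputs)
print(ensemble_labels)  # ['B-CHEM', 'I-CHEM', 'O', 'O']
```

In this example the second model's isolated `B-CHEM` on the last token is outvoted, which shows how an ensemble can suppress the errors of a single model.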

Published in Frontiers in Research Metrics and Analytics

ISSN: 2504-0537 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Bibliography. Library science. Information resources
Website: http://journal.frontiersin.org/journal/research-metrics-and-analytics
