Large-scale application of named entity recognition to biomedicine and epidemiology.

Shaina Raza; Deepak John Reji; Femi Shajan; Syed Raza Bashir

doi:10.1371/journal.pdig.0000152

PLOS Digital Health (Dec 2022)

DOI: https://doi.org/10.1371/journal.pdig.0000152
Journal volume & issue: Vol. 1, no. 12, p. e0000152

Abstract


Background: Despite significant advancements in biomedical named entity recognition methods, the clinical application of these systems continues to face many challenges: (1) most of the methods are trained on a limited set of clinical entities; (2) these methods rely heavily on a large amount of data for both pre-training and prediction, making their use in production impractical; (3) they do not consider non-clinical entities that are also related to a patient's health, such as social, economic, or demographic factors.

Methods: In this paper, we develop Bio-Epidemiology-NER (https://pypi.org/project/Bio-Epidemiology-NER/), an open-source Python package for detecting biomedical named entities in text. The approach is based on a Transformer-based system and is trained on a dataset annotated with many named entities (medical, clinical, biomedical, and epidemiological). It improves on previous efforts in three ways: (1) it recognizes many clinical entity types, such as medical risk factors, vital signs, drugs, and biological functions; (2) it is easily configurable, reusable, and can scale up for training and inference; (3) it also considers non-clinical factors (such as age, gender, race, and social history) that influence health outcomes. At a high level, it consists of four phases: pre-processing, data parsing, named entity recognition, and named entity enhancement.

Results: Experimental results show that our pipeline outperforms other methods on three benchmark datasets, with macro- and micro-averaged F1 scores of around 90 percent and above.

Conclusion: This package is made publicly available for researchers, doctors, clinicians, and anyone else to extract biomedical named entities from unstructured biomedical texts.
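The results are reported as macro- and micro-averaged F1 scores. As a reminder of how the two averages differ, here is a minimal Python sketch. It is not the authors' evaluation code: the entity labels and the counts in `toy_counts` are hypothetical values chosen only for illustration. Macro-averaging takes the mean of the per-type F1 scores, so rare entity types weigh as much as frequent ones. Micro-averaging pools the counts first, so frequent types dominate.

```python
from typing import Dict, Tuple

# Per-entity-type counts: (true positives, false positives, false negatives).
Counts = Dict[str, Tuple[int, int, int]]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: Counts) -> float:
    """Unweighted mean of the per-type F1 scores."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


def micro_f1(counts: Counts) -> float:
    """F1 computed on counts pooled across all entity types."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


# Hypothetical counts, not taken from the paper.
toy_counts: Counts = {
    "Drug": (90, 10, 10),       # frequent type, F1 = 0.90
    "Age": (8, 2, 2),           # rarer type,    F1 = 0.80
    "Vital_Sign": (1, 1, 3),    # rare type,     F1 = 1/3
}

print(f"macro F1 = {macro_f1(toy_counts):.3f}")
print(f"micro F1 = {micro_f1(toy_counts):.3f}")
```

With these toy counts the macro average is noticeably lower than the micro average, because the poorly recognized rare type pulls the unweighted mean down. When both averages are high, as reported above, performance is not being carried by a few frequent entity types alone.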

Published in PLOS Digital Health

ISSN: 2767-3170 (Online)
Publisher: Public Library of Science (PLoS)
Country of publisher: United States
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: https://journals.plos.org/digitalhealth/
