Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms

Xiao Yang; Shyamasree Saha; Aravind Venkatesan; Santosh Tirunagari; Vid Vartak; Johanna McEntyre

doi:10.1038/s41597-023-02617-x

Scientific Data (Oct 2023)

Affiliations

Xiao Yang: Literature Services, EMBL-EBI, Wellcome Trust Genome Campus
Shyamasree Saha: Literature Services, EMBL-EBI, Wellcome Trust Genome Campus
Aravind Venkatesan: Literature Services, EMBL-EBI, Wellcome Trust Genome Campus
Santosh Tirunagari: Literature Services, EMBL-EBI, Wellcome Trust Genome Campus
Vid Vartak: Literature Services, EMBL-EBI, Wellcome Trust Genome Campus
Johanna McEntyre: Literature Services, EMBL-EBI, Wellcome Trust Genome Campus

DOI: https://doi.org/10.1038/s41597-023-02617-x
Journal volume & issue: Vol. 10, no. 1
pp. 1 – 13

Abstract

Named entity recognition (NER) is a widely used text-mining and natural language processing (NLP) subtask. In recent years, deep learning methods have superseded traditional dictionary- and rule-based NER approaches. A high-quality dataset is essential to fully leverage recent deep learning advancements. While several gold-standard corpora for biomedical entities in abstracts exist, only a few are based on full-text research articles. The Europe PMC literature database routinely annotates Gene/Proteins, Diseases, and Organisms entities. To transition this pipeline from a dictionary-based to a machine learning-based approach, we have developed a human-annotated full-text corpus for these entities, comprising 300 full-text open-access research articles. Over 72,000 mentions of biomedical concepts have been identified within approximately 114,000 sentences. This article describes the corpus and details how to access and reuse this open community resource.

Published in Scientific Data

ISSN: 2052-4463 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/sdata/
