Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature

Melanie Vollmar; Santosh Tirunagari; Deborah Harrus; David Armstrong; Romana Gáborová; Deepti Gupta; Marcelo Querino Lima Afonso; Genevieve Evans; Sameer Velankar

doi:10.1038/s41597-024-03841-9

Scientific Data (Sep 2024)

Affiliations

Melanie Vollmar: Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton
Santosh Tirunagari: Literature Services, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton
Deborah Harrus: Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton
David Armstrong: Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton
Romana Gáborová: CEITEC - Central European Institute of Technology, Masaryk University
Deepti Gupta: Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton
Marcelo Querino Lima Afonso: Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton
Genevieve Evans: Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton
Sameer Velankar: Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton

Journal volume & issue: Vol. 11, no. 1
pp. 1 – 18

Abstract

We present a novel system that keeps curators in the loop to develop a dataset and a model for detecting structure features and functional annotations at the residue level in standard publication text. Our approach integrates data from multiple resources, including PDBe, EuropePMC, PubMedCentral, and PubMed, with annotation guidelines from UniProt, and uses LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we used to train a starting PubmedBert model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model, which achieved a precision of 0.90, a recall of 0.92, and an F1-measure of 0.91. The system shows that machine learning techniques and human expertise can be combined successfully to curate a dataset for residue-level functional annotations and protein structure features. The results indicate potential for broader applications in protein research, bridging the gap between advanced machine learning models and the insights of domain experts.

Published in Scientific Data

ISSN: 2052-4463 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/sdata/
