Named entity recognition of pharmacokinetic parameters in the scientific literature

Ferran Gonzalez Hernandez; Quang Nguyen; Victoria C. Smith; José Antonio Cordero; Maria Rosa Ballester; Màrius Duran; Albert Solé; Palang Chotsiri; Thanaporn Wattanakul; Gill Mundin; Watjana Lilaonitkul; Joseph F. Standing; Frank Kloprogge

doi:10.1038/s41598-024-73338-3

Scientific Reports (Oct 2024)

Affiliations

Ferran Gonzalez Hernandez: Department of Computer Science, University College London
Quang Nguyen: Institute of Health Informatics, University College London
Victoria C. Smith: Institute of Health Informatics, University College London
José Antonio Cordero: Blanquerna School of Health Sciences, Ramon Llull University
Maria Rosa Ballester: Blanquerna School of Health Sciences, Ramon Llull University
Màrius Duran: Blanquerna School of Health Sciences, Ramon Llull University
Albert Solé: Blanquerna School of Health Sciences, Ramon Llull University
Palang Chotsiri: Mahidol Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine, Mahidol University
Thanaporn Wattanakul: Mahidol Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine, Mahidol University
Gill Mundin: Department of Computer Science, University College London
Watjana Lilaonitkul: Global Business School for Health, University College London
Joseph F. Standing: Great Ormond Street Institute for Child Health, University College London
Frank Kloprogge: Institute for Global Health, University College London

DOI: https://doi.org/10.1038/s41598-024-73338-3
Journal volume & issue: Vol. 14, no. 1
pp. 1 – 8

Abstract

The development of accurate predictions for a new drug’s absorption, distribution, metabolism, and excretion profiles in the early stages of drug development is crucial due to high candidate failure rates. The absence of comprehensive, standardised, and updated pharmacokinetic (PK) repositories limits pre-clinical predictions and often requires searching through the scientific literature for PK parameter estimates from similar compounds. While text mining offers promising advancements in automatic PK parameter extraction, accurate Named Entity Recognition (NER) of PK terms remains a bottleneck due to limited resources. This work addresses this gap by introducing novel corpora and language models specifically designed for effective NER of PK parameters. Leveraging active learning approaches, we developed an annotated corpus containing over 4000 entity mentions found across the PK literature on PubMed. To identify the most effective model for PK NER, we fine-tuned and evaluated different NER architectures on our corpus. Fine-tuning BioBERT exhibited the best results, achieving a strict $F_{1}$ score of 90.37% in recognising PK parameter mentions, significantly outperforming heuristic approaches and models trained on existing corpora. To accelerate the development of end-to-end PK information extraction pipelines and improve pre-clinical PK predictions, the PK NER models and the labelled corpus were released open source at https://github.com/PKPDAI/PKNER.
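For readers unfamiliar with the metric, the sketch below illustrates one common reading of a "strict" entity-level $F_{1}$ score, in which a predicted mention counts as correct only if its boundaries and label exactly match a gold annotation. This is an assumption about the usual convention, not the authors' evaluation code; the function name `strict_f1`, the span representation, and the `"PK"` label are hypothetical.

```python
from typing import Iterable, Set, Tuple

# A mention is (start_char, end_char, label). This representation is an
# illustrative assumption, not the format used in the released corpus.
Span = Tuple[int, int, str]


def strict_f1(gold: Iterable[Span], predicted: Iterable[Span]) -> float:
    """Entity-level F1 where only exact boundary-and-label matches count."""
    gold_set: Set[Span] = set(gold)
    pred_set: Set[Span] = set(predicted)

    # A true positive requires identical start, end, and label.
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        return 0.0

    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Spans from a single hypothetical sentence: the second prediction has the
# right start but the wrong end, so strict matching treats it as an error.
gold = [(4, 13, "PK"), (30, 39, "PK")]
pred = [(4, 13, "PK"), (30, 35, "PK")]
score = strict_f1(gold, pred)  # precision 0.5, recall 0.5
```

Under this definition a partially overlapping prediction is penalised twice, once as a false positive and once as a missed gold mention, which is why strict scores tend to be lower than overlap-based ones.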

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/
