PharmaBench: Enhancing ADMET benchmarks with large language models

Zhangming Niu; Xianglu Xiao; Wenfan Wu; Qiwei Cai; Yinghui Jiang; Wangzhen Jin; Minhao Wang; Guojian Yang; Lingkang Kong; Xurui Jin; Guang Yang; Hongming Chen

doi:10.1038/s41597-024-03793-0

Scientific Data (Sep 2024)

Affiliations

Zhangming Niu: MindRank AI
Xianglu Xiao: MindRank AI
Wenfan Wu: MindRank AI
Qiwei Cai: MindRank AI
Yinghui Jiang: MindRank AI
Wangzhen Jin: MindRank AI
Minhao Wang: MindRank AI
Guojian Yang: MindRank AI
Lingkang Kong: MindRank AI
Xurui Jin: MindRank AI
Guang Yang: National Heart and Lung Institute, Imperial College London
Hongming Chen: Department of Bioinformatics and Systems Biology, Huazhong University of Science and Technology College of Life Sciences and Technology

DOI: https://doi.org/10.1038/s41597-024-03793-0
Journal volume & issue: Vol. 11, no. 1
pp. 1 – 15

Abstract

Accurately predicting ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties early in drug development is essential for selecting compounds with optimal pharmacokinetics and minimal toxicity. Existing ADMET-related benchmark sets are limited in utility due to their small dataset sizes and the lack of representation of compounds used in drug discovery projects. These shortcomings hinder their application in model building for drug discovery. To address this issue, we propose a multi-agent data mining system based on Large Language Models that effectively identifies experimental conditions within 14,401 bioassays. This approach facilitates merging entries from different sources, culminating in the creation of PharmaBench. Additionally, we have developed a data processing workflow to integrate data from various sources, resulting in 156,618 raw entries. Through this workflow, we constructed PharmaBench, a comprehensive benchmark set for ADMET properties, which comprises eleven ADMET datasets and 52,482 entries. This benchmark set is designed to serve as an open-source dataset for the development of AI models relevant to drug discovery projects.

Published in Scientific Data

ISSN: 2052-4463 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/sdata/
