Hybrid fragment-SMILES tokenization for ADMET prediction in drug discovery

Nicholas Aksamit; Alain Tchagang; Yifeng Li; Beatrice Ombuki-Berman

doi:10.1186/s12859-024-05861-z

BMC Bioinformatics (Aug 2024)

Affiliations

Nicholas Aksamit: Department of Computer Science, Brock University
Alain Tchagang: Digital Technologies Research Centre, National Research Council Canada
Yifeng Li: Department of Computer Science, Brock University
Beatrice Ombuki-Berman: Department of Computer Science, Brock University

Journal volume & issue: Vol. 25, no. 1, pp. 1-25

Abstract

Background: Drug discovery and development is an extremely costly and time-consuming process of identifying new molecules that can interact with a biomarker target to interrupt the disease pathway of interest. In addition to binding the target, a drug candidate needs to satisfy multiple properties affecting absorption, distribution, metabolism, excretion, and toxicity (ADMET). Artificial intelligence approaches provide an opportunity to improve each step of the drug discovery and development process, and the first question they raise is how a molecule can be represented informatively so that in-silico solutions are optimized.

Results: This study introduces a novel hybrid SMILES-fragment tokenization method, coupled with two pre-training strategies, using a Transformer-based model. We investigate the efficacy of hybrid tokenization in improving the performance of ADMET prediction tasks. Our approach leverages MTL-BERT, an encoder-only Transformer model that achieves state-of-the-art ADMET predictions, and contrasts the standard SMILES tokenization with our hybrid method across a spectrum of fragment library cutoffs.

Conclusion: The findings reveal that while an excess of fragments can impede performance, hybrid tokenization with high-frequency fragments improves results beyond those of the base SMILES tokenization. This advancement underscores the potential of integrating fragment-level and character-level molecular features within the training of Transformer models for ADMET property prediction.
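
The general idea of mixing fragment-level and character-level tokens under a frequency cutoff can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how molecules are fragmented or how tokens are encoded, so the sketch assumes each molecule has already been split into fragment SMILES strings by some external tool. The function names, the `<...>` fragment-token format, the `min_count` cutoff parameter, and the simplified SMILES regular expression are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Simplified character-level SMILES tokenizer (an assumption, not the paper's):
# bracket atoms, two-letter halogens, two-digit ring closures, then single symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@*$]")


def smiles_tokens(smiles):
    """Split a SMILES string into character-level tokens."""
    return SMILES_TOKEN.findall(smiles)


def build_fragment_vocab(fragmented_molecules, min_count):
    """Keep only fragments that occur at least `min_count` times in the corpus."""
    counts = Counter(frag for mol in fragmented_molecules for frag in mol)
    return {frag for frag, n in counts.items() if n >= min_count}


def hybrid_tokenize(fragments, vocab):
    """Emit one token per high-frequency fragment; fall back to SMILES tokens otherwise."""
    tokens = []
    for frag in fragments:
        if frag in vocab:
            tokens.append(f"<{frag}>")  # hypothetical fragment-token format
        else:
            tokens.extend(smiles_tokens(frag))
    return tokens


# Toy corpus: each molecule is a list of pre-computed fragment SMILES (hypothetical data).
corpus = [
    ["c1ccccc1", "C(=O)O"],
    ["c1ccccc1", "CCl"],
    ["c1ccccc1", "N"],
]
vocab = build_fragment_vocab(corpus, min_count=2)
example = hybrid_tokenize(["c1ccccc1", "CCl"], vocab)
```

Raising `min_count` shrinks the fragment vocabulary toward plain SMILES tokenization, while lowering it admits rarer fragments; the abstract reports that too many fragments can hurt performance, which is the trade-off this cutoff is meant to expose.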

Published in BMC Bioinformatics

ISSN: 1471-2105 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Biology (General)
Website: http://www.biomedcentral.com/bmcbioinformatics/
