Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration

Xiao-Chen Zhang; Cheng-Kun Wu; Jia-Cai Yi; Xiang-Xiang Zeng; Can-Qun Yang; Ai-Ping Lu; Ting-Jun Hou; Dong-Sheng Cao

doi:10.34133/research.0004

Research (Jan 2022)

Affiliations

Xiao-Chen Zhang: Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, P. R. China.
Cheng-Kun Wu: College of Computer, National University of Defense Technology, Changsha 410005, Hunan, P. R. China.
Jia-Cai Yi: College of Computer, National University of Defense Technology, Changsha 410005, Hunan, P. R. China.
Xiang-Xiang Zeng: Department of Computer Science, Hunan University, Changsha 410082, Hunan, P. R. China.
Can-Qun Yang: College of Computer, National University of Defense Technology, Changsha 410005, Hunan, P. R. China.
Ai-Ping Lu: Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR 999077, P. R. China.
Ting-Jun Hou: Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China.
Dong-Sheng Cao: Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, P. R. China.

DOI: https://doi.org/10.34133/research.0004
Journal volume & issue: Vol. 2022

Abstract

Accurate prediction of the pharmacological properties of small molecules is becoming increasingly important in drug discovery. Traditional feature-engineering approaches rely heavily on handcrafted descriptors and/or fingerprints, which require extensive expert knowledge. With the rapid progress of artificial intelligence technology, data-driven deep learning methods have shown unparalleled advantages over feature-engineering-based methods. However, when applied to molecular property prediction, existing deep learning methods usually suffer from a scarcity of labeled data and an inability to share information between tasks, which results in poor generalization. Here, we propose a novel multitask learning BERT (Bidirectional Encoder Representations from Transformers) framework, named MTL-BERT, which leverages large-scale pretraining, multitask learning, and SMILES (simplified molecular input line entry system) enumeration to alleviate the data scarcity problem. MTL-BERT first exploits a large amount of unlabeled data through self-supervised pretraining to mine the rich contextual information in SMILES strings and then fine-tunes the pretrained model on multiple downstream tasks simultaneously by leveraging their shared information. Meanwhile, SMILES enumeration is used as a data augmentation strategy during the pretraining, fine-tuning, and test phases to substantially increase data diversity and help the model learn the key relevant patterns from complex SMILES strings. The experimental results show that the pretrained MTL-BERT model, with little additional fine-tuning, achieves much better performance than state-of-the-art methods on most of the 60 practical molecular datasets. Additionally, MTL-BERT leverages attention mechanisms to focus on the SMILES character features essential to the target properties, which aids model interpretability.
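To make the test-phase use of SMILES enumeration concrete, the sketch below averages a model's output over several equivalent SMILES strings of one molecule. It is only an illustration under stated assumptions: the abstract does not say how predictions from enumerated strings are combined, so simple averaging is assumed here; `predict_with_enumeration` and `toy_predict` are hypothetical names, and `toy_predict` is a stand-in for a fine-tuned model, not MTL-BERT itself. The three ethanol strings are written by hand; in practice a cheminformatics toolkit would generate them.

```python
from statistics import mean
from typing import Callable, Sequence


def predict_with_enumeration(
    smiles_variants: Sequence[str],
    predict: Callable[[str], float],
) -> float:
    """Average a per-string prediction over equivalent SMILES of one molecule.

    Averaging is an assumption made for this sketch; the abstract only
    states that enumeration is applied in the test phase.
    """
    if not smiles_variants:
        raise ValueError("need at least one SMILES string")
    return mean([predict(s) for s in smiles_variants])


# Three hand-written, valid SMILES strings that all denote ethanol.
ETHANOL_VARIANTS = ["CCO", "OCC", "C(O)C"]


def toy_predict(smiles: str) -> float:
    """Hypothetical stand-in for a trained model.

    The score depends on character order, so different writings of the
    same molecule give different outputs, mimicking a sequence model.
    """
    return sum((i + 1) * ord(ch) for i, ch in enumerate(smiles)) / 1000.0


if __name__ == "__main__":
    per_string = [toy_predict(s) for s in ETHANOL_VARIANTS]
    fused = predict_with_enumeration(ETHANOL_VARIANTS, toy_predict)
    print("per-string predictions:", per_string)
    print("averaged prediction:", fused)
```

Because the averaged value no longer depends on which single string happened to be supplied, a sequence model's sensitivity to the arbitrary atom ordering of a SMILES string is reduced.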

Published in Research

ISSN: 2096-5168 (Print); 2639-5274 (Online)
Publisher: American Association for the Advancement of Science (AAAS)
Country of publisher: United States
LCC subjects: Science
Website: https://spj.science.org/journal/research
