Journal of Cheminformatics (Jun 2024)

A BERT-based pretraining model for extracting molecular structural information from a SMILES sequence

  • Xiaofan Zheng,
  • Yoichi Tomiura

DOI
https://doi.org/10.1186/s13321-024-00848-7
Journal volume & issue
Vol. 16, no. 1
pp. 1–9

Abstract

Among the various molecular properties and their combinations, obtaining the desired molecular properties through theory or experiment is a costly process. Using machine learning to analyze molecular structural features and predict molecular properties is a potentially efficient alternative for accelerating molecular property prediction. In this study, we analyze molecular properties through the molecular structure from the perspective of machine learning. We use SMILES sequences as inputs to an artificial neural network to extract molecular structural features and predict molecular properties. A SMILES sequence comprises symbols representing molecular structures. To address the problem that a SMILES sequence differs from actual molecular structural data, we propose a pretraining model for SMILES sequences based on the BERT model, which is widely used in natural language processing, so that the model learns to extract the molecular structural information contained in a SMILES sequence. In an experiment, we first pretrain the proposed model with 100,000 SMILES sequences and then use the pretrained model to predict molecular properties on 22 data sets and the odor characteristics of molecules (98 types of odor descriptor). The experimental results show that our proposed pretraining model effectively improves the performance of molecular property prediction.

Scientific contribution
The 2-encoder pretraining is proposed based on two observations: symbols in a SMILES sequence depend less on their contextual environment than words in a natural language sentence, and one compound corresponds to multiple SMILES sequences. The model pretrained with the 2-encoder shows higher robustness in molecular property prediction tasks than BERT, which is adept at natural language.
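For readers unfamiliar with BERT-style pretraining on SMILES, the sketch below illustrates the general idea of masked-token pretraining with a Transformer encoder. It is a minimal, hypothetical example assuming PyTorch: the tokenizer, toy vocabulary, model sizes, and single-encoder setup are illustrative simplifications and do not reproduce the authors' 2-encoder architecture.

# Minimal sketch of BERT-style masked-token pretraining on SMILES (assumes PyTorch).
# The tokenizer, vocabulary, and hyperparameters are hypothetical, not the paper's.
import re
import torch
import torch.nn as nn

# Very small symbol-level SMILES tokenizer (hypothetical simplification).
SMILES_TOKEN_RE = re.compile(r"Cl|Br|\[[^\]]+\]|.")

def tokenize(smiles: str):
    return SMILES_TOKEN_RE.findall(smiles)

# Toy corpus and vocabulary built from a few example molecules.
corpus = ["CCO", "c1ccccc1", "CC(=O)O", "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"]
vocab = {"[PAD]": 0, "[MASK]": 1}
for s in corpus:
    for t in tokenize(s):
        vocab.setdefault(t, len(vocab))

class TinySmilesEncoder(nn.Module):
    """Transformer encoder with a masked-token prediction head (BERT-style)."""
    def __init__(self, vocab_size, d_model=64, nhead=4, num_layers=2, max_len=128):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        positions = torch.arange(ids.size(1), device=ids.device).unsqueeze(0)
        hidden = self.encoder(self.tok(ids) + self.pos(positions))
        return self.mlm_head(hidden)

def mask_tokens(ids, mask_id, prob=0.15):
    """Randomly replace tokens with [MASK]; the model must recover the originals."""
    labels = ids.clone()
    mask = torch.rand(ids.shape) < prob
    if mask.sum() == 0:          # guarantee at least one masked position
        mask[..., 0] = True
    labels[~mask] = -100         # ignore unmasked positions in the loss
    masked_ids = ids.clone()
    masked_ids[mask] = mask_id
    return masked_ids, labels

model = TinySmilesEncoder(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

for step in range(100):
    s = corpus[step % len(corpus)]
    ids = torch.tensor([[vocab[t] for t in tokenize(s)]])
    masked, labels = mask_tokens(ids, vocab["[MASK]"])
    logits = model(masked)
    loss = loss_fn(logits.view(-1, len(vocab)), labels.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

After such pretraining, the encoder's hidden states can be pooled and fed to a small prediction head that is fine-tuned on a downstream property data set; the paper's contribution lies in adapting this pretraining objective to the peculiarities of SMILES (weaker context dependence and multiple SMILES per compound) via a 2-encoder design.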

Keywords