Artificial Intelligence in the Life Sciences (Dec 2021)

Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning

  • Talia B. Kimber
  • Maxime Gagnebin
  • Andrea Volkamer

Journal volume & issue
Vol. 1
p. 100014

Abstract

Accurate molecular property or activity prediction is one of the main goals in computer-aided drug design. Quantitative structure-activity relationship (QSAR) modeling and machine learning, and more recently deep learning, have become an integral part of this process. Such algorithms require large amounts of training data, which remain scarce for physico-chemical and bioactivity data sets. To address the lack of data, augmentation techniques are increasingly applied in deep learning. Here, we exploit the fact that one compound can be represented by various SMILES strings as a means of data augmentation, and we explore several augmentation techniques. Convolutional and recurrent neural networks are trained on four data sets, including experimental solubility, lipophilicity, and bioactivity measurements. Moreover, the uncertainty of the models is assessed by applying augmentation to the test set. Our results show that data augmentation improves accuracy regardless of the deep learning model and the size of the data set. The best strategies lead to the Maxsmi models, the models that maximize performance under SMILES augmentation. Our findings show that the standard deviation of the per-SMILES predictions correlates with the accuracy of the associated compound prediction. In addition, our systematic testing of different augmentation strategies provides extensive guidelines for SMILES augmentation. A prediction tool using the Maxsmi models for novel compounds on the aforementioned physico-chemical and bioactivity tasks is made available at https://github.com/volkamerlab/maxsmi.
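
The confidence estimate described in the abstract can be illustrated with a minimal sketch. It is not taken from the Maxsmi code base: the function name `aggregate_predictions` is hypothetical, the numbers stand in for model outputs and are invented for illustration, and using the mean as the compound-level prediction and the population standard deviation as the spread are assumptions made here.

```python
from statistics import mean, pstdev


def aggregate_predictions(per_smiles_predictions):
    """Combine the predictions made for several SMILES strings of ONE compound.

    Returns (compound_prediction, spread). Taking the mean as the compound
    prediction and the population standard deviation as the spread are
    assumptions of this sketch.
    """
    if not per_smiles_predictions:
        raise ValueError("need at least one per-SMILES prediction")
    compound_prediction = mean(per_smiles_predictions)
    spread = pstdev(per_smiles_predictions)
    return compound_prediction, spread


# Ethanol can be written as "CCO", "OCC", or "C(O)C". A trained model would
# produce one prediction per string; the values below are made up.
consistent = [-0.75, -0.75, -0.75]  # the model agrees across all SMILES strings
scattered = [-1.0, 0.0, 1.0]        # the model disagrees across SMILES strings

print(aggregate_predictions(consistent))
print(aggregate_predictions(scattered))
```

A small spread means the model gives nearly the same answer whichever SMILES string it sees. Per the abstract, this spread correlates with the accuracy of the compound-level prediction, so a large value marks a prediction to treat with more caution.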

Keywords