Digital Chemical Engineering (Mar 2022)
A surrogate model of sigma profile and COSMOSAC activity coefficient predictions using transformer with SMILES input
Abstract
COSMOSAC is a model that allows a priori prediction of activity coefficients for characterizing solute-solvent interactions. The method requires as input the sigma profile, the charge distribution on the surface of the molecule, which can be obtained through quantum mechanical calculation. Since the sigma profile is a unique function of molecular structure, it is desirable to obtain it from a surrogate model of the quantum computation that takes a molecular description as input. Previously, a model called the Universal Digital Chemical Space (UDCS) was developed that calculates sigma profiles using the Simplified Molecular-Input Line-Entry System (SMILES) as input. In this work, an improved version of this approach was developed using a Transformer model to encode the SMILES text string. Successive input elements in the text string, known as K-mers, were also encoded, and the errors in the predicted moments of the sigma profiles and in the predicted activity coefficients in reference solvents were also included in the loss function. Results showed that while the prediction accuracy of the sigma profile (coefficient of determination, R²) was not significantly improved, the prediction accuracy of the first and second moments, especially for the poorer-ranked results, as well as that of the activity coefficients, can be significantly improved with the inclusion of higher K-mers. Further improvement can be achieved with the inclusion of the activity loss, which substantially improved the accuracy of the 5th and 25th percentiles of the moment loss and of the activity coefficient of the species in n-hexane.
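As a rough illustration of two ideas mentioned in the abstract, the sketch below extracts K-mers from a SMILES string and evaluates a moment-based error between a predicted and a reference sigma profile. The abstract does not specify the tokenization scheme or the exact form of the moment loss, so everything here is an assumption made for illustration: the function names `smiles_kmers`, `profile_moment`, and `moment_loss` are hypothetical, K-mers are taken at the character level (so multi-character atoms such as `Cl` are split), and the n-th moment is taken as the discrete sum of p(sigma_i) * sigma_i^n with an unweighted squared-error penalty. The activity-coefficient term of the loss is not sketched because it requires a full COSMOSAC calculation.

```python
from typing import List, Sequence


def smiles_kmers(smiles: str, k: int) -> List[str]:
    """Return all overlapping character-level substrings of length k.

    Assumption: characters stand in for tokens; a real SMILES tokenizer
    would keep multi-character atoms and bracket groups together.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return [smiles[i:i + k] for i in range(len(smiles) - k + 1)]


def profile_moment(sigma: Sequence[float], profile: Sequence[float], order: int) -> float:
    """Discrete moment of a sigma profile: sum_i p(sigma_i) * sigma_i ** order."""
    if len(sigma) != len(profile):
        raise ValueError("sigma grid and profile must have the same length")
    return sum(p * s ** order for s, p in zip(sigma, profile))


def moment_loss(
    sigma: Sequence[float],
    predicted: Sequence[float],
    reference: Sequence[float],
    orders: Sequence[int] = (1, 2),
) -> float:
    """Sum of squared differences between predicted and reference moments.

    Assumption: unweighted squared error over the first and second moments.
    """
    return sum(
        (profile_moment(sigma, predicted, n) - profile_moment(sigma, reference, n)) ** 2
        for n in orders
    )


if __name__ == "__main__":
    # Ethanol as a toy input; the grid and profiles below are made-up numbers.
    print(smiles_kmers("CCO", 2))
    grid = [-1.0, 0.0, 1.0]
    print(moment_loss(grid, [1.0, 2.0, 1.0], [0.0, 2.0, 2.0]))
```

In a training setup, a term of this kind would presumably be added to the profile-reconstruction loss so that errors in integrated quantities are penalized directly, which is one plausible reading of why the moment and activity terms help the poorer-ranked predictions.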