Journal of the Serbian Chemical Society (Apr 2010)
The importance of the accuracy of the experimental data for the prediction of solubility
Abstract
Aqueous solubility is an important factor influencing several aspects of the pharmacokinetic profile of a drug. Numerous publications present methodologies for developing reliable computational models that predict solubility from structure. The quality of such models can be significantly affected by the accuracy of the experimental solubility data employed. In this work, the importance of the accuracy of the experimental solubility data used for model training was investigated. Three data sets were used as training sets: data set 1, containing solubility data collected from various literature sources according to a few selection criteria (n = 319); data set 2, created by substituting 28 values in data set 1 with experimental data determined under uniform conditions in one laboratory (n = 319); and data set 3, created by adding to data set 2 a further 56 compounds whose solubility was also determined under uniform conditions in the same laboratory (n = 375). The most significant descriptors were selected by the heuristic method, using one-parameter and multi-parameter analysis. Correlations between the most significant descriptors and solubility were established by multi-linear regression analysis (MLR) for all three data sets. Notable differences were observed between the equations obtained for the different data sets, suggesting that models updated with new experimental data need to be re-optimized. The inclusion of uniform experimental data consistently improved the correlation coefficients. These findings contribute to an emerging consensus that improving the reliability of solubility prediction requires data sets containing many diverse compounds whose solubility was measured under standardized conditions.
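The effect described in the abstract, that noisier training values lower the correlation obtained from an MLR fit, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two descriptors, the coefficients of the "true" relationship, and the noise levels are synthetic, and the helper names `fit_mlr` and `r_squared` are hypothetical. It does not reproduce the paper's descriptors, its heuristic descriptor selection, or its data sets; only the training-set size of 319 is borrowed from the abstract.

```python
import random

def fit_mlr(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bp*xp via the normal equations."""
    n, p = len(X), len(X[0]) + 1
    A = [[1.0] + list(row) for row in X]  # prepend an intercept column
    # Augmented system (A^T A | A^T y)
    M = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(p)]
         + [sum(A[k][i] * y[k] for k in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]  # [b0, b1, ..., bp]

def r_squared(X, y, coef):
    """Coefficient of determination of the fitted model on (X, y)."""
    pred = [coef[0] + sum(c * x for c, x in zip(coef[1:], row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Synthetic illustration only: two made-up descriptors and an assumed linear
# relationship standing in for log-solubility. None of this is the paper's data.
random.seed(0)
X = [[random.uniform(-2.0, 2.0), random.uniform(0.0, 5.0)] for _ in range(319)]
true_log_s = [-0.5 - 1.0 * x1 + 0.3 * x2 for x1, x2 in X]

# "Uniform" measurements: small assumed experimental error (sd = 0.2).
y_uniform = [v + random.gauss(0.0, 0.2) for v in true_log_s]
# "Mixed-source" measurements: larger assumed experimental error (sd = 1.0).
y_mixed = [v + random.gauss(0.0, 1.0) for v in true_log_s]

r2_uniform = r_squared(X, y_uniform, fit_mlr(X, y_uniform))
r2_mixed = r_squared(X, y_mixed, fit_mlr(X, y_mixed))
print(f"R^2, low-noise training data:  {r2_uniform:.3f}")
print(f"R^2, high-noise training data: {r2_mixed:.3f}")
```

The fitted coefficients also differ between the two runs, which is a toy analogue of the observation that the regression equations change when the training data are updated.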