Pushing the limits of solubility prediction via quality-oriented data selection

Murat Cihan Sorkun; J.M. Vianney A. Koelman; Süleyman Er

iScience (Jan 2021)

Affiliations

Murat Cihan Sorkun: DIFFER - Dutch Institute for Fundamental Energy Research, De Zaale 20, 5612 AJ Eindhoven, the Netherlands; CCER - Center for Computational Energy Research, De Zaale 20, 5612 AJ Eindhoven, the Netherlands; Department of Applied Physics, Eindhoven University of Technology, 5600 MB Eindhoven, the Netherlands
J.M. Vianney A. Koelman: DIFFER - Dutch Institute for Fundamental Energy Research, De Zaale 20, 5612 AJ Eindhoven, the Netherlands; CCER - Center for Computational Energy Research, De Zaale 20, 5612 AJ Eindhoven, the Netherlands; Department of Applied Physics, Eindhoven University of Technology, 5600 MB Eindhoven, the Netherlands
Süleyman Er: DIFFER - Dutch Institute for Fundamental Energy Research, De Zaale 20, 5612 AJ Eindhoven, the Netherlands; CCER - Center for Computational Energy Research, De Zaale 20, 5612 AJ Eindhoven, the Netherlands; Corresponding author

Journal volume & issue: Vol. 24, no. 1
p. 101961

Abstract

Summary: Accurate prediction of the solubility of chemical substances in solvents remains a challenge. The sparsity of high-quality solubility data is recognized as the biggest hurdle in the development of robust data-driven methods for practical use. Nonetheless, the effects of the quality and quantity of data on aqueous solubility predictions have not yet been scrutinized. In this study, the roles of data set size and quality in the performance of solubility prediction models are unraveled, and the concepts of actual and observed performance are introduced. To narrow the gap between actual and observed performance, a quality-oriented data selection method is designed, which evaluates the quality of the data and extracts the most accurate part of it through statistical validation. Applying this method to the largest publicly available solubility database and using a consensus machine learning approach, a top-performing solubility prediction model is obtained.

Published in iScience

ISSN: 2589-0042 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Science
Website: http://www.cell.com/iscience/home
