Accelerating Formulation Design via Machine Learning: Generating a High-throughput Shampoo Formulations Dataset

Aniket Chitre; Robert C. M. Querimit; Simon D. Rihm; Dogancan Karan; Benchuan Zhu; Ke Wang; Long Wang; Kedar Hippalgaonkar; Alexei A. Lapkin

doi:10.1038/s41597-024-03573-w

Scientific Data (Jul 2024)

Affiliations

Aniket Chitre: Department of Chemical Engineering and Biotechnology, University of Cambridge
Robert C. M. Querimit: Institute of Materials Research and Engineering, Agency for Science, Technology and Research (A*STAR)
Simon D. Rihm: Department of Chemical Engineering and Biotechnology, University of Cambridge
Dogancan Karan: Cambridge Centre for Advanced Research and Education in Singapore, CARES Ltd. 1 CREATE Way
Benchuan Zhu: BASF Advanced Chemicals Co. Ltd.
Ke Wang: BASF Advanced Chemicals Co. Ltd.
Long Wang: BASF Advanced Chemicals Co. Ltd.
Kedar Hippalgaonkar: Institute of Materials Research and Engineering, Agency for Science, Technology and Research (A*STAR)
Alexei A. Lapkin: Department of Chemical Engineering and Biotechnology, University of Cambridge

DOI: https://doi.org/10.1038/s41597-024-03573-w
Journal volume & issue: Vol. 11, no. 1, pp. 1–10

Abstract


Liquid formulations are ubiquitous yet have lengthy product development cycles, because the complex physical interactions between ingredients make it difficult to tune formulations to customer-defined property targets. Interpolative machine learning (ML) models can accelerate liquid formulation design but are typically trained on limited sets of ingredients and without any structural information, which limits their predictive capacity outside the training domain. To address this challenge, we selected eighteen formulation ingredients covering a diverse chemical space to prepare an open experimental dataset for training ML models for rinse-off formulation development. The resulting design space represents a more than 50-fold increase in dimensionality compared with our previous work. Here, we present a dataset of 812 formulations, including 294 stable samples, which cover the entire design space, with phase stability, turbidity, and high-fidelity rheology measurements generated on our semi-automated, ML-driven liquid formulations workflow. Uniquely, the dataset includes sample-specific uncertainty measurements for training predictive surrogate models.

Published in Scientific Data

ISSN: 2052-4463 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/sdata/
