Combatting over-specialization bias in growing chemical databases

Katharina Dost; Zac Pullar-Strecker; Liam Brydon; Kunyang Zhang; Jasmin Hafner; Patricia J. Riddle; Jörg S. Wicker

doi:10.1186/s13321-023-00716-w

Journal of Cheminformatics (May 2023)

Affiliations

Katharina Dost: School of Computer Science, University of Auckland
Zac Pullar-Strecker: School of Computer Science, University of Auckland
Liam Brydon: School of Computer Science, University of Auckland
Kunyang Zhang: Eawag-Swiss Federal Institute of Aquatic Science and Technology
Jasmin Hafner: Eawag-Swiss Federal Institute of Aquatic Science and Technology
Patricia J. Riddle: School of Computer Science, University of Auckland
Jörg S. Wicker: School of Computer Science, University of Auckland

DOI: https://doi.org/10.1186/s13321-023-00716-w
Journal volume & issue: Vol. 15, no. 1, pp. 1–17

Abstract

Background: Predicting the behavior of new chemical compounds in advance can support the design of new products by directing research toward the most promising candidates and ruling out others. Such predictive models can be data-driven, using Machine Learning, or based on researchers’ experience, and they depend on the collection of past results. In either case, models (or researchers) can only make reliable assumptions about compounds that are similar to what they have seen before. Consistent use of these predictive models therefore shapes the dataset and causes a continuous specialization that shrinks the applicability domain of all future models trained on this dataset and increasingly harms model-based exploration of the space.

Proposed solution: In this paper, we propose cancels (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas of the space that are under-represented and suggest additional experiments that help bridge the gap. In doing so, we improve the overall dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. cancels does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain.

Results: An extensive set of experiments on the use case of biodegradation pathway prediction reveals not only that the bias spiral can indeed be observed but also that cancels produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial: it not only interrupts the continuous specialization process but also significantly improves a predictor’s performance while reducing the number of required experiments.

Overall, we believe that cancels can support researchers in their experimentation process, helping them not only to better understand their data and its potential flaws but also to grow the dataset in a sustainable way. All code is available at github.com/KatDost/Cancels.
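
The idea of finding areas that fall short of a smooth distribution and suggesting experiments to fill them can be illustrated with a small sketch. This is not the authors' implementation (see the repository above for that); it is a one-dimensional toy under stated assumptions: each compound is reduced to a single hypothetical numeric descriptor, "smooth" is approximated by a Gaussian fitted with the sample mean and standard deviation, and the function names `bin_deficits` and `rank_candidates` are invented for illustration.

```python
import math
import numpy as np


def bin_deficits(values, n_bins=20):
    """Per-bin shortfall of the data relative to a fitted Gaussian.

    Assumption: a Gaussian fitted by sample mean/std stands in for the
    'smooth' target distribution. This is a simplification for illustration.
    """
    values = np.asarray(values, dtype=float)
    mu, sigma = values.mean(), values.std()
    counts, edges = np.histogram(values, bins=n_bins)
    # Gaussian CDF at each bin edge, computed with the error function.
    cdf = np.array(
        [0.5 * (1.0 + math.erf((e - mu) / (sigma * math.sqrt(2.0)))) for e in edges]
    )
    expected = values.size * np.diff(cdf)
    # Only shortfalls matter here; bins with a surplus get a deficit of 0.
    deficit = np.clip(expected - counts, 0.0, None)
    return edges, deficit


def rank_candidates(values, pool, n_bins=20):
    """Rank pool candidates by the deficit of the bin they would fall into."""
    edges, deficit = bin_deficits(values, n_bins)
    pool = np.asarray(pool, dtype=float)
    idx = np.digitize(pool, edges) - 1
    # Candidates outside the histogram range (or exactly on the last edge)
    # score 0, so the sketch never pushes beyond the observed domain.
    inside = (idx >= 0) & (idx < deficit.size)
    scores = np.where(inside, deficit[np.clip(idx, 0, deficit.size - 1)], 0.0)
    order = np.argsort(-scores, kind="stable")
    return order, scores


# Toy data: a normal sample with an artificial gap between 0.3 and 0.9.
rng = np.random.default_rng(0)
sample = rng.normal(size=2000)
data = sample[(sample < 0.3) | (sample > 0.9)]

# Hypothetical descriptor values of compounds that could be tested next.
pool = np.array([-2.5, 0.0, 0.6, 2.0])
order, scores = rank_candidates(data, pool)
print("candidates ranked by deficit:", pool[order])
```

Scoring out-of-range candidates as zero mirrors the stated goal of retaining specialization to the research domain instead of covering the whole space. Note that fitting the Gaussian to already biased data skews the fit itself, and real compound representations are high-dimensional, so a practical method needs more care than this sketch takes.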

Published in Journal of Cheminformatics

ISSN: 1758-2946 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Chemistry
Website: https://jcheminf.biomedcentral.com/
