Journal of Cheminformatics (Apr 2025)
InertDB as a generative AI-expanded resource of biologically inactive small molecules from PubChem
Abstract
The development of robust artificial intelligence (AI)-driven predictive models relies on high-quality, diverse chemical datasets. However, the scarcity of negative data and a publication bias toward positive results often hinder accurate biological activity prediction. To address this challenge, we introduce InertDB, a comprehensive database comprising 3,205 curated inactive compounds (CICs) identified through rigorous review of over 4.6 million compound records in PubChem. CIC selection prioritized bioassay diversity, determined using natural language processing (NLP)-based clustering metrics, while ensuring minimal biological activity across all evaluated bioassays. Notably, 97.2% of CICs adhere to the Rule of Five, a proportion significantly higher than that of the overall PubChem dataset. To further expand the chemical space, InertDB also features 64,368 generated inactive compounds (GICs) produced using a deep generative AI model trained on the CIC dataset. Compared to conventional approaches such as random sampling or property-matched decoys, InertDB significantly improves predictive AI performance, particularly for phenotypic activity prediction, by providing reliable inactive compound sets.
Scientific contributions
InertDB addresses a critical gap in AI-driven drug discovery by providing a comprehensive repository of biologically inactive compounds, mitigating the scarcity of negative data that limits prediction accuracy and model reliability. By leveraging language model-based bioassay diversity metrics and generative AI, InertDB integrates rigorously curated inactive compounds with an expanded chemical space. InertDB serves as a valuable alternative to random sampling and decoy generation, offering improved training datasets and enhancing the accuracy of phenotypic pharmacological activity prediction.
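For readers unfamiliar with the Rule of Five statistic quoted above, the following is a minimal sketch of how such a compliance fraction could be computed from precomputed descriptors. It assumes the standard Lipinski thresholds (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors). The abstract does not state whether one violation is tolerated, so the `max_violations` parameter, the `Descriptors` class, the function names, and the descriptor values are all hypothetical and are not taken from InertDB.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one compound (hypothetical container)."""
    mol_weight: float       # molecular weight in Da
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def ro5_violations(d: Descriptors) -> int:
    """Count how many of the four Lipinski thresholds the compound exceeds."""
    checks = (
        d.mol_weight > 500.0,
        d.log_p > 5.0,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    )
    return sum(checks)


def passes_ro5(d: Descriptors, max_violations: int = 0) -> bool:
    """Assumption: strict compliance by default; pass max_violations=1
    for the common 'at most one violation' convention."""
    return ro5_violations(d) <= max_violations


def fraction_passing(compounds: Iterable[Descriptors],
                     max_violations: int = 0) -> float:
    """Fraction of compounds that satisfy the Rule of Five."""
    items = list(compounds)
    if not items:
        raise ValueError("compound set is empty")
    passing = sum(passes_ro5(c, max_violations) for c in items)
    return passing / len(items)


# Illustrative descriptor values only, not real InertDB entries.
small_polar = Descriptors(180.2, 1.2, 1, 4)
large_lipophilic = Descriptors(720.0, 6.3, 6, 12)
print(fraction_passing([small_polar, large_lipophilic]))
```

In practice the descriptors would come from a cheminformatics toolkit rather than being entered by hand; they are hard-coded here only to keep the sketch self-contained.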
Keywords