InertDB as a generative AI-expanded resource of biologically inactive small molecules from PubChem

Seungchan An; Yeonjin Lee; Junpyo Gong; Seokyoung Hwang; In Guk Park; Jayhyun Cho; Min Ju Lee; Minkyu Kim; Yun Pyo Kang; Minsoo Noh

doi:10.1186/s13321-025-00999-1

Journal of Cheminformatics (Apr 2025)

Affiliations

Seungchan An: College of Pharmacy, Natural Products Research Institute, Seoul National University
Yeonjin Lee: College of Pharmacy, Natural Products Research Institute, Seoul National University
Junpyo Gong: College of Pharmacy, Natural Products Research Institute, Seoul National University
Seokyoung Hwang: College of Pharmacy, Natural Products Research Institute, Seoul National University
In Guk Park: College of Pharmacy, Natural Products Research Institute, Seoul National University
Jayhyun Cho: College of Pharmacy, Natural Products Research Institute, Seoul National University
Min Ju Lee: College of Pharmacy, Natural Products Research Institute, Seoul National University
Minkyu Kim: College of Pharmacy, Natural Products Research Institute, Seoul National University
Yun Pyo Kang: College of Pharmacy, Natural Products Research Institute, Seoul National University
Minsoo Noh: College of Pharmacy, Natural Products Research Institute, Seoul National University

DOI: https://doi.org/10.1186/s13321-025-00999-1
Journal volume & issue: Vol. 17, no. 1
pp. 1 – 14

Abstract

The development of robust artificial intelligence (AI)-driven predictive models relies on high-quality, diverse chemical datasets. However, the scarcity of negative data and a publication bias toward positive results often hinder accurate biological activity prediction. To address this challenge, we introduce InertDB, a comprehensive database comprising 3,205 curated inactive compounds (CICs) identified through rigorous review of over 4.6 million compound records in PubChem. CIC selection prioritized bioassay diversity, determined using natural language processing (NLP)-based clustering metrics, while ensuring minimal biological activity across all evaluated bioassays. Notably, 97.2% of CICs adhere to the Rule of Five, a proportion significantly higher than that of the overall PubChem dataset. To further expand the chemical space, InertDB also features 64,368 generated inactive compounds (GICs) produced using a deep generative AI model trained on the CIC dataset. Compared to conventional approaches such as random sampling or property-matched decoys, InertDB significantly improves predictive AI performance, particularly for phenotypic activity prediction, by providing reliable inactive compound sets.

Scientific contributions

InertDB addresses a critical gap in AI-driven drug discovery by providing a comprehensive repository of biologically inactive compounds, effectively resolving the scarcity of negative data that limits prediction accuracy and model reliability. By leveraging language model-based bioassay diversity metrics and generative AI, InertDB integrates rigorously curated inactive compounds with an expanded chemical space. InertDB serves as a valuable alternative to random sampling and decoy generation, offering improved training datasets and enhancing the accuracy of phenotypic pharmacological activity prediction.
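
The Rule of Five adherence figure quoted in the abstract can be illustrated with a minimal sketch. It assumes the four standard Lipinski thresholds (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) and takes descriptors that have already been computed elsewhere. The `Descriptors` container, the function names, and the `max_violations` default of 0 are assumptions made for illustration; the abstract does not state which violation threshold or descriptor software the authors used, and the example values below are not taken from InertDB.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one compound (hypothetical container)."""
    mol_weight: float       # molecular weight in Da
    logp: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def ro5_violations(d: Descriptors) -> int:
    """Count how many of the four Lipinski criteria are violated."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def passes_ro5(d: Descriptors, max_violations: int = 0) -> bool:
    """True if the compound has at most `max_violations` violations.

    The default of 0 is an assumption; some workflows tolerate one violation.
    """
    return ro5_violations(d) <= max_violations


def ro5_fraction(compounds: Iterable[Descriptors], max_violations: int = 0) -> float:
    """Fraction of compounds in a set that pass the Rule of Five check."""
    items = list(compounds)
    if not items:
        raise ValueError("compound set is empty")
    return sum(passes_ro5(c, max_violations) for c in items) / len(items)


# Illustrative descriptor values only, not InertDB records.
small = Descriptors(mol_weight=180.2, logp=1.2, h_bond_donors=1, h_bond_acceptors=4)
large = Descriptors(mol_weight=1202.6, logp=3.0, h_bond_donors=5, h_bond_acceptors=12)

print(ro5_violations(small), ro5_violations(large))
print(ro5_fraction([small, large]))
```

Applied to a full compound set, `ro5_fraction` yields the kind of proportion the abstract reports for the CICs, subject to the threshold assumptions stated above.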

Published in Journal of Cheminformatics

ISSN: 1758-2946 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Chemistry
Website: https://jcheminf.biomedcentral.com/
