Artificial Intelligence in the Life Sciences (Jun 2024)
Statistical approaches enabling technology-specific assay interference prediction from large screening data sets
Abstract
High-throughput screening (HTS) technologies allow the biological testing of hundreds of thousands of compounds per day. Typically, a substantial proportion of the initial hits obtained by HTS are artifacts caused by assay interference. Global and technology-specific in silico models have therefore been developed to identify and predict compounds that interfere with biological assays. The global models benefit from training on large screening data sets, while the specialized models benefit from training on assay technology-specific experimental data. In this work, we develop and explore strategies for building better predictors of technology-specific assay interference. These strategies use the large bioactivity data matrices on which global models are trained and apply partly new compound labeling approaches that preserve the assay technology awareness of specialized models. We demonstrate the utility of the statistically derived interference labels in machine learning, using fluorescence-based assay interference as a representative example. Our random forest and multi-layer perceptron classifiers outperformed existing models, achieving Matthews correlation coefficients (MCCs) of up to 0.47 on holdout data and up to 0.45 on an external test set. These results demonstrate that accurate assay-specific interference labels can be derived from large bioactivity data matrices, enabling the development of new machine learning models without the need for further experimental data.
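For readers unfamiliar with the reported metric, the following minimal sketch shows how an MCC is computed from the four cells of a binary confusion matrix (here: interfering vs. non-interfering compounds). The function name `mcc_from_counts` and all counts are hypothetical illustrations, not values or code from this work.

```python
import math


def mcc_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    tp/fn: interfering compounds predicted correctly / missed.
    tn/fp: non-interfering compounds predicted correctly / flagged wrongly.
    Returns a value in [-1, 1]; 0.0 is returned by convention when the
    denominator is zero (a whole row or column of the matrix is empty).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical counts, chosen only to illustrate the limiting cases.
perfect = mcc_from_counts(tp=50, tn=50, fp=0, fn=0)     # all predictions correct
chance = mcc_from_counts(tp=25, tn=25, fp=25, fn=25)    # no better than random
inverted = mcc_from_counts(tp=0, tn=0, fp=50, fn=50)    # all predictions wrong
```

Because the MCC uses all four cells, it remains informative when interfering compounds are a small minority of the screened compounds, a setting in which accuracy alone can be misleading.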