Machine Learning: Science and Technology (Jan 2025)
Machine and deep learning performance in out-of-distribution regressions
Abstract
Machine learning (ML) and deep learning (DL) models are gaining popularity due to their effectiveness in many computational tasks. These models rest on an intuitive, but frequently unsatisfied, assumption that the data used to train them represents the task at hand well. This gives rise to the out-of-distribution (OOD) challenge, which can cause an unexpected drop in a data-driven model’s performance. In this study, we evaluate the performance of various ML and DL models in in-distribution (ID) versus OOD prediction. While the degradation in OOD performance is well acknowledged, to the best of our knowledge, this is one of the first studies to quantify it for various models on a large benchmark of n = 15 real-world regression datasets. We extensively ($n \gt 40\,000$ runs) compare the ID versus OOD performance of XGBoost, random forest, K-nearest-neighbors, support vector machine, and linear regression models, as well as AutoML models (Tree-based Pipeline Optimization Tool and AutoKeras). In addition, to tackle this challenge, we propose integrating symbolic regression (SR) as a feature-engineering step with an ML or DL model to improve its performance on OOD samples. Our results show that incorporating SR-derived features significantly enhances the predictive performance of both ML and DL models on OOD samples, by 3.70% and 10.20% on average, respectively, without reducing ID performance; in fact, ID performance also improves, to a slightly smaller extent. As such, this method can help produce more generalized and robust data-driven models.
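The SR-as-feature-engineering idea summarized above can be sketched on a toy task. This is a hypothetical illustration only: the nonlinear feature is supplied by hand to stand in for one discovered by an SR search, and the downstream model is plain least squares rather than the ML/DL models benchmarked in the paper. The point it demonstrates is that a model given a symbolically derived feature can extrapolate to OOD inputs where a model on raw features cannot.

```python
import numpy as np

# Synthetic task: y = x^2. Train in-distribution on x in [0, 2],
# evaluate out-of-distribution on x in [3, 5].
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 2.0, 200)
x_ood = rng.uniform(3.0, 5.0, 200)
y_train, y_ood = x_train**2, x_ood**2

def fit_predict(feats_train, y, feats_test):
    # Ordinary least squares with an intercept column.
    A = np.column_stack([np.ones(len(feats_train)), feats_train])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    B = np.column_stack([np.ones(len(feats_test)), feats_test])
    return B @ coef

# Baseline: linear model on the raw feature only.
pred_base = fit_predict(x_train[:, None], y_train, x_ood[:, None])

# "SR-augmented": append the symbolic feature x^2 (supplied by hand here;
# in the paper's method it would be discovered by symbolic regression).
aug_train = np.column_stack([x_train, x_train**2])
aug_ood = np.column_stack([x_ood, x_ood**2])
pred_aug = fit_predict(aug_train, y_train, aug_ood)

def mse(pred, y):
    return float(np.mean((pred - y) ** 2))

print("OOD MSE, raw features:      ", mse(pred_base, y_ood))
print("OOD MSE, SR-style augmented:", mse(pred_aug, y_ood))
```

With the augmented feature set, the least-squares fit recovers the generating law exactly, so the OOD error collapses, while the raw-feature baseline extrapolates poorly outside its training range.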
Keywords