IEEE Access (Jan 2023)

Efficient ML Lifecycle Transferring for Large-Scale and High-Dimensional Data via Core Set-Based Dataset Similarity

  • Van-Duc Le,
  • Tien-Cuong Bui,
  • Wen-Syan Li

DOI
https://doi.org/10.1109/ACCESS.2023.3296136
Journal volume & issue
Vol. 11
pp. 73823–73838

Abstract

Developing an end-to-end machine learning (ML) lifecycle for an ML task can be costly and time-consuming. It involves exploring multiple configurations of ML pipelines, encompassing data preparation, ML model design, training, and deployment. While automated ML (AutoML) can assist in automatically searching and training an optimized ML pipeline, it is computationally intensive and lacks reusability for high-dimensional datasets. Transfer learning has emerged as a popular technique for fine-tuning pre-trained models on related datasets, yet it still requires manual tuning to achieve optimal results. To overcome these challenges, we present a version management system for the end-to-end ML lifecycle, enabling the transfer of lifecycle versions from similar datasets to new ML tasks. Specifically, we introduce an algorithm that leverages core sets to compute similarities for large-scale and high-dimensional datasets efficiently. To the best of our knowledge, we are the first to investigate ML lifecycle transfer for similar high-dimensional datasets. We conducted experiments on real-world datasets comprising computer vision and spatiotemporal sensor data. The experimental results demonstrate the effectiveness of our dataset similarity algorithm and the ML lifecycle version transferring procedure, reducing dataset similarity computation time by up to 60x while improving model accuracy compared to transfer learning. Furthermore, in a practical case study, our solution exhibited up to 3.5x greater efficiency in training time and memory consumption and 9% better model accuracy than manual tuning approaches.
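To make the core-set idea concrete, here is a minimal sketch of how dataset similarity could be computed on core sets instead of full datasets. This is an illustrative assumption, not the paper's actual algorithm: each dataset is summarized by k-means centroids as a core-set surrogate, and similarity is the symmetric Chamfer distance between the two summaries, so the comparison cost depends on k rather than the dataset size.

```python
import numpy as np

def kmeans_coreset(X, k=8, iters=20, seed=0):
    """Summarize X by k centroids (a simple core-set surrogate; hypothetical helper)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def coreset_similarity(A, B, k=8):
    """Symmetric Chamfer distance between core sets (lower = more similar)."""
    ca, cb = kmeans_coreset(A, k), kmeans_coreset(B, k)
    dists = np.linalg.norm(ca[:, None, :] - cb[None, :, :], axis=2)
    return 0.5 * (dists.min(axis=1).mean() + dists.min(axis=0).mean())

# Usage: two draws from the same distribution should score as more
# similar (smaller distance) than draws from different distributions.
rng = np.random.default_rng(1)
A = rng.normal(0, 1, (500, 16))
B = rng.normal(0, 1, (500, 16))
C = rng.normal(5, 1, (500, 16))
print(coreset_similarity(A, B) < coreset_similarity(A, C))
```

Because both datasets are reduced to k points before comparison, the pairwise-distance step touches k × k entries instead of n × n, which is the intuition behind the speedups the abstract reports; the paper's actual core-set construction and similarity measure may differ.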