Machine Learning with Applications (Sep 2023)
Minimization of high computational cost in data preprocessing and modeling using MPI4Py
Abstract
Data preprocessing is a fundamental stage in deep learning modeling and the cornerstone of reliable data analytics. Deep learning models require large amounts of training data to be effective; models trained on small datasets often overfit and perform poorly on large datasets. One solution to this problem is parallelization in data modeling, which allows the model to fit the training data more effectively, leading to higher accuracy on large datasets and better overall performance. In this research, we developed a novel approach that deploys parallel computing tools, namely MPI and MPI4Py, to handle both data preprocessing and deep learning modeling. As a case study, the technique is applied to COVID-19 data from the state of Tennessee, USA. Finally, we demonstrate the effectiveness of our approach by comparing it with existing methods that do not use parallel computing tools such as MPI4Py. Our results show promising outcomes for deploying parallel computing in modeling to minimize high computational cost.
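To make the idea of MPI-based preprocessing concrete, the following is a minimal sketch of a scatter/gather pattern with MPI4Py. It is not the implementation used in this work: the helper names `chunk_bounds`, `clean_chunk`, and `main`, the cleaning rule (dropping missing values), and the sample values are all hypothetical and chosen only for illustration. The sketch assumes `mpi4py` and an MPI runtime are installed.

```python
import math


def chunk_bounds(n_rows, n_ranks, rank):
    """Return the [start, stop) row range owned by `rank` in a near-even split."""
    base, extra = divmod(n_rows, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def clean_chunk(rows):
    """Hypothetical preprocessing step: drop missing entries and cast to float."""
    return [
        float(r)
        for r in rows
        if r is not None and not (isinstance(r, float) and math.isnan(r))
    ]


def main():
    # Imported here because importing mpi4py.MPI initializes the MPI runtime.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    if rank == 0:
        # Made-up values standing in for a real dataset loaded on the root rank.
        data = [12, None, 7, float("nan"), 30, 18, None, 25, 9, 41]
        chunks = [
            data[slice(*chunk_bounds(len(data), size, r))] for r in range(size)
        ]
    else:
        chunks = None

    local = comm.scatter(chunks, root=0)     # each rank receives one chunk
    cleaned = clean_chunk(local)             # ranks preprocess independently
    gathered = comm.gather(cleaned, root=0)  # root collects the cleaned chunks

    if rank == 0:
        merged = [x for part in gathered for x in part]
        print(merged)
```

To try the sketch, call `main()` at the end of a script and launch it with an MPI launcher, for example `mpiexec -n 4 python script.py` (the script name is a placeholder). The lowercase `scatter` and `gather` methods pickle arbitrary Python objects, which is convenient for a sketch; for large numeric arrays, the buffer-based `Scatter` and `Gather` variants with NumPy arrays are generally faster.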