IEEE Access (Jan 2025)

Applying Machine Learning on Big Data With Apache Spark

  • Elias Dritsas,
  • Maria Trigka

DOI
https://doi.org/10.1109/access.2025.3552042
Journal volume & issue
Vol. 13
pp. 53377 – 53393

Abstract

The exponential growth of data in the digital age has necessitated the development of frameworks capable of efficiently handling and processing vast datasets. This paper explores the application of machine learning (ML) models within the Apache Spark ecosystem, focusing on the performance and scalability of these models in big data environments. Through comprehensive experiments on three diverse datasets, namely NYC Taxi Trip Duration, Netflix Prize, and Higgs Boson, we address both regression and classification tasks. For the regression tasks using the NYC Taxi Trip Duration and Netflix Prize datasets, we evaluated models including Linear Regression (LinR), Random Forest (RF), Gradient-Boosted Trees (GBT), Support Vector Regressor (SVR), and K-Nearest Neighbors (KNN). For the classification task using the Higgs Boson dataset, we assessed models such as Logistic Regression (LR), RF, GBT, Support Vector Machines (SVM), and KNN. The study employed key performance metrics such as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) for regression and Accuracy, Precision, Recall, F1-Score, and Area Under the Curve (AUC) for classification. Our findings indicate that Apache Spark’s in-memory processing and distributed computing capabilities provide effective scalability, allowing these models to handle large-scale data with processing time that grows linearly with data size. Finally, the study highlights the importance of model selection and resource optimization in big data contexts and provides valuable insights into the practical integration of ML models within the Spark framework.

Keywords