Environmental Sciences Europe (Apr 2024)

Developing an ensembled machine learning model for predicting water quality index in Johor River Basin

  • L. M. Sidek,
  • H. A. Mohiyaden,
  • M. Marufuzzaman,
  • N. S. M. Noh,
  • Salim Heddam,
  • Mohammad Ehteram,
  • Ozgur Kisi,
  • Saad Sh. Sammen

DOI
https://doi.org/10.1186/s12302-024-00897-7
Journal volume & issue
Vol. 36, no. 1
pp. 1 – 17

Abstract


Currently, the Water Quality Index (WQI) model is a widely used tool for evaluating surface water quality for agricultural, domestic and industrial use. The WQI is one of the simplest mathematical tools that can assist water operators in decision making when assessing water quality, and it has been widely used in recent years. The water quality analysis and prediction are conducted for the Johor River Basin, located in Johor, Peninsular Malaysia, using data from water quality monitoring stations from upstream to downstream along the river. In this research, a numerical method is first used to calculate the WQI and identify the water classes, which serve to validate the prediction results. Then, two optimized ensemble machine learning models, gradient boosting regression (GB) and random forest regression (RF), are employed to predict the WQI. The initial phase of the study involves analyzing all available data on the river's parameters to gain a comprehensive understanding of overall water quality within the basin. Temporal analysis identified Mg, E. coli, SS, and DS as critical factors affecting water quality in this river basin. Next, for the WQI calculation, a feature importance method is used to identify the most important parameters for predicting the WQI. Finally, an ensemble-based machine learning model is designed to predict the WQI using three parameters. The two ensemble approaches achieved an R² of 0.86 for RF-based regression and 0.85 for GB-based regression. The research shows that, using only the biochemical oxygen demand (BOD), the chemical oxygen demand (COD) and the percentage of dissolved oxygen (DO%), the WQI can be predicted accurately, and that the water class can be predicted correctly in almost 96 out of 100 samples using the GB ensemble algorithm. Moving forward, stakeholders may opt to integrate this research into their analyses, potentially yielding economic and time savings.
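
The abstract reports two kinds of scores: an R² for the predicted WQI and the share of samples whose water class is predicted correctly. The sketch below shows one plausible way to compute both from paired observed and predicted WQI values. It is not the authors' code. The class boundaries are an assumption taken from the commonly cited Malaysian Department of Environment WQI classes (the abstract does not state which boundaries were used), and the function names `wqi_class`, `r_squared` and `class_agreement` are hypothetical.

```python
from typing import Sequence

# ASSUMPTION: lower bounds of WQI classes I-IV as commonly cited for the
# Malaysian DOE index; anything below the last bound is class V.
# How values exactly on a boundary are assigned is also an assumption.
CLASS_LOWER_BOUNDS = [(92.7, "I"), (76.5, "II"), (51.9, "III"), (31.0, "IV")]


def wqi_class(wqi: float) -> str:
    """Map a WQI value (0-100) to a water class label."""
    for lower, label in CLASS_LOWER_BOUNDS:
        if wqi >= lower:
            return label
    return "V"


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


def class_agreement(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Fraction of samples whose predicted WQI falls in the observed class."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(wqi_class(o) == wqi_class(p) for o, p in zip(observed, predicted))
    return matches / len(observed)


if __name__ == "__main__":
    # Made-up illustrative values, not data from the study.
    obs = [95.0, 80.0, 60.0, 40.0, 20.0]
    pred = [93.0, 78.0, 50.0, 42.0, 25.0]
    print("R2:", r_squared(obs, pred))
    print("class agreement:", class_agreement(obs, pred))
```

Scoring the class separately from R² is useful because a small numeric error near a class boundary can still change the reported class, so the two scores can diverge.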

Keywords