Efficient Data-Driven Machine Learning Models for Water Quality Prediction

Elias Dritsas; Maria Trigka

doi:10.3390/computation11020016

Computation (Jan 2023)

Affiliations

Elias Dritsas: Department of Computer Engineering and Informatics, University of Patras, 26504 Patras, Greece
Maria Trigka: Department of Computer Engineering and Informatics, University of Patras, 26504 Patras, Greece

Journal volume & issue: Vol. 11, no. 2, p. 16

Abstract

Water is a valuable, necessary and, unfortunately, scarce commodity in both developing and developed countries. It is undoubtedly the most important natural resource on the planet and an essential nutrient for human health. Geo-environmental pollution can be caused by many types of waste, such as municipal solid, industrial, agricultural (e.g., pesticides and fertilisers) and medical waste, making water unsuitable for use by any living being. Therefore, finding efficient methods to automate the checking of water suitability is of great importance. In this research work, we leverage a supervised learning approach to design predictive models that are as accurate as possible, trained on a labelled dataset, for identifying water suitability, either for consumption or for other uses. We take a set of physicochemical and microbiological parameters as input features that represent the water’s status and determine its suitability class (namely, safe or nonsafe). From a methodological perspective, the problem is treated as a binary classification task. The performance of machine learning models such as Naive Bayes (NB), Logistic Regression (LR), k-Nearest Neighbours (kNN), tree-based classifiers and ensemble techniques is evaluated with and without class balancing, i.e., with or without the Synthetic Minority Oversampling Technique (SMOTE), and compared in terms of Accuracy, Recall, Precision and Area Under the Curve (AUC). The results show that the Stacking classification model after SMOTE with 10-fold cross-validation outperforms the others, with an Accuracy and Recall of 98.1%, a Precision of 100% and an AUC of 99.9%. In conclusion, this article presents a framework that can support researchers’ efforts toward water quality prediction using machine learning (ML).
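
The class-balancing step named in the abstract can be illustrated with a minimal sketch of the SMOTE idea: each synthetic minority sample is placed at a random point on the segment between an existing minority sample and one of its k nearest minority neighbours. This is not the authors' implementation. The function names `smote` and `balance`, the default `k=5`, the fixed seed and the choice of label 1 as the minority class are all assumptions made for illustration, and the sketch uses only the Python standard library.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Illustrative sketch of the SMOTE idea, not the paper's implementation.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Sort the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label=1, k=5, seed=0):
    """Oversample the minority class until both classes have equal size.

    Assumes binary labels; minority_label=1 is a hypothetical convention.
    """
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    n_needed = max(n_majority - len(minority), 0)
    new_points = smote(minority, n_needed, k=k, seed=seed)
    return X + new_points, y + [minority_label] * len(new_points)
```

Calling `balance(X, y)` on a list of feature vectors and binary labels returns the original samples followed by the synthetic ones. As a general caution rather than a statement about this study, oversampling is usually applied only to the training portion of each cross-validation fold, so that synthetic points derived from test samples do not leak into training.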

Published in Computation

ISSN: 2079-3197 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://www.mdpi.com/journal/computation
