Scientific Reports (Mar 2023)

A proficiency assessment of integrating machine learning (ML) schemes on Lahore water ensemble

  • Nazish Shahid

DOI
https://doi.org/10.1038/s41598-023-32280-6
Journal volume & issue
Vol. 13, no. 1
pp. 1 – 19

Abstract

Read online

Abstract A synthesis of statistical inference and machine learning (ML) tools has been employed to establish a comprehensive insight of a coarse data. Water components’ data for 16 central distributing locations of Lahore, the capital of second most populated province of Pakistan, has been analyzed to gauge current water stature of the city. Moreover, a classification of surplus-response variables through tolerance manipulation was incorporated to debrief dimension aspect of the data. By the same token, the influence of supererogatory variables’ renouncement through identification of clustering movement of constituents is inquired. The approach of building a spectrum of colluding results through application of comparable methods has been experimented. To test the propriety of each statistical method prior to its execution on a huge data, a faction of ML schemes have been proposed. The supervised learning tools pca, factoran and clusterdata were implemented to establish an elemental character of water at elected locations. A location ‘LAH-13’ was highlighted for containing an out of normal range Total Dissolved Solids (TDS) concentration in the water. The classification of lower and higher variability parameters carried out by Sample Mean (XBAR) control identified a set of least correlated variables pH, As, Total Coliforms and E. Coli. The analysis provided four locations LAH-06, LAH-10, LAH-13 and LAH-14 for extreme concentration propensity. An execution of factoran demonstrated that specific tolerance of independent variability ‘0.005’ could be employed to reduce dimension of a system without loss of fundamental data information. A higher value of cophenetic coefficient, c = 0.9582 provided the validation for an accurate cluster division of similar characteristics’ variables. The current approach of mutually validating ML and SA (statistical analysis) schemes will assist in preparing the groundwork for state of the art analysis (SOTA) analysis. The advantage of our approach can be examined through the fact that the related SOTA will further refine the predictive precision between two comparable methods, unlike the SOTA analysis between two random ML methods. Conclusively, this study featured the locations LAH-03, LAH-06, LAH-12, LAH-13, LAH-14 and LAH-15 with compromised water quality in the region.