Journal of Hydroinformatics (Sep 2021)

A feature extraction method based on the entropy-minimal description length principle and GBDT for common surface water pollution identification

  • Pingjie Huang,
  • Lixiang Wang,
  • Dibo Hou,
  • Wangli Lin,
  • Jie Yu,
  • Guangxin Zhang,
  • Hongjian Zhang

DOI
https://doi.org/10.2166/hydro.2021.060
Journal volume & issue
Vol. 23, no. 5
pp. 1050 – 1065

Abstract

Water quality monitoring is necessary to prevent river water pollution effectively. However, existing water quality assessment methods characterize water quality conditions only to a limited extent, and few studies have focused on feature extraction for water pollution identification or on obtaining accurate pollution source information. This study therefore proposes a feature extraction method, based on the entropy-minimal description length principle and the gradient boosting decision tree (GBDT) algorithm, for identifying the type of surface water pollution. The method takes into account the distribution characteristics and intrinsic associations of conventional water quality indicators. To improve robustness to noise, coarse-grained discretized features are constructed for each water quality indicator based on information entropy. The nonlinear correlation between water quality indicators and pollution classes is mined with the GBDT algorithm, which is used to obtain tree-transformed features. Water samples collected by the Environmental Monitoring Center of a southern city were used to test the proposed algorithm. Experimental results show that the features extracted by the proposed method are more effective than both the raw water quality indicators without feature engineering and the features extracted by principal component analysis.

Highlights

  • Different types of water pollution have unique attributes for risk characterization.
  • Based on a study of the characteristics of water quality data, we propose an innovative feature extraction method based on the entropy-minimal description length principle and the gradient boosting decision tree algorithm.
  • We focus on feature extraction methods for water pollution identification.
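
The entropy-based discretization step described in the abstract can be illustrated with a minimal sketch. It assumes that the "entropy-minimal description length principle" refers to the standard Fayyad-Irani MDLP stopping criterion for recursive binary cuts; the abstract does not confirm the authors' exact implementation. The function names (`entropy`, `best_cut`, `mdlp_accepts`, `mdlp_cut_points`, `discretize`) and the toy readings are hypothetical, chosen only for illustration, and the GBDT stage is not shown.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (gain, index, threshold) of the highest-gain binary cut, or None.

    `values` must be sorted; a cut is only placed between two distinct values.
    """
    n = len(values)
    base = entropy(labels)
    best = None
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue
        left, right = labels[:i], labels[i:]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, i, (values[i - 1] + values[i]) / 2)
    return best


def mdlp_accepts(labels, i, gain):
    """Fayyad-Irani MDLP test: accept the cut at index i only if the
    information gain exceeds the description-length cost of the split."""
    n = len(labels)
    left, right = labels[:i], labels[i:]
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right)
    )
    return gain > (log2(n - 1) + delta) / n


def mdlp_cut_points(values, labels):
    """Recursively find cut points for one indicator (input need not be sorted)."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    v = [p[0] for p in pairs]
    y = [p[1] for p in pairs]
    cuts = []

    def recurse(lo, hi):
        found = best_cut(v[lo:hi], y[lo:hi])
        if found is None:
            return
        gain, i, threshold = found
        if not mdlp_accepts(y[lo:hi], i, gain):
            return
        recurse(lo, lo + i)      # left part first, so cuts stay in ascending order
        cuts.append(threshold)
        recurse(lo + i, hi)

    recurse(0, len(v))
    return cuts


def discretize(value, cuts):
    """Map a raw reading to a coarse-grained bin index (0 .. len(cuts))."""
    return sum(value > c for c in cuts)


# Toy readings of a single indicator (made-up numbers, not data from the paper).
readings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
classes = ["normal"] * 5 + ["polluted"] * 5
cuts = mdlp_cut_points(readings, classes)
print(cuts)                    # [5.5]
print(discretize(7.0, cuts))   # 1
```

Under this assumption, each indicator is replaced by a small integer bin index, so small fluctuations within a bin no longer change the feature value. That is one plausible reading of how coarse-grained discretization improves robustness to noise.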

Keywords