Frontiers in Sustainable Food Systems (Oct 2020)

Predictive Models May Complement or Provide an Alternative to Existing Strategies for Assessing the Enteric Pathogen Contamination Status of Northeastern Streams Used to Provide Water for Produce Production

  • Daniel L. Weller,
  • Daniel L. Weller,
  • Tanzy M. T. Love,
  • Alexandra Belias,
  • Martin Wiedmann

DOI
https://doi.org/10.3389/fsufs.2020.561517
Journal volume & issue
Vol. 4

Abstract

Read online

While the Food Safety Modernization Act established standards for the use of surface water for produce production, water quality is known to vary over space and time. Targeted approaches for identifying hazards in water that account for this variation may improve growers' ability to address pre-harvest food safety risks. Models that utilize publicly-available data (e.g., land-use, real-time weather) may be useful for developing these approaches. The objective of this study was to use pre-existing datasets collected in 2017 (N = 181 samples) and 2018 (N = 191 samples) to train and test models that predict the likelihood of detecting Salmonella and pathogenic E. coli markers (eaeA, stx) in agricultural water. Four types of features were used to train the models: microbial, physicochemical, spatial and weather. “Full models” were built using all four features types, while “nested models” were built using between one and three types. Twenty learners were used to develop separate full models for each pathogen. Separately, to assess information gain associated with using different feature types, six learners were randomly selected and used to develop nine, nested models each. Performance measures for each model were then calculated and compared against baseline models where E. coli concentration was the sole covariate. In the methods, we outline the advantages and disadvantages of each learner. Overall, full models built using ensemble (e.g., Node Harvest) and “black-box” (e.g., SVMs) learners out-performed full models built using more interpretable learners (e.g., tree- and rule-based learners) for both outcomes. However, nested eaeA-stx models built using interpretable learners and microbial data performed almost as well as these full models. While none of the nested Salmonella models performed as well as the full models, nested models built using spatial data consistently out-performed models that excluded spatial data. These findings demonstrate that machine learning approaches can be used to predict when and where pathogens are likely to be present in agricultural water. This study serves as a proof-of-concept that can be built upon once larger datasets become available and provides guidance on the learner-data combinations that should be the foci of future efforts (e.g., tree-based microbial models for pathogenic E. coli).

Keywords