Enhancing model robustness to imbalanced species abundance distributions: Eliminating misclassified records via a model-agnostic approach, exemplified by tuna fisheries datasets

Zhexuan Li; Tianjiao Zhang; Liming Song

Ecological Informatics (Dec 2024)

Affiliations

Zhexuan Li: College of Information Technology, Shanghai Ocean University, Shanghai, 201306, Shanghai, China
Tianjiao Zhang: College of Information Technology, Shanghai Ocean University, Shanghai, 201306, Shanghai, China; Corresponding author.
Liming Song: College of Marine Sciences, Shanghai Ocean University, Shanghai, 201306, Shanghai, China; National Engineering Research Center for Oceanic Fisheries, Shanghai, 201306, Shanghai, China

Journal volume & issue: Vol. 84, p. 102905

Abstract

Anomalies in species abundance data can cause classification errors in ecological forecasting models, and accurately locating these anomalies can improve a model's predictive capacity. This study proposes an approach for precisely identifying and correcting anomalies in imbalanced species abundance data, thereby addressing the challenges posed by both anomalous and imbalanced species abundance distributions (SADs). Confident Learning (CL) theory, a model-agnostic statistical tool, is introduced to estimate the probability that each sample is misclassified during the prediction phase. Specifically, the approach targets the classification errors made by models trained on imbalanced SADs, treats those records as anomalies, and removes them during data cleansing. The approach is applied to tuna fisheries datasets, focusing on bigeye tuna (Thunnus obesus), a target species in longline fishing, and albacore tuna (Thunnus alalunga), a non-target species in the tropical Atlantic Ocean. These datasets, spanning 2016 to 2019, have a spatial resolution of 0.5° × 0.5° and a daily temporal resolution, providing a comprehensive view of imbalanced data scenarios. The results demonstrate that all four predictors, Support Vector Machine (SVM), Logistic Regression (LR), Random Forests (RF), and Extreme Gradient Boosting (XGBoost), improved considerably after training on the cleaned datasets. Notably, SVM and LR achieved overall accuracy of over 90% in predicting both low- and high-abundance fishing grounds. These results indicate that eliminating anomalies can enhance the robustness of ecological forecasting models to imbalanced SADs, offering new insights and technical support for the refined prediction and assessment of ecological resources.
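
The abstract does not give implementation details, so the following is only a minimal sketch of how a Confident Learning style filter might flag suspect records. It assumes out-of-sample predicted probabilities are already available from any classifier, and it uses the commonly described CL rule of per-class thresholds equal to the mean predicted probability among samples carrying that label. The function name `find_label_issues`, the two-class coding (0 = low abundance, 1 = high abundance), and the toy numbers are all hypothetical and are not taken from the study.

```python
def find_label_issues(labels, pred_probs):
    """Return indices whose given label disagrees with a confidently predicted class.

    labels:     list of integer class labels (assumed coding: 0 = low, 1 = high abundance)
    pred_probs: list of per-class probability lists, assumed to be out-of-sample
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean predicted probability of class j
    # over the samples whose given label is j.
    thresholds = []
    for j in range(n_classes):
        probs_j = [p[j] for y, p in zip(labels, pred_probs) if y == j]
        # A class with no labelled samples can never be predicted confidently.
        thresholds.append(sum(probs_j) / len(probs_j) if probs_j else float("inf"))

    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose probability clears that class's own threshold.
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not confident:
            continue  # no confident prediction, so keep the record
        best = max(confident, key=lambda j: p[j])
        if best != y:
            issues.append(i)  # given label conflicts with the confident class
    return issues


# Toy, imbalanced example (made-up values): six "low" records, three "high" records.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10], [0.80, 0.20], [0.85, 0.15],
    [0.90, 0.10], [0.10, 0.90], [0.95, 0.05],
    [0.20, 0.80], [0.10, 0.90], [0.90, 0.10],
]

issues = find_label_issues(labels, pred_probs)
cleaned_labels = [y for i, y in enumerate(labels) if i not in set(issues)]
print(issues)
```

Because the thresholds are computed per class, a minority class is judged against its own typical confidence rather than a global cutoff, which is one reason this kind of rule is often considered suitable for imbalanced data. In practice the cleaned records would then be used to retrain the predictor.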

Published in Ecological Informatics

ISSN: 1574-9541 (Print); 1878-0512 (Online)
Publisher: Elsevier
Country of publisher: Netherlands
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Biology (General): Ecology
Website: https://www.sciencedirect.com/journal/ecological-informatics
