IEEE Access (Jan 2024)

Semi-Automatic Dataset Annotation Applied to Automatic Violent Message Detection

  • Beatriz Botella-Gil,
  • Robiert Sepulveda-Torres,
  • Alba Bonet-Jover,
  • Patricio Martinez-Barco,
  • Estela Saquete

DOI
https://doi.org/10.1109/ACCESS.2024.3361404
Journal volume & issue
Vol. 12
pp. 19651 – 19664

Abstract

Annotated corpora are indispensable for training computational models in Artificial Intelligence and Natural Language Processing. However, manual annotation is costly, arduous, and time-consuming, especially when the annotation is semantically complex. To address this problem, this work applies a methodology for semi-automatic dataset annotation based on the Human-in-the-Loop paradigm. The methodology supports the construction of a resource with fine-grained annotation that aids in detecting violent messages in Spanish sourced from social media (Twitter/X). Applying the proposed methodology to violence annotation yielded a high-quality resource, hereafter referred to as VILLANOS. The methodology annotates the dataset incrementally, which increases annotator efficiency and thereby validates the suitability of the proposal. Annotation time was reduced by 52% compared to manual annotation, and a model trained on the VILLANOS dataset obtains an $F_{1}$ of 85.2%. These results demonstrate the efficiency and effectiveness of the methodology.
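
The general shape of an incremental Human-in-the-Loop annotation cycle can be illustrated with a minimal sketch. It is not the authors' implementation: the toy token-count classifier, the function names (`train`, `predict`, `annotate_incrementally`), the `review` callback standing in for the human annotator, and the English example messages are all assumptions made for illustration. The idea shown is only that a model trained on the labels collected so far pre-annotates each new batch, a human confirms or corrects the suggestions, and the reviewed batch is added to the training data before the next round.

```python
from collections import Counter

def train(labeled):
    """Toy model (assumption): token counts per class, 1 = violent, 0 = non-violent."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in labeled:
        counts[label].update(text.lower().split())
    return counts

def predict(model, text):
    """Suggest 1 if the tokens were seen more often in violent than in non-violent messages."""
    tokens = text.lower().split()
    score = sum(model[1][t] - model[0][t] for t in tokens)
    return 1 if score > 0 else 0

def annotate_incrementally(seed, batches, review):
    """Pre-annotate each batch with the current model, let a human review, then retrain.

    `review(text, suggestion)` is a hypothetical callback that returns the
    label the human annotator accepts for `text`.
    """
    labeled = list(seed)
    corrections = []  # number of suggestions the human had to change, per batch
    for batch in batches:
        model = train(labeled)  # retrain on everything reviewed so far
        suggestions = [(text, predict(model, text)) for text in batch]
        reviewed = [(text, review(text, s)) for text, s in suggestions]
        changed = sum(1 for (_, s), (_, r) in zip(suggestions, reviewed) if s != r)
        corrections.append(changed)
        labeled.extend(reviewed)
    return labeled, corrections

# Usage with made-up data; a dictionary of gold labels simulates the annotator.
gold = {
    "i will hurt them": 1,
    "what a nice morning": 0,
    "they will hurt you": 1,
    "nice day today": 0,
}
seed = [("i will hurt you", 1), ("have a nice day", 0)]
batches = [
    ["i will hurt them", "what a nice morning"],
    ["they will hurt you", "nice day today"],
]
labeled, corrections = annotate_incrementally(
    seed, batches, review=lambda text, suggestion: gold[text]
)
```

In a setup like this, the per-batch correction count is one simple way to observe whether the annotator's workload shrinks as the labeled set grows.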

Keywords