Information (Dec 2023)

Offensive Text Span Detection in Romanian Comments Using Large Language Models

  • Andrei Paraschiv,
  • Teodora Andreea Ion,
  • Mihai Dascalu

DOI
https://doi.org/10.3390/info15010008
Journal volume & issue
Vol. 15, no. 1
p. 8

Abstract

Read online

The advent of online platforms and services has revolutionized communication, enabling users to share opinions and ideas seamlessly. However, this convenience has also brought about a surge in offensive and harmful language across various communication mediums. In response, social platforms have turned to automated methods to identify offensive content. A critical research question emerges when investigating the role of specific text spans within comments in conveying offensive characteristics. This paper conducted a comprehensive investigation into detecting offensive text spans in Romanian language comments using Transformer encoders and Large Language Models (LLMs). We introduced an extensive dataset of 4800 Romanian comments annotated with offensive text spans. Moreover, we explored the impact of varying model sizes, architectures, and training data volumes on the performance of offensive text span detection, providing valuable insights for determining the optimal configuration. The results argue for the effectiveness of BERT pre-trained models for this span-detection task, showcasing their superior performance. We further investigated the impact of different sample-retrieval strategies for few-shot learning using LLMs based on vector text representations. The analysis highlights important insights and trade-offs in leveraging LLMs for offensive-language-detection tasks.

Keywords