IEEE Access (Jan 2024)

Improving Turkish Text Sentiment Classification Through Task-Specific and Universal Transformations: An Ensemble Data Augmentation Approach

  • Aytug Onan
  • Kadriye Filiz Balbal

DOI
https://doi.org/10.1109/ACCESS.2024.3349971
Journal volume & issue
Vol. 12
pp. 4413 – 4458

Abstract

The exponential growth of digital data in recent years has spurred significant interest in natural language processing (NLP) and sentiment analysis. However, the effectiveness of NLP models relies heavily on the availability of large annotated datasets, which are scarce or entirely absent for many languages, including Turkish. This scarcity is a major obstacle to developing NLP models for Turkish. To overcome it, various techniques have been proposed to enlarge annotated datasets, with text data augmentation emerging as a promising solution. Text data augmentation generates synthetic data by transforming existing data, thereby expanding the diversity and volume of the annotated dataset. While this technique has proven successful in improving the performance of NLP models, it has received limited attention for Turkish and other low-resource languages. This paper introduces a novel ensemble approach to text data augmentation tailored for Turkish text sentiment classification. Our approach integrates both task-specific and universal transformations, capitalizing on the strengths of each to enrich the training dataset. We evaluate the proposed approach on the TRSAv1 dataset and compare it with established data augmentation techniques. The experimental results demonstrate that our ensemble method achieves higher sentiment classification accuracy than conventional techniques. Additionally, we conduct an in-depth analysis of the impact of individual transformation functions on classification performance. Our contribution lies in bridging the gap in research on data augmentation techniques tailored to Turkish NLP, emphasizing the need for more advanced ensemble methods, and offering benchmarking results that pave the way for the development of accurate NLP models not only for Turkish but also for other low-resource languages.
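
To make the idea in the abstract concrete, the following is a minimal illustrative sketch of ensemble text augmentation, not the authors' implementation. The abstract does not list the transformations used, so everything here is an assumption: `random_swap` and `random_deletion` stand in for universal (task-agnostic) transformations, `sentiment_synonym` and its toy `SENTIMENT_SYNONYMS` lexicon stand in for a task-specific one, and `augment` is a hypothetical helper that pools the outputs of all transformations while keeping each example's label.

```python
import random

def random_swap(tokens, rng):
    """Universal transformation (assumed): swap two randomly chosen tokens."""
    if len(tokens) < 2:
        return list(tokens)
    i, j = rng.sample(range(len(tokens)), 2)
    out = list(tokens)
    out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, rng, p=0.2):
    """Universal transformation (assumed): drop each token with probability p."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    # Never return an empty sentence: fall back to one random token.
    return kept if kept else [rng.choice(tokens)]

# Hypothetical toy lexicon; a real system would need a proper Turkish resource.
SENTIMENT_SYNONYMS = {"güzel": ["harika"], "kötü": ["berbat"]}

def sentiment_synonym(tokens, rng):
    """Task-specific transformation (assumed): replace sentiment-bearing words
    with synonyms of the same polarity, so the label stays valid."""
    return [rng.choice(SENTIMENT_SYNONYMS[t]) if t in SENTIMENT_SYNONYMS else t
            for t in tokens]

def augment(dataset, transforms, seed=0):
    """Ensemble step: apply every transformation to every example and pool
    the new, non-duplicate sentences with the original data."""
    rng = random.Random(seed)
    augmented = list(dataset)
    seen = {text for text, _ in dataset}
    for text, label in dataset:
        tokens = text.split()
        for transform in transforms:
            new_text = " ".join(transform(tokens, rng))
            if new_text and new_text not in seen:
                seen.add(new_text)
                augmented.append((new_text, label))  # label is inherited
    return augmented

data = [("film çok güzel", "pos"), ("yemek kötü", "neg")]
out = augment(data, [random_swap, random_deletion, sentiment_synonym])
for text, label in out:
    print(label, text)
```

In this sketch each synthetic sentence inherits the label of the sentence it was derived from, which only holds if the transformations preserve sentiment; that is the reason for separating task-specific from universal transformations.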

Keywords