IEEE Access (Jan 2024)

Evaluating Neural Network Models for Word Segmentation in Agglutinative Languages: Comparison With Rule-Based Approaches and Statistical Models

  • William Villegas-Ch,
  • Rommel Gutierrez,
  • Alexandra Maldonado Navarro,
  • Aracely Mera-Navarrete

DOI
https://doi.org/10.1109/ACCESS.2024.3486188
Journal volume & issue
Vol. 12
pp. 157556 – 157573

Abstract

Word segmentation in agglutinative languages is challenging because of their morphological complexity and the variability of their linguistic structure. Traditional rule-based and statistical approaches, although practical, are limited in handling these complexities. This study investigates the effectiveness of neural network models, specifically LSTM, Bi-LSTM with CRF, and BERT, in comparison with these traditional methods, using datasets from several agglutinative languages such as Turkish, Finnish, Hungarian, Nahuatl, and Swahili. The methodology includes preprocessing and data augmentation to improve data quality and consistency, followed by training and evaluating the selected models. The results show that the neural network models significantly outperform the rule-based and statistical approaches on all metrics assessed. Specifically, the BERT model achieved 92% accuracy and a 91% F1-score in Turkish, compared with 70% and 67%, respectively, for the rule-based models. Moreover, the Bi-LSTM with CRF reached 86% recall in Finnish, significantly outperforming traditional models. Advanced preprocessing and data augmentation techniques help optimize the models' performance. This study confirms the effectiveness of neural network models in word segmentation and provides a valuable framework for future research in natural language processing in complex linguistic contexts.
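
The abstract reports accuracy, recall, and F1-score but does not say how they are computed. As a minimal sketch, assuming the scores are taken over morpheme boundary positions (one common convention for segmentation, not confirmed by the abstract), precision, recall, and F1 could be calculated as below. The function names `boundaries` and `boundary_prf` and the two example words are illustrative assumptions, not taken from the study's datasets or code.

```python
def boundaries(segments):
    """Return the set of internal boundary offsets for one segmented word."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:  # the end of the last segment is not a boundary
        pos += len(seg)
        offsets.add(pos)
    return offsets


def boundary_prf(gold, pred):
    """Micro-averaged boundary precision, recall, and F1 over paired words."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gb, pb = boundaries(g), boundaries(p)
        tp += len(gb & pb)  # boundaries predicted and present in gold
        fp += len(pb - gb)  # boundaries predicted but absent from gold
        fn += len(gb - pb)  # gold boundaries the prediction missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical examples: Turkish "evlerimde" and Finnish "taloissa".
gold = [["ev", "ler", "im", "de"], ["talo", "i", "ssa"]]
pred = [["ev", "ler", "imde"], ["talo", "issa"]]

precision, recall, f1 = boundary_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case every predicted boundary is correct, but two of the five gold boundaries are missed, so precision is 1.0, recall is 0.6, and F1 is about 0.75. An under-segmenting model can therefore score well on precision alone, which is why recall and F1 are worth reporting alongside it.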

Keywords