Advancing natural language processing (NLP) applications of morphologically rich languages with bidirectional encoder representations from transformers (BERT): an empirical case study for Turkish

Akın Özçift; Kamil Akarsu; Fatma Yumuk; Cevhernur Söylemez

doi:10.1080/00051144.2021.1922150

Automatika (Apr 2021)

Affiliations

Akın Özçift: Manisa Celal Bayar University
Kamil Akarsu: Manisa Celal Bayar University
Fatma Yumuk: Manisa Celal Bayar University
Cevhernur Söylemez: Manisa Celal Bayar University

DOI: https://doi.org/10.1080/00051144.2021.1922150
Journal volume & issue: Vol. 62, no. 2
pp. 226 – 238

Abstract

Language model pre-training architectures have proven useful for learning language representations. Bidirectional encoder representations from transformers (BERT), a recent deep bidirectional self-attention representation learned from unlabelled text, has achieved remarkable results in many natural language processing (NLP) tasks with fine-tuning. In this paper, we demonstrate the efficiency of BERT for a morphologically rich language, Turkish. Traditionally, morphologically difficult languages require extensive language pre-processing steps to make the data suitable for machine learning (ML) algorithms. In particular, tokenization, lemmatization or stemming, and feature engineering are needed to obtain an efficient data model that overcomes data sparsity or high-dimensionality problems. In this context, we selected five Turkish NLP research problems from the literature: sentiment analysis, cyberbullying identification, text classification, emotion recognition and spam detection. We then compared the empirical performance of BERT with baseline ML algorithms. We found that BERT improved on the baseline ML algorithms in the selected NLP problems while eliminating heavy pre-processing tasks.
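
The data sparsity problem mentioned in the abstract can be illustrated with a small sketch. It is not the authors' pipeline: the suffix list, the helper names `tokenize` and `toy_stem`, and the two-line corpus are all assumptions made for illustration, and the suffix stripping is far cruder than a real Turkish morphological analyzer. It only shows why a bag-of-words baseline tends to need stemming when one root appears in many inflected surface forms.

```python
# Toy illustration: inflected forms of one Turkish root ("ev", house)
# inflate a bag-of-words vocabulary until they are stemmed.
# The suffix list below is an assumption for this example only; it is
# not linguistically complete and not taken from the paper.
TOY_SUFFIXES = ("lerimizde", "lerimiz", "lerde", "ler", "de", "im")


def tokenize(text):
    """Split a document on whitespace (input is assumed to be lowercase)."""
    return text.split()


def toy_stem(token, min_stem=2):
    """Strip the longest matching toy suffix, keeping at least min_stem characters."""
    for suffix in sorted(TOY_SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


# Hypothetical mini-corpus: seven surface forms of the same root.
corpus = ["ev evler evlerde", "evlerimiz evlerimizde evde evim"]

raw_tokens = [tok for doc in corpus for tok in tokenize(doc)]
raw_vocab = set(raw_tokens)
stemmed_vocab = {toy_stem(tok) for tok in raw_tokens}

print(len(raw_vocab), "distinct surface forms")
print(len(stemmed_vocab), "distinct stems")
```

Without stemming, each surface form becomes its own feature dimension, which is the sparsity that the baseline ML pipelines have to engineer around. The abstract reports that fine-tuning BERT removes the need for this pre-processing.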

Published in Automatika

ISSN: 0005-1144 (Print); 1848-3380 (Online)
Publisher: Taylor & Francis Group
Country of publisher: United Kingdom
LCC subjects: Technology: Mechanical engineering and machinery: Control engineering systems. Automatic machinery (General); Technology: Technology (General): Industrial engineering. Management engineering: Automation
Website: https://www.tandfonline.com/journals/taut
