Romanized Tunisian dialect transliteration using sequence labelling techniques

Jihene Younes; Hadhemi Achour; Emna Souissi; Ahmed Ferchichi

Journal of King Saud University: Computer and Information Sciences (Mar 2022)

Affiliations

Jihene Younes: Université de Tunis, ISGT, LR99ES04 BESTMOD, 2000, Le Bardo, Tunisia; Corresponding author.
Hadhemi Achour: Université de Tunis, ISGT, LR99ES04 BESTMOD, 2000, Le Bardo, Tunisia
Emna Souissi: Université de Tunis, ENSIT, 1008 Montfleury, Tunisia
Ahmed Ferchichi: Université de Tunis, ISGT, LR99ES04 BESTMOD, 2000, Le Bardo, Tunisia

Journal volume & issue: Vol. 34, no. 3, pp. 982–992

Abstract

In recent years, social web users in Arabic-speaking countries have increasingly used dialects as a written language in their online exchanges. Arabic dialects derive from Modern Standard Arabic (MSA) and differ significantly from one country to another and from one region to another. The use of these dialects has led to growing interest within the NLP community in the specificities of such informal languages and in their automatic processing. In this work, we focus on the Tunisian dialect (TD). We address the automatic Latin-to-Arabic transliteration of TD productions on the social web and propose an approach that models transliteration as a sequence labeling task. At the word level, several techniques based on machine learning and deep learning were tested, using real-world messages extracted from social networks. We experiment with and compare three transliteration models: a Conditional Random Fields (CRF) based model, a Bidirectional Long Short-Term Memory (BLSTM) based model, and a BLSTM-based model with CRF decoding (BLSTM-CRF). The results show that BLSTM-CRF performs best, reaching 96.78% of correctly transliterated words. We also evaluate the BLSTM-CRF transliteration approach in context on a set of random TD messages extracted from the social web, obtaining a total error rate of 2.7%, 25% of which are context errors.
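To make the sequence labeling framing concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes that each Latin character of a word receives one label, that a label is an Arabic string (possibly empty), and that the transliteration is the concatenation of the labels. The lookup table `TOY_LABELS` and the functions `tag`, `decode`, and `word_accuracy` are hypothetical names introduced for illustration; the table merely stands in for a trained tagger such as a CRF, BLSTM, or BLSTM-CRF, which would choose each label from its context rather than from a fixed mapping.

```python
from typing import Dict, List

# Hypothetical toy mapping (an assumption, not the paper's label set).
# Digits follow the common Arabic chat alphabet convention.
TOY_LABELS: Dict[str, str] = {
    "3": "ع",
    "7": "ح",
    "9": "ق",
    "b": "ب",
    "l": "ل",
    "m": "م",
}


def tag(word: str) -> List[str]:
    """Assign one label per Latin character.

    Characters missing from the toy table get the empty label, which
    mimics dropping short vowels in Arabic script. A real tagger would
    predict each label from the surrounding characters instead.
    """
    return [TOY_LABELS.get(ch, "") for ch in word.lower()]


def decode(labels: List[str]) -> str:
    """Concatenate the per-character labels into the Arabic-script word."""
    return "".join(labels)


def word_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of words whose predicted transliteration matches the reference."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


if __name__ == "__main__":
    labels = tag("7ob")
    print(labels)          # one label per input character
    print(decode(labels))  # Arabic-script output
```

Under these assumptions, a figure such as "96.78% of correctly transliterated words" corresponds to a word-level measure of the kind computed by `word_accuracy`, where a word counts as correct only if its whole output string matches the reference.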

Published in Journal of King Saud University: Computer and Information Sciences

ISSN: 1319-1578 (Print); 2213-1248 (Online)
Publisher: Springer
Country of publisher: Switzerland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://link.springer.com/journal/44443
