Cuneiform Text Dialect Identification Using Machine Learning Algorithms and Natural Language Processing (NLP)

Elaf A. Saeed; Ammar D. Jasim; Munther A. Abdul Malik

doi:10.31987/ijict.7.2.265

Iraqi Journal of Information & Communication Technology (Sep 2024)

Affiliations

Elaf A. Saeed: Al-Nahrain University
Ammar D. Jasim
Munther A. Abdul Malik

Journal volume & issue: Vol. 7, no. 2

Abstract

Identifying the languages written in cuneiform script is challenging because resources are scarce and the text is difficult to tokenize. Seven languages and dialects written in cuneiform need to be identified: Sumerian and six dialects of Akkadian (Old Babylonian, Middle Babylonian Peripheral, Standard Babylonian, Neo-Babylonian, Late Babylonian, and Neo-Assyrian). This problem is addressed by the Cuneiform Language Identification (CLI) task in VarDial 2019. This paper presents ten machine learning algorithms drawn from four types of learning: supervised, ensemble, instance-based, and artificial neural network learning. They include the Support Vector Machine (SVM), Naïve Bayes (NB), Logistic Regression (LR), and Decision Tree (DT) algorithms within supervised learning; the K-Nearest Neighbors (KNN) algorithm within instance-based learning; and the Random Forest (RF), Adaptive Boosting (AdaBoost), Extreme Gradient Boosting (XGBoost), and Gradient Boosting (GB) algorithms within ensemble learning. In addition, n-grams, a natural language processing technique, are used to identify the cuneiform dialect. The best result is achieved by an ensemble of Random Forest classifiers working on character-level features, with a macro-averaged F1 score of 96%; the best result for the n-gram approach is 0.82%, obtained with bigrams (di-grams).
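To make the two ideas in the abstract concrete (character-level n-gram features and the macro-averaged F1 score), the following is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the feature pipeline, so the class `NgramNaiveBayes`, the helper names, the add-one smoothing, and the toy training strings are all assumptions made for illustration. The toy strings use Latin letters as stand-ins for cuneiform signs and the labels "A" and "B" as stand-ins for dialect names.

```python
from collections import Counter, defaultdict
import math


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of text (bigrams by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character n-gram counts."""

    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.class_docs = Counter()         # label -> number of training texts
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.class_docs[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_docs):
            total = sum(self.counts[label].values())
            # Log prior plus add-one smoothed log likelihood of each n-gram.
            score = math.log(self.class_docs[label] / total_docs)
            for gram in grams:
                score += math.log((self.counts[label][gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: placeholder strings, not real cuneiform text.
train_texts = ["abababab", "abbaabab", "xyxyxyxy", "yxxyxyxy"]
train_labels = ["A", "A", "B", "B"]
model = NgramNaiveBayes(n=2).fit(train_texts, train_labels)
predictions = [model.predict(t) for t in ["abab", "xyxy"]]
```

Macro averaging gives every class the same weight regardless of how many samples it has, which matters when some dialects are much rarer than others in the training data.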

Keywords: Cuneiform, unigram, CLI, Over-sampling, SVM, RF, DT, KNN, DNN

Published in Iraqi Journal of Information & Communication Technology

ISSN: 2222-758X (Print); 2789-7362 (Online)
Publisher: College of Information Engineering
Country of publisher: Iraq
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: https://ijict.edu.iq/index.php/ijict/index
