Enhancing Word Sense Disambiguation for Amharic homophone words using Bidirectional Long Short-Term Memory network

Mequanent Degu Belete; Lijalem Getanew Shiferaw; Girma Kassa Alitasb; Tariku Sinshaw Tamir

Intelligent Systems with Applications (Sep 2024)

Affiliations

Mequanent Degu Belete: Department of Electrical and Computer Engineering, Debre Markos College of Technology, Debre Markos University, Debre Markos, Ethiopia
Lijalem Getanew Shiferaw: Head of ICT Department, Debre Markos University Library Directorate, Debre Markos University, Debre Markos, Ethiopia
Girma Kassa Alitasb: Department of Electrical and Computer Engineering, Debre Markos College of Technology, Debre Markos University, Debre Markos, Ethiopia; Corresponding author.
Tariku Sinshaw Tamir: Department of Electrical and Computer Engineering, Debre Markos College of Technology, Debre Markos University, Debre Markos, Ethiopia

Journal volume & issue: Vol. 23
p. 200417

Abstract

Amharic contains many confusable words because its script (fidel) includes redundant homophone letters: ሀ, ሐ, and ኀ (all three pronounced HA), ሠ and ሰ (both pronounced SE), አ and ዐ (both pronounced AE), and ጸ and ፀ (both pronounced TSE). This work develops a Word Sense Disambiguation (WSD) model that tackles this lexical ambiguity in Amharic using deep learning. Because no Amharic WordNet is available, a total of 1756 examples of paired ambiguous Amharic homophone words were collected. The word pairs were ድህነት (dhnet) and ድኅነት (dhnet), ምሁር (m'hur) and ምሑር (m'hur), በአል (be'al) and በዓል (be'al), and አቢይ (abiy) and ዐቢይ (abiy). Following preprocessing, the text was vectorized using word2vec, fastText, Term Frequency-Inverse Document Frequency (TF-IDF), and bag of words (BoW). The vectorized text was divided into training and test data. The training data was then modelled with Naive Bayes (NB), K-nearest neighbour (KNN), logistic regression (LG), decision tree (DT), and random forest (RF) classifiers, together with a random oversampling technique. Bidirectional Gated Recurrent Unit (BiGRU) and Bidirectional Long Short-Term Memory (BiLSTM) networks improved accuracy to 99.99 % even with the limited dataset.
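
Two of the steps named in the abstract, bag-of-words vectorization and random oversampling, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names `bag_of_words` and `random_oversample`, the placeholder context tokens (`w1` to `w6`), and the 3-to-1 class imbalance are all assumptions made for illustration, and only the two labels are taken from the word pairs listed above.

```python
import random
from collections import Counter

def bag_of_words(sentences):
    """Return (vocabulary, count vectors) for whitespace-tokenized sentences."""
    vocab = sorted({tok for s in sentences for tok in s.split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for s in sentences:
        vec = [0] * len(vocab)
        for tok, n in Counter(s.split()).items():
            vec[index[tok]] = n
        vectors.append(vec)
    return vocab, vectors

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for row, label in zip(X, y):
        by_label.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_label.values())
    X_out, y_out = [], []
    for label, rows in by_label.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out

# Hypothetical toy data: placeholder context tokens, NOT sentences
# from the paper's dataset. Labels are one homophone pair from the abstract.
sentences = ["w1 w2", "w3 w1", "w2 w4", "w5 w6"]
labels = ["ድህነት", "ድህነት", "ድህነት", "ድኅነት"]

vocab, X = bag_of_words(sentences)
X_bal, y_bal = random_oversample(X, labels)
print(Counter(y_bal))  # class counts after balancing
```

In a full pipeline, oversampling would normally be applied to the training portion only, after the train/test split, so that duplicated examples do not leak into the test data.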

Published in Intelligent Systems with Applications

ISSN: 2667-3053 (Online)
Publisher: Elsevier
Country of publisher: United Kingdom
LCC subjects: Science: Science (General): Cybernetics; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.journals.elsevier.com/intelligent-systems-with-applications