Morpheme Embedding for Bahasa Indonesia Using Modified Byte Pair Encoding

Amalia Amalia; Opim Salim Sitompul; Teddy Mantoro; Erna Budhiarti Nababan

doi:10.1109/ACCESS.2021.3128439

IEEE Access (Jan 2021)

Affiliations

Amalia Amalia: Department of Computer Science, Universitas Sumatera Utara, Medan, Indonesia
Opim Salim Sitompul: Department of Information Technology, Universitas Sumatera Utara, Medan, Indonesia
Teddy Mantoro: Department of Computer Science, Sampoerna University, Jakarta, Indonesia
Erna Budhiarti Nababan: Department of Information Technology, Universitas Sumatera Utara, Medan, Indonesia

DOI: https://doi.org/10.1109/ACCESS.2021.3128439
Journal volume & issue: Vol. 9, pp. 155699 – 155710

Abstract

Word embedding is an efficient feature representation that carries semantic and syntactic information. Word embedding operates at the word level, treating each word as the smallest independent unit, and therefore cannot handle out-of-vocabulary (OOV) words, that is, words absent from the training corpus. One solution is to generate embeddings from smaller parts of words, such as morphemes. A morpheme is the smallest linguistic unit of a word that carries meaning in the grammar of a language. This study aims to build a morpheme embedding model for Bahasa Indonesia (in English: the Indonesian language; in short: Bahasa). However, Bahasa has many morphological rules, such as those for inflectional and derivational affixes, so applying all of them during word segmentation increases computational complexity. Moreover, the rules are not regular and do not apply uniformly to all words in Bahasa. Therefore, this study modified the Byte Pair Encoding (BPE) algorithm to generate morpheme embeddings suited to the morphology of Bahasa. The study implemented a simple method that filters the BPE segmentation results against a list of Bahasa’s morphemes. This process was shown to mitigate a limitation of the conventional BPE algorithm, which produces intermediate junk tokens that are not meaningful. Based on three evaluation scenarios, the model can handle OOV words and carries semantic and syntactic information in the embedding values of words.
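
The filtering idea in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not give the details of the modification, so the code below shows only one plausible reading, in which plain BPE merges are learned first and the resulting tokens are then kept only if they appear in a morpheme list. The function names `learn_bpe_tokens` and `filter_by_morphemes`, the four-word corpus, and the tiny morpheme list are all hypothetical and chosen for illustration.

```python
from collections import Counter


def learn_bpe_tokens(words, num_merges):
    """Run plain BPE on a word list and return the merged tokens it creates."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    created = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken alphabetically (an assumption
        # made here only to keep the toy example deterministic).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        created.append(best[0] + best[1])
        # Apply the merge to every word in the corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(best[0] + best[1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return created


def filter_by_morphemes(tokens, morphemes):
    """Keep only the BPE tokens that appear in the morpheme list."""
    return [t for t in tokens if t in morphemes]


# Toy data (illustrative only): forms of "makan" (eat) with common affixes.
words = ["makan", "makanan", "memakan", "dimakan"]
morphemes = {"makan", "me", "di", "an"}

tokens = learn_bpe_tokens(words, num_merges=4)
kept = filter_by_morphemes(tokens, morphemes)
print(tokens)  # ['an', 'ak', 'akan', 'makan']
print(kept)    # ['an', 'makan']
```

In this toy run, plain BPE passes through the intermediate tokens "ak" and "akan" on its way to "makan"; neither is in the toy morpheme list, so the filter discards them and keeps only "an" and "makan".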

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
