Learning bilingual word embedding for automatic text summarization in low resource language

Rini Wijayanti; Masayu Leylia Khodra; Kridanto Surendro; Dwi H. Widyantoro

Journal of King Saud University: Computer and Information Sciences (Apr 2023)

Affiliations

Rini Wijayanti: School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Indonesia; Corresponding author at: School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Bandung 40132, Indonesia.
Masayu Leylia Khodra: School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Indonesia; University Center of Excellence on Artificial Intelligence for Vision, Natural Language Processing & Big Data Analytics (U-CoE AI-VLB), Institut Teknologi Bandung, Bandung, Indonesia
Kridanto Surendro: School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Indonesia
Dwi H. Widyantoro: School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Indonesia; University Center of Excellence on Artificial Intelligence for Vision, Natural Language Processing & Big Data Analytics (U-CoE AI-VLB), Institut Teknologi Bandung, Bandung, Indonesia

Journal volume & issue: Vol. 35, no. 4, pp. 224–235

Abstract

Studies in low-resource languages have become more challenging with the increasing volume of text in today's digital era. In addition, the lack of labeled data and text processing libraries for a language further widens the research gap between high-resource and low-resource languages, such as English and Indonesian. This has led to the use of transfer learning, which applies pre-trained models to similar problems, even across languages, by using bilingual or cross-lingual word embeddings. This study therefore investigates two bilingual word embedding methods, VecMap and BiVec, for the Indonesian-English language pair and evaluates them on bilingual lexicon induction and text summarization tasks. The generated bilingual embeddings were compared with MUSE (Multilingual Unsupervised and Supervised Embeddings), an existing multilingual word embedding created with the generative adversarial network method. Furthermore, VecMap was improved by creating shared vocabulary spaces and mapping the unshared ones between languages. The results showed that the embedding produced by the joint method, BiVec, performed better in intrinsic evaluation, especially with CSLS (Cross-Domain Similarity Local Scaling) retrieval. Meanwhile, the improved VecMap outperformed regular VecMap by 16.6% but did not surpass the BiVec evaluation score. These methods enabled model transfer between languages when applied to cross-lingual text summarization. Moreover, adding only 10% of the target-language training dataset was enough for the ROUGE score to exceed that of classical text summarization.
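
The abstract names CSLS retrieval as the setting in which BiVec did best on bilingual lexicon induction. As a rough illustration only, the sketch below shows the commonly used CSLS score, 2·cos(x, y) minus the mean cosine similarity of x to its k nearest target neighbours and of y to its k nearest source neighbours, in NumPy. The function name `csls_retrieve`, the default `k=10`, and the toy data are assumptions made for this example and are not taken from the paper; the embeddings are assumed to be already mapped into a shared space.

```python
import numpy as np

def csls_retrieve(src, tgt, k=10):
    """Hypothetical helper: pick the best target row for each source row under CSLS.

    src: (n_src, d) source-language embeddings, already in the shared space.
    tgt: (n_tgt, d) target-language embeddings, already in the shared space.
    """
    # Normalise rows so that dot products are cosine similarities.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T  # shape (n_src, n_tgt)

    # Neighbourhood sizes cannot exceed the number of available vectors.
    k_tgt = min(k, tgt.shape[0])
    k_src = min(k, src.shape[0])

    # Mean similarity of each source vector to its k nearest target vectors.
    r_src = np.sort(sim, axis=1)[:, -k_tgt:].mean(axis=1)
    # Mean similarity of each target vector to its k nearest source vectors.
    r_tgt = np.sort(sim, axis=0)[-k_src:, :].mean(axis=0)

    # CSLS penalises "hub" vectors that are close to many others.
    csls = 2.0 * sim - r_src[:, None] - r_tgt[None, :]
    return csls.argmax(axis=1), csls

# Toy usage: the "target" space is a shuffled, slightly noisy copy of the source.
rng = np.random.default_rng(0)
toy_src = rng.normal(size=(50, 20))
toy_perm = rng.permutation(50)
toy_tgt = toy_src[toy_perm] + 0.01 * rng.normal(size=(50, 20))
toy_pred, toy_scores = csls_retrieve(toy_src, toy_tgt, k=10)
```

Compared with plain nearest-neighbour retrieval on cosine similarity, the two subtracted terms lower the score of vectors that sit in dense regions, which is the usual motivation for preferring CSLS in lexicon induction.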

Published in Journal of King Saud University: Computer and Information Sciences

ISSN: 1319-1578 (Print)
Publisher: Elsevier
Country of publisher: Saudi Arabia
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://www.journals.elsevier.com/journal-of-king-saud-university-computer-and-information-sciences/
