Chinese-Uyghur Bilingual Lexicon Extraction Based on Weak Supervision

Anwar Aysa; Mijit Ablimit; Hankiz Yilahun; Askar Hamdulla

doi:10.3390/info13040175

Information (Mar 2022)

Affiliations

Anwar Aysa: College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
Mijit Ablimit: College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
Hankiz Yilahun: College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
Askar Hamdulla: College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China

DOI: https://doi.org/10.3390/info13040175
Journal volume & issue: Vol. 13, no. 4, p. 175

Abstract

Bilingual lexicon extraction is useful, especially for low-resource languages that can leverage resources from high-resource languages. Uyghur is a derivational language whose language resources are scarce and noisy. Moreover, it is difficult to find bilingual resources through which the linguistic knowledge of high-resource languages, such as Chinese or English, can be exploited. There is little related research on unsupervised extraction for the Chinese-Uyghur language pair, and existing methods mainly focus on term extraction from translated parallel corpora. Unsupervised knowledge extraction methods are therefore particularly valuable for low-resource languages. This paper proposes a method to extract a Chinese-Uyghur bilingual dictionary by combining an inter-word relationship matrix with neural cross-lingual word embeddings. A seed dictionary serves as the weak supervision signal, and a small Chinese-Uyghur parallel data resource is used to map the word vectors of both languages into a unified vector space. Because the word-level units of the two languages do not correspond well, stems are used as the main linguistic units. The strong inter-word semantic relationships captured by word vectors are used to associate Chinese and Uyghur semantic information. Two retrieval criteria, nearest-neighbor retrieval and cross-domain similarity local scaling (CSLS), are used to compute similarity when extracting the bilingual dictionary. The experimental results show that the proposed method improves the accuracy of Chinese-Uyghur bilingual dictionary extraction to 65.06%. This method can help improve Chinese-Uyghur machine translation, automatic knowledge extraction, and multilingual translation.
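
The two retrieval criteria named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the mapping into the unified space is learned, so the sketch assumes a common choice, an orthogonal (Procrustes) map fitted on seed-dictionary pairs. The function names (`procrustes`, `nn_retrieve`, `csls_retrieve`), the neighborhood size `k`, and the synthetic vectors standing in for stem embeddings are all hypothetical.

```python
import numpy as np

def normalize(m):
    """Scale each row to unit length so dot products are cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def procrustes(src_seed, tgt_seed):
    """Orthogonal W minimizing ||src_seed @ W - tgt_seed||_F (assumed mapping)."""
    u, _, vt = np.linalg.svd(src_seed.T @ tgt_seed)
    return u @ vt

def nn_retrieve(src, tgt):
    """Nearest-neighbor retrieval: most cosine-similar target for each source."""
    sims = normalize(src) @ normalize(tgt).T
    return sims.argmax(axis=1)

def csls_retrieve(src, tgt, k=10):
    """CSLS: 2*cos(x, y) minus the mean similarity of x and of y to their
    k nearest cross-lingual neighbors, which penalizes 'hub' vectors."""
    sims = normalize(src) @ normalize(tgt).T
    k_t = min(k, sims.shape[1])
    k_s = min(k, sims.shape[0])
    r_src = np.sort(sims, axis=1)[:, -k_t:].mean(axis=1)  # per source word
    r_tgt = np.sort(sims, axis=0)[-k_s:, :].mean(axis=0)  # per target word
    csls = 2 * sims - r_src[:, None] - r_tgt[None, :]
    return csls.argmax(axis=1)

# Synthetic stand-in data: the "target" space is an exact rotation of the
# "source" space, so row i of src translates to row i of tgt.
rng = np.random.default_rng(0)
n_words, dim, n_seed = 50, 8, 20
src_vecs = rng.standard_normal((n_words, dim))
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
tgt_vecs = src_vecs @ rotation

# Fit the map on the first n_seed pairs only (the weak supervision signal).
W = procrustes(src_vecs[:n_seed], tgt_vecs[:n_seed])
mapped = src_vecs @ W

nn_pred = nn_retrieve(mapped, tgt_vecs)
csls_pred = csls_retrieve(mapped, tgt_vecs, k=10)
nn_accuracy = float((nn_pred == np.arange(n_words)).mean())
print("NN accuracy on synthetic data:", nn_accuracy)
```

On real embeddings the two criteria can disagree: nearest-neighbor retrieval tends to over-select target vectors that are close to many source words, and the CSLS correction terms are meant to reduce that effect.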

Published in Information

ISSN: 2078-2489 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://www.mdpi.com/journal/information/
