A Syllable-Based Technique for Uyghur Text Compression

Wayit Abliz; Hao Wu; Maihemuti Maimaiti; Jiamila Wushouer; Kahaerjiang Abiderexiti; Tuergen Yibulayin; Aishan Wumaier

doi:10.3390/info11030172

Information (Mar 2020)

Affiliations

Wayit Abliz: School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
Hao Wu: Xinjiang Laboratory of Multi-Language Information Technology, Xinjiang University, Urumqi 830046, China
Maihemuti Maimaiti: School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
Jiamila Wushouer: School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
Kahaerjiang Abiderexiti: School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
Tuergen Yibulayin: School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
Aishan Wumaier: School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China

DOI: https://doi.org/10.3390/info11030172
Journal volume & issue: Vol. 11, no. 3, p. 172

Abstract

To improve the utilization of text storage resources and the efficiency of data transmission, we proposed two syllable-based coding schemes for Uyghur text compression. First, based on syllable coverage statistics from a text corpus, we constructed 12-bit and 16-bit syllable code tables and added commonly used symbols, such as punctuation marks and ASCII characters, to the tables. To allow the schemes to process Uyghur text mixed with symbols from other languages, we introduced a flag code into the compression process to mark Unicode characters that are not in the code table. Experiments showed that the 12-bit coding scheme achieved an average compression ratio of 0.3 on Uyghur text smaller than 4 KB and that the 16-bit coding scheme achieved an average compression ratio of 0.5 on text smaller than 2 KB. Both schemes outperformed GZip, BZip2, and the LZW algorithm on short text and can be applied effectively to compressing short Uyghur text for storage and other applications.
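The idea of a fixed-width code table with a flag code for out-of-table characters can be illustrated with a minimal sketch. This is not the authors' implementation. The table entries (Latin placeholders rather than real Uyghur syllables), the choice of 0xFFF as the flag value, the 16-bit payload that follows the flag, and the greedy longest-match segmentation are all assumptions made for illustration.

```python
# Toy 12-bit code table. The entries and their code values are hypothetical;
# a real table would hold Uyghur syllables chosen from corpus statistics.
TABLE = ["ma", "ni", "la", "ti", "m", "a", "n", "i", " ", "."]
CODE_BITS = 12
FLAG = (1 << CODE_BITS) - 1  # assumed flag (escape) code: 0xFFF
ENCODE = {unit: code for code, unit in enumerate(TABLE)}
MAX_LEN = max(len(unit) for unit in TABLE)


def encode(text):
    """Return (packed bytes, number of meaningful bits) for text."""
    fields = []  # (value, bit width) pairs, in output order
    i = 0
    while i < len(text):
        # Greedy longest match against the table (stand-in for syllabification).
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            unit = text[i:i + n]
            if unit in ENCODE:
                fields.append((ENCODE[unit], CODE_BITS))
                i += n
                break
        else:
            # Not in the table: emit the flag, then the raw 16-bit code point.
            cp = ord(text[i])
            if cp > 0xFFFF:
                raise ValueError("sketch handles only 16-bit code points")
            fields.append((FLAG, CODE_BITS))
            fields.append((cp, 16))
            i += 1
    acc, nbits = 0, 0
    for value, width in fields:
        acc = (acc << width) | value
        nbits += width
    pad = -nbits % 8  # zero bits added to reach a whole number of bytes
    return (acc << pad).to_bytes((nbits + pad) // 8, "big"), nbits


def decode(data, nbits):
    """Invert encode(); nbits tells the decoder where the padding starts."""
    acc = int.from_bytes(data, "big") >> (len(data) * 8 - nbits)
    pos = nbits
    out = []

    def take(width):
        nonlocal pos
        pos -= width
        return (acc >> pos) & ((1 << width) - 1)

    while pos > 0:
        code = take(CODE_BITS)
        out.append(chr(take(16)) if code == FLAG else TABLE[code])
    return "".join(out)


payload, nbits = encode("mani Z.")
print(len(payload), nbits, decode(payload, nbits))
```

In this toy run, "Z" is not in the table, so it costs 12 + 16 bits, while each in-table unit costs 12 bits regardless of how many characters it covers. The sizes produced here say nothing about the compression ratios reported above, which depend on the real code tables and test corpus.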

Published in Information

ISSN: 2078-2489 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://www.mdpi.com/journal/information/
