Character Detection and Segmentation of Historical Uchen Tibetan Documents in Complex Situations

Ce Zhang; Weilan Wang; Huaming Liu; Guowei Zhang; Qiang Lin

doi:10.1109/ACCESS.2022.3151886

IEEE Access (Jan 2022)

Affiliations

Ce Zhang: Key Laboratory of China’s Ethnic Languages and Information Technology, Ministry of Education, Northwest Minzu University, Lanzhou, China
Weilan Wang: Key Laboratory of China’s Ethnic Languages and Information Technology, Ministry of Education, Northwest Minzu University, Lanzhou, China
Huaming Liu: School of Computer and Information Engineering, Fuyang Normal University, Fuyang, China
Guowei Zhang: Key Laboratory of China’s Ethnic Languages and Information Technology, Ministry of Education, Northwest Minzu University, Lanzhou, China
Qiang Lin: Key Laboratory of Streaming Data Computing and Application, Northwest Minzu University, Lanzhou, China

Journal volume & issue: Vol. 10, pp. 25376 – 25391

Abstract

Tibetan is a low-resource language, and the Tibetan culture carried by historical Tibetan documents is an important part of Chinese civilization. Studying these documents is of great significance for protecting Tibetan culture and promoting Chinese culture. Character segmentation is an important step in the image analysis and recognition of historical Tibetan documents, but three challenges make it difficult: 1) the text lines are tilted and twisted to different degrees; 2) character strokes frequently overlap, cross, touch, or break; and 3) the documents were written by different people with different stroke styles. To address these challenges, we propose a character segmentation method for historical Tibetan documents based on key feature information. The method consists of three parts: 1) projection and syllable point location information are used to shorten the text lines and build a character block database; 2) the local baseline of each character block is detected using the location of the syllable points, alone or combined with horizontal projection and straight line detection, and the block is then divided into the areas above and below the baseline, which are segmented with different methods; and 3) to cope with the large differences in stroke style, three stroke attribution distances are used to assign strokes to characters. The experimental results show that the proposed method effectively segments characters in historical Tibetan documents and can serve as a reference for character segmentation in related documents.
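
The horizontal projection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a binarized character block (1 for ink, 0 for background) and assumes that the local baseline is simply the row with the most ink, ignoring the syllable point and straight line detection cues the paper combines with it. The names `horizontal_projection`, `estimate_baseline`, and `split_at_baseline` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def horizontal_projection(block: Image) -> List[int]:
    """Count the ink pixels in every row of a binary character block."""
    return [sum(row) for row in block]


def estimate_baseline(block: Image) -> int:
    """Return the index of the row with the most ink (first row on ties).

    Assumption for this sketch only: the densest row is the local baseline.
    """
    profile = horizontal_projection(block)
    if not profile:
        raise ValueError("empty character block")
    return max(range(len(profile)), key=profile.__getitem__)


def split_at_baseline(block: Image) -> Tuple[Image, Image]:
    """Split a block into the rows above the baseline and the rows from it down."""
    baseline = estimate_baseline(block)
    return block[:baseline], block[baseline:]


# Toy 5x6 block: row 2 is a full horizontal stroke.
block = [
    [0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
above, below = split_at_baseline(block)
print(estimate_baseline(block), len(above), len(below))
```

Splitting at the densest row gives two regions that could then be handled by different segmentation rules, which is the role the abstract assigns to the areas above and below the baseline. A real document image would also need the tilt and twist of the text line handled first, which this sketch does not attempt.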

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
