A Benchmark of Parsing Vietnamese Publications

Khang Nguyen; Thuan Trong Nguyen; Thuan Q. Nguyen; An Nguyen; Nguyen D. Vo; Tam V. Nguyen

doi:10.1109/ACCESS.2022.3183193

IEEE Access (Jan 2022)

Affiliations

Khang Nguyen: University of Information Technology, Ho Chi Minh City, Vietnam
Thuan Trong Nguyen: University of Information Technology, Ho Chi Minh City, Vietnam
Thuan Q. Nguyen: University of Information Technology, Ho Chi Minh City, Vietnam
An Nguyen: University of Information Technology, Ho Chi Minh City, Vietnam
Nguyen D. Vo: University of Information Technology, Ho Chi Minh City, Vietnam
Tam V. Nguyen: University of Dayton, Ohio, OH, USA

Journal volume & issue: Vol. 10, pp. 65284–65299

Abstract

In recent decades, digital transformation has received growing attention worldwide, driving an explosion of digitized document data. In this paper, we address the problem of parsing publications, in particular Vietnamese publications. Vietnamese publications are known for their highly varied and diverse layouts, and some characters are visually ambiguous because of accent symbols and derivative characters, which poses many challenges. To this end, we collect the UIT-DODV-Ext dataset: a challenging Vietnamese document image dataset comprising scientific papers and textbooks, with 5,000 fully annotated images. We introduce a general framework for parsing Vietnamese publications that contains two components: page object detection and caption recognition. We further conduct an extensive benchmark of various state-of-the-art object detection and text recognition methods. Finally, we present a hybrid parser that achieves the top place in the benchmark. Extensive experiments on the UIT-DODV-Ext dataset provide a comprehensive evaluation and insightful analysis.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
