Scientific Data (Nov 2024)
Joint variation and ZhuYin dataset for Traditional Chinese document enhancement
Abstract
Digital documents play a crucial role in contemporary information management, but their quality can be significantly impaired by factors such as hand-drawn annotations, image distortion, watermarks, stains, and degradation. Deep learning-based methods have emerged as powerful tools for document enhancement, yet their effectiveness relies heavily on the availability of high-quality training and evaluation datasets. Such benchmark datasets are relatively scarce, particularly for Traditional Chinese documents. To address this gap, we introduce a novel dataset, the “Joint Variation and ZhuYin dataset (JVZY)”. It comprises 20,000 images and 1.92 million words, encompasses a variety of document degradation characteristics, and includes the ZhuYin phonetic symbols unique to Traditional Chinese, meeting this specific localization requirement. By releasing this dataset, we aim to build a continuously evolving resource tailored to the diverse needs of Traditional Chinese document enhancement, and to facilitate the development of applications that effectively handle the unique phonetic symbols and varied degradation characteristics encountered in Traditional Chinese documents.