Nongye tushu qingbao xuebao (Sep 2021)
A Fine-grained Extraction Method of Chapter Structure of Documents Based on PDF Layout Features
Abstract
[Purpose/Significance] This paper proposes a fine-grained method for automatically extracting document structure from PDF layout features, with the aim of organizing literature resources at a fine granularity and meeting users' growing need for precise information services. [Method/Process] The method uses machine-learning classification to automatically analyze, identify, and extract the chapter titles of unstructured PDF documents based on their layout features. Using the coordinate positions of the chapter titles, it then automatically assigns the body content to the title it belongs under, with the paragraph as the minimum unit, and thereby achieves fine-grained extraction and identification of the document's full text. [Results/Conclusions] Test results show that the average accuracy of automatic extraction can reach 80%. The proposed method for fine-grained extraction of unstructured PDF documents has practical significance and application prospects. A data processing system built on the method has been put into practical use and will greatly reduce the mechanical drudgery of chapter structure extraction tasks.
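The coordinate-based matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single-column layout, that each text block already carries a page index and a vertical position measured from the top of the page, and that an upstream classifier has set the `is_title` flag. The names `Block`, `Section`, and `group_by_title` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One text block taken from a PDF page (hypothetical structure)."""
    page: int       # page index, starting at 0
    top: float      # distance from the top of the page; smaller means higher up
    text: str
    is_title: bool  # assumed to come from a layout-feature title classifier


@dataclass
class Section:
    """A chapter title together with the paragraphs placed under it."""
    title: str
    paragraphs: list = field(default_factory=list)


def group_by_title(blocks):
    """Attach each paragraph to the closest title that precedes it in reading order."""
    sections = []
    current = Section(title="")  # collects any text that appears before the first title
    # Reading order for a single-column layout: by page, then top to bottom.
    for block in sorted(blocks, key=lambda b: (b.page, b.top)):
        if block.is_title:
            if current.title or current.paragraphs:
                sections.append(current)
            current = Section(title=block.text)
        else:
            current.paragraphs.append(block.text)
    if current.title or current.paragraphs:
        sections.append(current)
    return sections


if __name__ == "__main__":
    demo = [
        Block(0, 300.0, "Paragraph A", False),
        Block(0, 100.0, "1 Introduction", True),
        Block(1, 50.0, "Paragraph B", False),
        Block(1, 200.0, "2 Method", True),
        Block(1, 400.0, "Paragraph C", False),
    ]
    for section in group_by_title(demo):
        print(section.title, section.paragraphs)
```

Sorting by page and vertical position before grouping means a paragraph that continues onto the next page still falls under the title that precedes it. A multi-column layout would need a column index added to the sort key.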
Keywords