Systems Science & Control Engineering (Dec 2024)
A large-scale Chinese patent dataset for information extraction
Abstract
Information extraction is an important foundation for automated patent analysis. Deep learning methods show promising results for information extraction, the performance of such methods heavily depends on the available corpus. To promote research on Chinese information extraction and evaluate the performance of related systems, we present a novel dataset, named CPIE, and make it publicly available. The dataset consisting of five thousands records of Chinese patent documents. The data were annotated by a tagging team using an on-line annotation tools. The dataset was evaluated using a state-of-the-art information extraction method that involves named entity recognition and relationship classification. The results shed light on new challenges and promote information extraction research.
Keywords