IEEE Access (Jan 2022)

ZoomNet for Topic-Oriented Fragment Recognition in Long Documents

  • Yukun Yan,
  • Daqi Zheng,
  • Zhengdong Lu,
  • Sen Song

DOI
https://doi.org/10.1109/ACCESS.2022.3166235
Journal volume & issue
Vol. 10
pp. 39545 – 39554

Abstract

This work introduces a new information extraction task called Topic-Oriented Fragment Recognition (TOFR), whose goal is to recognize information related to a specific topic in long documents from professional fields. We introduce two TOFR datasets for studying the problems of processing long documents. We propose a novel neural framework named Zooming Network (ZoomNet), which overcomes the challenge of combining information over long distances with limited computing resources by flexibly switching between skimming and intensive reading as it processes a long document. ZoomNet first establishes a hierarchical representation aligned to the text structure, which relieves the conflict between local information and extensive contextual information. It then synthesizes information from the different levels to assign tags via multi-scale actions. We train the model with a combination of supervised and reinforcement learning. Experiments show that the proposed model outperforms several state-of-the-art sequence labeling models, including BiLSTM-CRF, BERT, XLNet, RoBERTa, and ELECTRA, by large margins on both TOFR datasets.
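
The idea of switching between skimming and intensive reading can be illustrated with a toy sketch. This is not the authors' model: the abstract does not specify the architecture, so everything below is an assumption made for illustration. The names `tag_document` and `make_keyword_policy`, the `ZOOM` action, and the `TOPIC`/`O` tags are hypothetical, and a hand-written keyword rule stands in for the learned policy. The document is a nested list (paragraphs, sentences, words), and at each unit the policy either tags the whole unit in one action or zooms in to its children.

```python
from typing import Callable, List, Tuple, Union

# A unit is a word (str) or a list of smaller units:
# document -> paragraphs -> sentences -> words.
Unit = Union[str, List["Unit"]]

ZOOM = "ZOOM"  # hypothetical action: descend one level instead of tagging


def flatten(unit: Unit) -> List[str]:
    """Return the words of a unit in reading order."""
    if isinstance(unit, str):
        return [unit]
    words: List[str] = []
    for child in unit:
        words.extend(flatten(child))
    return words


def make_keyword_policy(keywords: set) -> Callable[[Unit], str]:
    """Toy stand-in for a learned policy (assumption, not from the paper)."""
    def policy(unit: Unit) -> str:
        words = flatten(unit)
        hits = sum(w.lower() in keywords for w in words)
        if hits == 0:
            return "O"        # skim: nothing on topic, tag the whole unit at once
        if hits == len(words):
            return "TOPIC"    # every word is on topic, tag the whole unit at once
        return ZOOM           # mixed content: read the children more closely
    return policy


def tag_document(unit: Unit, policy: Callable[[Unit], str]) -> Tuple[List[str], int]:
    """Return one tag per word plus the number of policy decisions taken."""
    action = policy(unit)
    if action != ZOOM:
        # Multi-scale action: a single decision labels every word in the unit.
        return [action] * len(flatten(unit)), 1
    if isinstance(unit, str):
        raise ValueError("policy must return a tag, not ZOOM, for a single word")
    tags: List[str] = []
    decisions = 1
    for child in unit:
        child_tags, child_decisions = tag_document(child, policy)
        tags.extend(child_tags)
        decisions += child_decisions
    return tags, decisions


# Example document: two paragraphs, the second one mentions the topic.
doc = [
    [["the", "weather", "is", "nice"], ["we", "went", "home"]],
    [["dosage", "is", "5", "mg"]],
]
tags, decisions = tag_document(doc, make_keyword_policy({"dosage", "mg"}))
print(tags, decisions)
```

In this toy run the off-topic first paragraph is labeled with a single decision, while the second paragraph is read word by word, so fewer decisions are taken than there are words. In the paper the choice between tagging and zooming is learned with supervised and reinforcement learning rather than fixed by a rule.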

Keywords