IEEE Access (Jan 2018)

Integrating Deep Learning Approaches for Identifying News Reprint Relation

  • Yin Luo,
  • Fangfang Wang,
  • Jun Chen,
  • Lei Wang,
  • Daniel Dajun Zeng

DOI
https://doi.org/10.1109/ACCESS.2018.2882624
Journal volume & issue
Vol. 6
pp. 72163 – 72172

Abstract

Read online

With the rapid development of big data and new media technologies, a large amount of original news is generated and reprinted on the Internet via news portals. Identifying news reprint relations is of great importance for the analysis of news diffusion patterns and copyright protection. However, the amount of news data on the Internet creates a huge challenge for efficiently identifying news reprint relation. Some existing studies focus on computing the similarity of the full text of news reports, which is not always effective, because some reprints only excerpt some sentences of the original news reports. The core challenge of improving identification accuracy is excavating the potential semantic relevance between news articles at the sentence level. Inspired by deep learning and semantic-based text representation models, this paper proposes an approach for identifying news reprint relation by integrating deep learning approaches. First, news reports that are not related to the topic of the original news report are removed via topic correlation mining. Then, the potential semantic relevance is excavated at the sentence level through the integration of semantic analysis methods, and reprint relations are identified between news reports. The performance of the approach is empirically evaluated using a real-world dataset. Experimental results show that the semantic analysis model integration allows us to mine in-depth semantic associations between news stories and accurately identify news reprint relations. These results benefit news diffusion pattern analysis and copyright protection.

Keywords