Computational Linguistics (Jan 2022)

Novelty Detection: A Perspective from Natural Language Processing

  • Tirthankar Ghosal
  • Tanik Saikh
  • Tameesh Biswas
  • Asif Ekbal
  • Pushpak Bhattacharyya

DOI
https://doi.org/10.1162/coli_a_00429
Journal volume & issue
Vol. 48, no. 1
pp. 77 – 117

Abstract

The quest for new information is an inborn human trait and has always been quintessential for human survival and progress. Novelty drives curiosity, which in turn drives innovation. In Natural Language Processing (NLP), Novelty Detection refers to finding text that offers new information with respect to whatever has been seen or known before. With the exponential growth of information across the Web comes an accompanying menace of redundancy. A considerable portion of Web content consists of duplicates, and we need efficient mechanisms to retain new information and filter out redundant information. However, detecting redundancy at the semantic level and identifying novel text is not straightforward, because a text may have little lexical overlap with its sources yet convey the same information. On top of that, non-novel/redundant information in a document may have been assimilated from multiple source documents, not just one. The problem is compounded when the unit of discourse is an entire document and numerous prior documents must be processed to ascertain the novelty or non-novelty of the one currently in question. In this work, we build upon our earlier investigations into document-level novelty detection and present a comprehensive account of our efforts toward the problem. We explore the role of pre-trained Textual Entailment (TE) models in dealing with multiple source contexts and present the outcome of our current investigations. We argue that a multi-premise entailment task is one close approximation toward identifying semantic-level non-novelty. Our recent approach either performs comparably or achieves significant improvement over the latest reported results on several datasets and across several related tasks (paraphrase, plagiarism, and rewrite detection). We critically analyze our performance with respect to the existing state of the art and show the superiority and promise of our approach for future investigations. We also present our enhanced dataset TAP-DLND 2.0 and several baselines to the community for further research on document-level novelty detection.
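To make the core idea concrete, the sketch below shows one way a pre-trained textual entailment (NLI) model could be used to flag non-novel content against multiple source premises. This is a minimal illustration, not the authors' method: it assumes the Hugging Face transformers library and the publicly available roberta-large-mnli checkpoint, and the helper names (entailment_prob, novelty_score) as well as the max-over-premises threshold heuristic are assumptions introduced for illustration only.

```python
# Minimal sketch: using a pre-trained textual-entailment (NLI) model to
# score document-level novelty against multiple source premises.
# Illustrative only; the aggregation heuristic is NOT the paper's method.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # any MNLI-style checkpoint would do
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

# Locate the "entailment" class from the checkpoint's own label map.
ENTAIL_IDX = next(i for i, lbl in model.config.id2label.items()
                  if lbl.lower() == "entailment")


def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that `premise` entails `hypothesis` under the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, ENTAIL_IDX].item()


def novelty_score(target_sentences, source_sentences, threshold=0.5):
    """Fraction of target sentences not entailed by any source sentence.

    Each source sentence is treated as a separate premise, a crude
    stand-in for true multi-premise entailment.
    """
    novel = 0
    for hyp in target_sentences:
        best = max(entailment_prob(prem, hyp) for prem in source_sentences)
        if best < threshold:  # no source premise covers this content
            novel += 1
    return novel / max(len(target_sentences), 1)


if __name__ == "__main__":
    sources = ["The company reported record profits in the last quarter.",
               "Its CEO announced an expansion into Asian markets."]
    target = ["Record quarterly profits were reported by the firm.",
              "The firm also plans to acquire a European competitor."]
    print(f"novelty = {novelty_score(target, sources):.2f}")
```

In this toy run, the first target sentence is largely entailed by the sources (non-novel), while the second introduces unsupported content and raises the document's novelty score; a real system would need a principled way to combine evidence across premises rather than a simple max-and-threshold rule.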