Journal of Information Science Theory and Practice (Mar 2015)

Query Formulation for Heuristic Retrieval in Obfuscated and Translated Partially Derived Text

  • Kumar, Aarti
  • Das, Sujoy

DOI
https://doi.org/10.1633/JISTaP.2015.3.1.2
Journal volume & issue
Vol. 3, no. 1
pp. 24 – 39

Abstract

Pre-retrieval query formulation is an important step in identifying local text reuse. Local reuse involving high obfuscation, paraphrasing, or translation makes it challenging to find the reused text in a document. This paper studies three pre-retrieval query formulation strategies for heuristic retrieval of low-obfuscated, highly obfuscated, and translated text: (a) query formulation using proper nouns; (b) query formulation using unique words (Hapax); and (c) query formulation using the most frequent words. For low and high obfuscation and simulated paraphrasing, keywords with Hapax proved slightly more efficient. Initial results nonetheless indicate that the simple strategy of query formulation using proper nouns gives promising results and may prove better at reducing the size of the corpus for post-processing when identifying local text reuse in obfuscated and translated text.
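
The three strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the tiny stopword list, and the capitalization heuristic used in place of a part-of-speech tagger are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was", "for", "on", "by", "with"}


def tokenize(text):
    """Split text into alphabetic word tokens, preserving case."""
    return re.findall(r"[A-Za-z]+", text)


def proper_noun_query(text):
    """Strategy (a): proper nouns, approximated here as capitalized words
    that are not the first word of a sentence (an assumed heuristic)."""
    terms = []
    for sentence in re.split(r"[.!?]+\s*", text):
        words = tokenize(sentence)
        for word in words[1:]:  # skip the sentence-initial word
            if word[0].isupper() and word not in terms:
                terms.append(word)
    return terms


def hapax_query(text):
    """Strategy (b): unique words (hapax), i.e. words occurring exactly once."""
    counts = Counter(w.lower() for w in tokenize(text))
    return [w for w, c in counts.items() if c == 1 and w not in STOPWORDS]


def frequent_query(text, k=3):
    """Strategy (c): the k most frequent non-stopword words."""
    counts = Counter(
        w.lower() for w in tokenize(text) if w.lower() not in STOPWORDS
    )
    return [w for w, _ in counts.most_common(k)]


if __name__ == "__main__":
    sample = ("The river Ganga flows through Varanasi. "
              "The river is sacred, and the river Ganga is long.")
    print(proper_noun_query(sample))
    print(hapax_query(sample))
    print(frequent_query(sample, k=2))
```

Each function returns a list of query terms that could be submitted to a retrieval engine to shortlist candidate source documents. A real system would likely replace the capitalization heuristic with a proper part-of-speech tagger, and handling translated text would need a translation step that this sketch omits.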

Keywords