Journal of Engineering Science and Technology (Dec 2017)
IDENTIFYING DOCUMENT-LEVEL TEXT PLAGIARISM: A TWO-PHASE APPROACH
Abstract
The rapid growth of information content and its ease of access have made research and academia highly vulnerable to plagiarism. Plagiarism is an act of intellectual theft and information breach that must be curbed to ensure educational integrity. Plagiarism checking usually requires exhaustive comparison of a suspected document against large repositories and databases. This paper presents a two-phase candidate retrieval approach that effectively reduces the search space for the plagiarism detection task. An initial heuristic retrieval step is carried out before the exhaustive analysis to retrieve the documents that are globally similar to, or near duplicates of, the suspected document. The approach targets an offline plagiarism detection system and can identify plagiarized sources at different levels of complexity; the source document database is assumed to be available offline, so online source retrieval is outside the scope of this work. It integrates document ranking with the vector space model in the first phase and N-gram models in the second phase, which serves as the candidate refinement stage. The proposed approach is evaluated on the standard plagiarism corpus provided by the PAN-14 text alignment data set, and its performance is analysed using the standard IR measures, viz., precision, recall and F1-score. It is compared with the vector space model and N-gram models. Further statistical analysis is carried out using a paired t-test on the mean F1-scores of these techniques over samples extracted from the PAN-14 set. Experimental results show that the proposed two-phase candidate selection approach outperforms the compared models, particularly in the comparison and retrieval of complex and manipulated text.
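The two-phase idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the term weighting, the N-gram similarity measure, or any parameter values, so everything below is an assumption made for illustration. Phase 1 is assumed to rank sources by TF-IDF cosine similarity, phase 2 is assumed to use word N-gram containment, and the names `two_phase_candidates`, `top_k`, `n` and `threshold` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed preprocessing: lowercase alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(token_lists):
    # Sparse TF-IDF vectors (dicts) with a smoothed IDF; the weighting
    # scheme is an assumption, the abstract only names the vector space model.
    n_docs = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()}
            for tokens in token_lists]


def cosine(u, v):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def word_ngrams(tokens, n):
    # Set of word n-grams (as tuples) in a token list.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspect_tokens, source_tokens, n):
    # Fraction of the suspect's n-grams that also occur in the source.
    # Containment is an assumed choice of N-gram measure.
    suspect_grams = word_ngrams(suspect_tokens, n)
    if not suspect_grams:
        return 0.0
    shared = suspect_grams & word_ngrams(source_tokens, n)
    return len(shared) / len(suspect_grams)


def two_phase_candidates(suspect, sources, top_k=2, n=3, threshold=0.3):
    # sources: dict mapping a source name to its text.
    # top_k, n and threshold are illustrative values, not taken from the paper.
    names = list(sources)
    token_lists = [tokenize(sources[name]) for name in names]
    suspect_tokens = tokenize(suspect)

    # Phase 1: rank all sources by cosine similarity and keep the top_k.
    vectors = tfidf_vectors(token_lists + [suspect_tokens])
    suspect_vec = vectors[-1]
    ranked = sorted(range(len(names)),
                    key=lambda i: cosine(suspect_vec, vectors[i]),
                    reverse=True)
    shortlist = ranked[:top_k]

    # Phase 2: refine the shortlist with word n-gram containment.
    return [names[i] for i in shortlist
            if containment(suspect_tokens, token_lists[i], n) >= threshold]
```

In this sketch the first phase is cheap and prunes the repository to a short list, so the more expensive N-gram comparison runs only on `top_k` documents rather than on the whole collection, which is the search-space reduction the abstract describes.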