Automatic Detection of Plagiarism in Writing

Mahshad Davoodifard

Studies in Applied Linguistics & TESOL (Jan 2022)

Affiliations

Mahshad Davoodifard: Teachers College, Columbia University

Journal volume & issue: Vol. 21, no. 2

Abstract

This paper reports on preliminary steps toward building an external plagiarism detection tool. Using the PAN-PC-11 data sets, I computed tf-idf scores for the text documents and cosine similarity between source and suspicious documents to find text overlap. The model successfully created the document vectors and computed the similarity measures. However, the algorithm was not extended to retrieve related documents automatically and carry out the remaining steps of the pipeline: converting texts to n-grams for detailed analysis, identifying the best match as the source of plagiarism, and evaluating the accuracy of the model. Instead, the model produced a cosine similarity matrix for all the documents, which I used to retrieve documents manually and check them for overlap using online tools. While extending the algorithm along the suggested pipeline would allow a more accurate evaluation of the model, manual comparison of sample documents provided some evidence of validity for the model developed in the present study.
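The vector-and-similarity step described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not say which tooling, tokenizer, or tf-idf weighting was used, so the whitespace tokenizer, the smoothed idf formula, the function names, and the toy documents below are all assumptions made for illustration. The sketch builds tf-idf vectors over the combined collection and returns a matrix with one row per suspicious document and one column per source document.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"


def tokenize(text):
    # Assumed tokenizer: lowercase, split on whitespace, trim punctuation.
    words = (w.strip(PUNCT) for w in text.lower().split())
    return [w for w in words if w]


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        # Assumed weighting: relative term frequency times a smoothed idf.
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(sources, suspicious):
    """Rows are suspicious documents, columns are source documents."""
    vectors = tfidf_vectors(sources + suspicious)
    src_vecs = vectors[:len(sources)]
    sus_vecs = vectors[len(sources):]
    return [[cosine(q, s) for s in src_vecs] for q in sus_vecs]


# Toy documents (hypothetical, not taken from PAN-PC-11).
sources = [
    "The quick brown fox jumps over the lazy dog.",
    "Tf-idf weights terms by how rare they are across a corpus.",
]
suspicious = ["A quick brown fox jumped over a lazy dog."]

matrix = similarity_matrix(sources, suspicious)
for row in matrix:
    print([round(score, 3) for score in row])
```

In a matrix like this, the highest-scoring column in each row marks the source document that would be retrieved for closer inspection, which is the step the abstract describes as having been done manually.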

Published in Studies in Applied Linguistics & TESOL

ISSN: 2689-193X (Online)
Publisher: Columbia University Libraries
Country of publisher: United States
LCC subjects: Education: Theory and practice of education; Language and Literature: English language
Website: https://tesolal.columbia.edu/
