Cross-Language Plagiarism Detection System Using Latent Semantic Analysis and Learning Vector Quantization

Anak Agung Putri Ratna; Prima Dewi Purnamasari; Boma Anantasatya Adhi; F. Astha Ekadiyanto; Muhammad Salman; Mardiyah Mardiyah; Darien Jonathan Winata

doi:10.3390/a10020069

Algorithms (Jun 2017)

Affiliations

Anak Agung Putri Ratna: Department of Electrical Engineering, Faculty of Engineering, Universitas Indonesia, Depok 16424, Indonesia
Prima Dewi Purnamasari: Department of Electrical Engineering, Faculty of Engineering, Universitas Indonesia, Depok 16424, Indonesia
Boma Anantasatya Adhi: Department of Electrical Engineering, Faculty of Engineering, Universitas Indonesia, Depok 16424, Indonesia
F. Astha Ekadiyanto: Department of Electrical Engineering, Faculty of Engineering, Universitas Indonesia, Depok 16424, Indonesia
Muhammad Salman: Department of Electrical Engineering, Faculty of Engineering, Universitas Indonesia, Depok 16424, Indonesia
Mardiyah Mardiyah: Department of Electrical Engineering, Faculty of Engineering, Universitas Indonesia, Depok 16424, Indonesia
Darien Jonathan Winata: Department of Electrical Engineering, Faculty of Engineering, Universitas Indonesia, Depok 16424, Indonesia

DOI: https://doi.org/10.3390/a10020069
Journal volume & issue: Vol. 10, no. 2, p. 69

Abstract

Computerized cross-language plagiarism detection has recently become essential. With the scarcity of scientific publications in Bahasa Indonesia, many Indonesian authors frequently consult publications in English in order to boost the quantity of scientific publications in Bahasa Indonesia (which is currently rising). Due to the syntax disparity between Bahasa Indonesia and English, most of the existing methods for automated cross-language plagiarism detection do not provide satisfactory results. This paper analyses the feasibility of developing Latent Semantic Analysis (LSA) for a computerized cross-language plagiarism detector for two languages with different syntax. To improve performance, various alterations in LSA are suggested. By using a learning vector quantization (LVQ) classifier in the LSA and taking into account the Frobenius norm, output has reached up to 65.98% in accuracy. The results of the experiments showed that the best accuracy achieved is 87% with a document size of 6 words, and the document definition size must be kept below 10 words in order to maintain high accuracy. Additionally, based on experimental results, this paper suggests utilizing the frequency occurrence method as opposed to the binary method for the term–document matrix construction.
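As an illustration of the two term–document matrix constructions contrasted in the abstract, the following minimal Python sketch builds both a frequency-occurrence matrix and a binary matrix, then computes the Frobenius norm of a rank-k truncated SVD, which is the usual reduction step in LSA. The function names, the whitespace tokenizer, and the toy sentences are assumptions made for this example; the abstract does not specify how the authors tokenize text or how the Frobenius norm is fed to the LVQ classifier, so none of that is modelled here.

```python
import numpy as np


def term_document_matrix(docs, vocab, binary=False):
    """Build a term-document matrix (rows: terms, columns: documents).

    With binary=False each cell holds the number of occurrences of the
    term in the document (frequency occurrence method). With binary=True
    each cell is 1.0 if the term occurs at least once, else 0.0.
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    index = {term: i for i, term in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for token in doc.lower().split():
            if token in index:
                matrix[index[token], j] += 1.0
    return (matrix > 0).astype(float) if binary else matrix


def truncated_frobenius_norm(matrix, k):
    """Frobenius norm of the rank-k truncated SVD of `matrix`.

    The Frobenius norm of a matrix equals the square root of the sum of
    its squared singular values, so keeping the k largest singular
    values gives the norm of the rank-k approximation.
    """
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    return float(np.sqrt(np.sum(singular_values[:k] ** 2)))


if __name__ == "__main__":
    # Toy documents, hypothetical and unrelated to the paper's corpus.
    docs = ["the cat sat on the mat", "the dog sat"]
    vocab = sorted({word for doc in docs for word in doc.split()})

    freq = term_document_matrix(docs, vocab)
    flags = term_document_matrix(docs, vocab, binary=True)

    print("frequency matrix:\n", freq)
    print("binary matrix:\n", flags)
    print("rank-1 Frobenius norm (frequency):", truncated_frobenius_norm(freq, 1))
    print("rank-1 Frobenius norm (binary):", truncated_frobenius_norm(flags, 1))
```

The two matrices differ only where a term repeats within a document (here, "the" in the first sentence), which is the information the frequency occurrence method retains and the binary method discards.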

Published in Algorithms

ISSN: 1999-4893 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.mdpi.com/journal/algorithms
