Similarity Identification of Large-scale Biomedical Documents using Cosine Similarity and Parallel Computing

Merlinda Wibowo; Christoph Quix; Nur Syahela Hussien; Herman Yuliansyah; Faisal Dharma Adhinata

doi:10.17977/um018v4i22021p105-116

Knowledge Engineering and Data Science (Feb 2022)

Affiliations

Merlinda Wibowo: Institut Teknologi Telkom Purwokerto
Christoph Quix: Information Systems & Data Science, Hochschule Niederrhein
Nur Syahela Hussien: Universiti Kuala Lumpur Malaysian Institute of Information Technology (UniKL MIIT)
Herman Yuliansyah: Informatics Department, Universitas Ahmad Dahlan
Faisal Dharma Adhinata: Institut Teknologi Telkom Purwokerto

DOI: https://doi.org/10.17977/um018v4i22021p105-116
Journal volume & issue: Vol. 4, no. 2
pp. 105 – 116

Abstract

Document similarity computation is an important research topic in information retrieval and a crucial issue for automatic document categorization. The similarity value lies between 0 and 1: the closer the value is to 1, the more relevant the two documents are considered to each other, and vice versa. However, the large scale of textual information makes it difficult to find the relevance level between documents. In this study, the relevance between MeSH heading texts of PubMed documents is found to be higher than the relevance between their abstract texts. Furthermore, parallel computing is implemented to speed up the large-scale document similarity identification process, which is calculated automatically in the PubMed application. The execution time for the MeSH headings is 15.447 seconds, and the execution time for the abstracts is 74.191 seconds. The execution time for the MeSH headings is lower than that for the abstracts because the abstracts contain more words than the MeSH headings. This study has successfully identified the similarity between large-scale biomedical documents from PubMed by implementing a cosine similarity algorithm. The results, presented as a graph and a table in the PubMed application, show that the cosine similarity of the MeSH heading texts is higher than that of the abstract texts. Cosine similarity is useful for measuring the similarity between documents based on the TF*IDF calculation result.
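
As an illustration of the approach summarized above, the following is a minimal sketch of TF*IDF weighting followed by pairwise cosine similarity. It is not the authors' implementation: the abstract does not specify the tokenization, the exact TF*IDF variant, or the parallel framework. Here whitespace tokenization, raw term frequency multiplied by log(N / df), and a Python thread pool are all assumptions, and the function names are hypothetical. In CPython a thread pool mainly shows how the pairs can be distributed; a process pool or a cluster framework would likely be needed for a real speed-up on CPU-bound work.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document (raw TF * log(N / df))."""
    # Assumption: lowercase + whitespace tokenization, no stemming or stop words.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarity(docs, workers=4):
    """Score every document pair, spreading the pairs over a thread pool."""
    vectors = tf_idf_vectors(docs)
    pairs = list(combinations(range(len(docs)), 2))

    def score(pair):
        i, j = pair
        return cosine(vectors[i], vectors[j])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(score, pairs))
    return dict(zip(pairs, scores))


if __name__ == "__main__":
    # Toy inputs standing in for MeSH heading strings; not data from the study.
    sample = ["heart disease risk", "heart disease therapy", "plant genome"]
    for pair, value in pairwise_similarity(sample).items():
        print(pair, round(value, 3))
```

Because the TF*IDF weights are non-negative, each score falls between 0 and 1, matching the range described in the abstract: documents that share no terms score 0, and documents with identical weighted term profiles score 1.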

Published in Knowledge Engineering and Data Science

ISSN: 2597-4602 (Print); 2597-4637 (Online)
Publisher: Universitas Negeri Malang
Country of publisher: Indonesia
LCC subjects: Bibliography. Library science. Information resources: Information resources (General); Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://journal2.um.ac.id/index.php/keds/index