Investigating text power in predicting semantic similarity

Zahra Yousefi; Hajar Sotudeh; Mahdieh Mirzabeigi; Seyed Mostafa Fakhrahmad; Alireza Nikseresht; Mehdi Mohammadi

International Journal of Information Science and Management (Jan 2019)

Affiliations

Zahra Yousefi: Department of Knowledge & Information Sciences, Faculty of Education & Psychology, Eram Campus, Shiraz University, Shiraz, Iran
Hajar Sotudeh: Department of Knowledge & Information Sciences, Faculty of Education & Psychology, Eram Campus, Shiraz University, Shiraz, Iran
Mahdieh Mirzabeigi: Department of Knowledge & Information Sciences, Faculty of Education & Psychology, Eram Campus, Shiraz University, Shiraz, Iran
Seyed Mostafa Fakhrahmad: Department of Computer Science & Engineering, School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran
Alireza Nikseresht: Department of Knowledge & Information Sciences, Faculty of Education & Psychology, Eram Campus, Shiraz University, Shiraz, Iran
Mehdi Mohammadi: Department of Educational Management and Planning, Faculty of Education & Psychology, Eram Campus, Shiraz University, Shiraz, Iran

Journal volume & issue: Vol. 17, no. 1

Abstract

This article presents an empirical evaluation of the distributional semantic power of three text levels (abstract, body, and full text) in predicting semantic similarity, using a collection of open access articles from PubMed. Semantic similarity is measured by two criteria: linear MeSH term intersection and hierarchical MeSH term distance. A random sample of 200 queries and 20,000 documents is selected from a test collection built with the CITREC open source code. The SimPack Java library is used to calculate the textual and semantic similarities. The nDCG value corresponding to each of the two semantic similarity criteria is calculated at three precision points. Finally, the nDCG values are compared using the Friedman test to determine the power of each text level in predicting semantic similarity. The results show that text is effective in representing semantic similarity: texts with maximum textual similarity are also 77% and 67% semantically similar under the linear and hierarchical criteria, respectively. Furthermore, text length is found to matter more for representing hierarchical semantic similarity than linear semantic similarity. Based on these findings, it is concluded that when the subjects are homogeneous in the tree of knowledge, abstracts provide effective semantic capabilities, whereas in heterogeneous milieus, full-text processing or knowledge bases are needed to achieve IR effectiveness.
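
As a rough illustration of the evaluation idea described in the abstract, the following Python sketch scores a list of documents ranked by textual similarity against a MeSH-based relevance signal and computes nDCG at a cutoff. It is not the authors' implementation (the study used the SimPack Java library). The function names, the use of Jaccard overlap as a stand-in for the linear MeSH term intersection criterion, the sample MeSH terms, and the cutoff value are all assumptions made for the example.

```python
import math


def mesh_overlap(query_terms, doc_terms):
    """Jaccard overlap of two MeSH term sets.

    Assumption: used here as a simple stand-in for the 'linear MeSH term
    intersection' criterion; the paper's exact formula is not given in the abstract.
    """
    q, d = set(query_terms), set(doc_terms)
    if not q and not d:
        return 0.0
    return len(q & d) / len(q | d)


def dcg(gains, k):
    """Discounted cumulative gain over the first k positions (rank starts at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical data: a query article and three documents, already ranked
# by textual similarity (most similar first).
query_mesh = {"Neoplasms", "Genomics"}
ranked_doc_mesh = [
    {"Neoplasms", "Genomics"},
    {"Genomics", "Proteomics"},
    {"Cardiology"},
]

gains = [mesh_overlap(query_mesh, terms) for terms in ranked_doc_mesh]
score = ndcg_at_k(gains, k=3)  # k is an arbitrary cutoff chosen for illustration
```

In a setup like the one described, such a score would be computed per query for each text level (abstract, body, full text), and the resulting per-query scores would then be compared across text levels with a Friedman test.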

Published in International Journal of Information Science and Management

ISSN: 2008-8302 (Print); 2008-8310 (Online)
Publisher: Regional Information Center for Science and Technology (RICeST)
Country of publisher: Iran, Islamic Republic of
LCC subjects: Bibliography. Library science. Information resources: Information resources (General); Social Sciences: Transportation and communications
Website: https://ijism.ricest.ac.ir/index.php/ijism
