Journal of Big Data (Jul 2017)

Improved sqrt-cosine similarity measurement

  • Sahar Sohangir,
  • Dingding Wang

DOI
https://doi.org/10.1186/s40537-017-0083-6
Journal volume & issue
Vol. 4, no. 1
pp. 1 – 13

Abstract

Text similarity measurement aims to find the commonality existing among text documents, which is fundamental to most information extraction, information retrieval, and text mining problems. Cosine similarity based on Euclidean distance is currently one of the most widely used similarity measurements. However, Euclidean distance is generally not an effective metric for dealing with probabilities, which are often used in text analytics. In this paper, we propose a new similarity measure based on sqrt-cosine similarity. We apply the proposed improved sqrt-cosine similarity to a variety of document-understanding tasks, such as text classification, clustering, and query search. Comprehensive experiments are then conducted to evaluate our new similarity measurement in comparison to existing methods. These experimental results show that our proposed method is indeed effective.
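
The abstract does not spell out the formula, so the following is only a minimal sketch under a stated assumption: that the improved sqrt-cosine similarity of two non-negative vectors x and y is the sum of sqrt(x_i * y_i) divided by sqrt(sum of x_i) times sqrt(sum of y_i). The function names, the error handling, and the example vectors are illustrative and are not taken from the paper. Standard cosine similarity is included for comparison.

```python
import math


def cosine_similarity(x, y):
    """Standard cosine similarity: dot product over the product of L2 norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def improved_sqrt_cosine(x, y):
    """Assumed form of the improved sqrt-cosine similarity (not stated in the abstract).

    Numerator: sum of sqrt(x_i * y_i).
    Denominator: sqrt(sum of x_i) * sqrt(sum of y_i).
    Inputs must be non-negative, e.g. term frequencies or probabilities.
    """
    if any(a < 0 for a in x) or any(b < 0 for b in y):
        raise ValueError("inputs must be non-negative")
    sum_x, sum_y = sum(x), sum(y)
    if sum_x == 0 or sum_y == 0:
        raise ValueError("similarity is undefined for a zero vector")
    numerator = sum(math.sqrt(a * b) for a, b in zip(x, y))
    return numerator / (math.sqrt(sum_x) * math.sqrt(sum_y))


# Hypothetical term-frequency vectors over a shared four-word vocabulary.
doc_a = [3, 0, 1, 2]
doc_b = [1, 1, 0, 4]

print(cosine_similarity(doc_a, doc_b))
print(improved_sqrt_cosine(doc_a, doc_b))
```

Under this assumed definition, a non-zero vector compared with itself scores 1, and two vectors with no shared non-zero terms score 0, which matches the behaviour of cosine similarity at those two extremes.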

Keywords