Journal of Applied Computer Science & Mathematics (Jan 2012)

Automatic Clustering of e-Commerce Product Description

  • Haytham SALEEM Al-SARRAYRIH,
  • Lars KNIPPING,
  • Carmen PETCU

Journal volume & issue
Vol. 6, no. 13
pp. 48 – 60

Abstract

Storing large amounts of digital information is a challenge; searching for a particular object within a tremendous amount of data is like looking for a needle in a haystack. The increase in the size and diversity of stored data makes retrieving the needed information more and more difficult. This research describes the use of clustering techniques and mathematical models in information retrieval for text documents. In this study, traditional clustering and clustering extended by latent semantic analysis (LSA) are compared by applying them to a preprocessed text corpus, using the weighted centroid clustering algorithm and cosine similarity to measure the correlation between documents. LSA is assumed to improve clustering by bringing related words closer together in a conceptual space. It is deduced that the clustering depends on the document representation and the similarity measure used. When dealing with short documents, LSA does not yield improved results compared to the traditional clustering techniques. Recall is nevertheless higher because more related documents are returned, but the results are less accurate than with traditional techniques.
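To make the similarity measure concrete, the following is a minimal sketch of cosine similarity between bag-of-words vectors for short product descriptions. It is an illustration only, not the authors' implementation: the whitespace tokenizer, the raw term counts, the function names `bag_of_words` and `cosine_similarity`, and the three sample descriptions are all assumptions, and the paper's preprocessing, term weighting, and weighted centroid algorithm are not reproduced here.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Return raw term counts for a description (assumed: lowercase, whitespace split)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty description is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical short product descriptions, invented for illustration.
shirt_red = bag_of_words("red cotton shirt")
shirt_blue = bag_of_words("blue cotton shirt")
cable = bag_of_words("usb cable")

sim_related = cosine_similarity(shirt_red, shirt_blue)   # two of three terms shared
sim_unrelated = cosine_similarity(shirt_red, cable)      # no terms shared
```

With documents this short, two descriptions of the same kind of product that happen to use different words share no terms and score zero under this measure. That gap is what a conceptual-space method such as LSA is assumed to narrow.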

Keywords