International Journal of Networked and Distributed Computing (IJNDC) (Jan 2016)

Topic Model based Approach for Improved Indexing in Content based Document Retrieval

  • Moon Soo Cha,
  • So Yeon Kim,
  • Jae Hee Ha,
  • Min-June Lee,
  • Young-June Choi,
  • Kyung-Ah Sohn

DOI
https://doi.org/10.2991/ijndc.2016.4.1.6
Journal volume & issue
Vol. 4, no. 1

Abstract

Information retrieval systems play an essential role in web services. However, web services that let users upload files as attachments typically offer limited search conditions and often rely only on the title or description that users provide at upload time. We present a topic-model-based framework for fast and effective Content Based Document Information Retrieval that retrieves information from the actual contents of the attachment. The proposed system is analyzed and compared with conventional methods in several respects. In particular, we propose an efficient keyword extraction method based on Latent Dirichlet Allocation (LDA) and compare it with the Term Frequency-Inverse Document Frequency (TF-IDF) approach typically used in conventional systems. We also propose a per-category indexing structure and compare it with the existing total indexing scheme. Our experimental results validate the utility of the proposed system for web services that support document attachments.
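
The difference between a total index and a per-category index can be illustrated with a minimal sketch. The abstract does not describe the data structures actually used, so everything below is an assumption for illustration: the function names, the sample documents, and the premise that each document already carries one category label and a list of extracted keywords.

```python
from collections import defaultdict


def build_total_index(docs):
    """Map each keyword to the ids of all documents that contain it."""
    index = defaultdict(set)
    for doc_id, (_category, keywords) in docs.items():
        for kw in keywords:
            index[kw].add(doc_id)
    return index


def build_category_index(docs):
    """Map each category to its own keyword -> document-id index."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, (category, keywords) in docs.items():
        for kw in keywords:
            index[category][kw].add(doc_id)
    return index


# Hypothetical documents: id -> (category, extracted keywords).
docs = {
    "d1": ("networking", ["routing", "latency"]),
    "d2": ("networking", ["latency", "throughput"]),
    "d3": ("biology", ["protein", "latency"]),
}

total_index = build_total_index(docs)
category_index = build_category_index(docs)

# A total-index lookup scans postings from every category.
print(sorted(total_index["latency"]))

# A per-category lookup only touches postings within one category.
print(sorted(category_index["networking"]["latency"]))
```

In this sketch a query restricted to one category reads a smaller postings set than the same query against the total index, which is one plausible reason a per-category structure could speed up retrieval.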

Keywords