Journal of Intelligent Systems (Dec 2018)

Enriching Documents by Linking Salient Entities and Lexical-Semantic Expansion

  • Pourvali Mohsen
  • Orlando Salvatore

DOI
https://doi.org/10.1515/jisys-2018-0098
Journal volume & issue
Vol. 29, no. 1
pp. 1109 – 1121

Abstract

This paper explores a multi-strategy technique for enriching text documents in order to improve clustering quality. We combine entity linking and document summarization to identify the most salient entities mentioned in a text. To enrich documents effectively without introducing noise, we restrict enrichment to the text fragments that mention these salient entities, which must in turn belong to a knowledge base such as Wikipedia; the enrichment of those fragments is carried out using WordNet. To feed clustering algorithms, we investigate different document representations obtained from several combinations of document enrichment and feature extraction. This allows us to exploit ensemble clustering, combining multiple clustering results obtained from the different document representations. Our experiments indicate that our novel enrichment strategies, combined with ensemble clustering, can improve the quality of classical text clustering when applied to text corpora such as the British Broadcasting Corporation (BBC) News dataset.
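
The ensemble step described above, combining clusterings obtained from different document representations, can be illustrated with a minimal sketch. The abstract does not say which consensus function the authors use, so the co-association (evidence accumulation) scheme below is an assumption chosen for illustration, and the function names `co_association` and `consensus_clusters`, the 0.5 threshold, and the toy label lists are all hypothetical. Each base clustering is a list of cluster labels, one per document; two documents end up in the same consensus cluster when more than half of the base clusterings agree on grouping them.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of base clusterings that put each pair of documents together."""
    n = len(labelings[0])
    m = len(labelings)
    counts = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    agree = [[c / m for c in row] for row in counts]
    for i in range(n):
        agree[i][i] = 1.0  # a document always co-occurs with itself
    return agree


def consensus_clusters(labelings, threshold=0.5):
    """Merge documents whose co-association exceeds the threshold (union-find)."""
    n = len(labelings[0])
    agree = co_association(labelings)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if agree[i][j] > threshold:
            parent[find(j)] = find(i)

    # Relabel roots as consecutive cluster ids in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Hypothetical labelings of six documents under three representations
# (e.g. plain text, WordNet-enriched text, entity-based features).
base = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
]
print(consensus_clusters(base))  # [0, 0, 0, 1, 1, 1]
```

In this toy run the second representation disagrees about the third document, but the other two outvote it, which is the intuition behind combining several representations rather than relying on a single one.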

Keywords