Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki (Jan 2014)
DOCUMENT REPRESENTATION FOR CLUSTERING OF SCIENTIFIC ABSTRACTS
Abstract
This paper addresses the clustering of narrow-domain short texts, such as scientific abstracts. The work builds on observations made while improving the performance of a key phrase extraction algorithm. For that task, an extended stop-word list was built automatically, and it considerably improved the quality of the phrases extracted from scientific publications. The procedure for creating the stop-word list is described. The main objective is to investigate whether the performance and/or speed of clustering can be increased by using this stop-word list, as well as information about the parts of speech of lexemes. In the latter case, the vocabulary used for document representation does not contain every word that occurs in the collection, but only the nouns and adjectives, or their sequences, encountered in the documents. Two baseline clustering algorithms are applied: k-means and hierarchical clustering (average-linkage agglomerative method). The results show that using the extended stop-word list and the adjective-noun document representation improves both the performance and the speed of k-means clustering. For the average-linkage agglomerative method, the same setup may lower clustering quality. Using adjective-noun sequences for document representation lowers clustering quality for both algorithms and is justified only when a considerable reduction of feature space dimensionality is required.
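The effect of the two vocabulary restrictions on feature space size can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy documents, the hand-written `EXTENDED_STOP_WORDS` set (the paper builds its list automatically), and the small `POS_LEXICON` dictionary, which stands in for a real part-of-speech tagger. Adjective-noun sequences and the clustering step itself are not covered.

```python
from collections import Counter

# Toy documents (hypothetical, not taken from the paper's collection).
DOCS = [
    "we propose a fast clustering algorithm for short texts",
    "the results show stable clustering of short abstracts",
    "a new optical sensor measures temperature accurately",
]

# Hypothetical extended stop-word list: function words plus
# general scientific vocabulary that carries little topical signal.
EXTENDED_STOP_WORDS = {"a", "the", "of", "for", "we",
                       "propose", "results", "show"}

# Hypothetical part-of-speech lexicon standing in for a real tagger.
POS_LEXICON = {
    "clustering": "NOUN", "algorithm": "NOUN", "texts": "NOUN",
    "abstracts": "NOUN", "sensor": "NOUN", "temperature": "NOUN",
    "results": "NOUN",
    "fast": "ADJ", "short": "ADJ", "stable": "ADJ",
    "new": "ADJ", "optical": "ADJ",
}


def tokenize(text):
    """Lowercase and split on whitespace (enough for the toy data)."""
    return text.lower().split()


def build_vocabulary(docs, keep):
    """Return the sorted list of distinct tokens accepted by `keep`."""
    return sorted({tok for doc in docs for tok in tokenize(doc) if keep(tok)})


def vectorize(doc, vocabulary):
    """Represent a document as raw term counts over the vocabulary."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]


def not_stop(tok):
    return tok not in EXTENDED_STOP_WORDS


def noun_or_adjective(tok):
    # Stop words are removed first, then only nouns and adjectives are kept.
    return not_stop(tok) and POS_LEXICON.get(tok) in {"NOUN", "ADJ"}


full_vocab = build_vocabulary(DOCS, lambda tok: True)
stop_vocab = build_vocabulary(DOCS, not_stop)
noun_adj_vocab = build_vocabulary(DOCS, noun_or_adjective)

# Each restriction shrinks the feature space the clusterer has to handle.
print(len(full_vocab), len(stop_vocab), len(noun_adj_vocab))
print(vectorize(DOCS[0], noun_adj_vocab))
```

The vectors produced by `vectorize` could then be passed to any k-means or agglomerative clustering implementation; whether the smaller vocabulary helps or hurts depends on the algorithm, as the abstract reports.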