Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki (Jan 2017)

METHOD OF RARE TERM CONTRASTIVE EXTRACTION FROM NATURAL LANGUAGE TEXTS

  • I. A. Bessmertny,
  • A. B. Nugumanova,
  • M. Y. Mansurova,
  • Y. M. Baiburin

DOI
https://doi.org/10.17586/2226-1494-2017-17-1-81-91
Journal volume & issue
Vol. 17, no. 1
pp. 81 – 91

Abstract

Read online

The paper considers a problem of automatic domain term extraction from documents corpus by means of a contrast collection. Existing contrastive methods successfully extract often used terms but mishandle rare terms. This could yield poorness of the resulting thesaurus. Assessment of point-wise mutual information is one of the known statistical methods of term extraction and it finds rare terms successfully. Although, it extracts many false terms at that. The proposed approach consists of point-wise mutual information application for rare terms extraction and filtering of candidates by criterion of joint occurrence with the other candidates. We build “documents-by-terms” matrix that is subjected to singular value decomposition to eliminate noise and reveal strong interconnections. Then we pass on to the resulting matrix “terms-by-terms” that reproduces strength of interconnections between words. This approach was approved on a documents collection from “Geology” domain with the use of contrast documents from such topics as “Politics”, “Culture”, “Economics” and “Accidents” on some Internet resources. The experimental results demonstrate operability of this method for rare terms extraction.

Keywords