Automatic term acquisition from domain-specific text collection by using Wikipedia

N. Astrakhantsev

doi:10.15514/ISPRAS-2014-26(4)-1

Труды Института системного программирования РАН (Oct 2018)

Affiliations

N. Astrakhantsev: ISP RAS (Institute for System Programming of the Russian Academy of Sciences)

DOI: https://doi.org/10.15514/ISPRAS-2014-26(4)-1
Journal volume & issue: Vol. 26, no. 4
pp. 7 – 20

Abstract

Automatic term acquisition is an important task for many applications that process domain-specific texts. Many methods for automatic term acquisition exist, but they depend heavily on the language and domain of the input text collection. In addition, these methods generally use only the domain-specific text collection, while many external resources remain underutilized. We argue that one of the most promising external resources for automatic term acquisition is the online encyclopedia Wikipedia. In this paper we propose two new features: "Hyperlink probability", a normalized frequency showing how often a candidate term is a hyperlink in Wikipedia articles; and "Semantic relatedness to the domain key concepts", the arithmetic mean of the candidate's semantic relatedness to the key concepts of a given domain, where those key concepts are determined automatically from the input domain-specific text collection. In addition, we propose a new method for automatic term acquisition. It is based on a semi-supervised machine learning algorithm but does not require labeled data. In outline, the method extracts the best 100-300 candidates present in Wikipedia by using a special method for term acquisition, and then uses these candidates as positive examples to train a classifier based on a positive-unlabeled learning algorithm. An experimental evaluation conducted on four domains (board games, biomedicine, computer science, agriculture) shows that the proposed method significantly outperforms existing ones and is domain-independent: its average precision is 5-17% higher than that of the best method for a particular data set.
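
The two features named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the normalization of "Hyperlink probability" by the total number of occurrences, the function names, and the Jaccard-over-link-sets relatedness measure, which is only a stand-in for whatever semantic relatedness measure the paper uses. The toy link data is hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, Set


def hyperlink_probability(link_count: int, total_count: int) -> float:
    """Share of a phrase's occurrences in Wikipedia articles that are hyperlinks.

    Assumption: the frequency is normalized by the total number of
    occurrences of the phrase; the abstract does not state the denominator.
    """
    if total_count <= 0:
        return 0.0
    return link_count / total_count


def relatedness_to_key_concepts(
    candidate: str,
    key_concepts: Iterable[str],
    relatedness: Callable[[str, str], float],
) -> float:
    """Arithmetic mean of the candidate's relatedness to each key concept."""
    concepts = list(key_concepts)
    if not concepts:
        return 0.0
    return mean(relatedness(candidate, concept) for concept in concepts)


def make_jaccard_relatedness(links: Dict[str, Set[int]]) -> Callable[[str, str], float]:
    """Build a toy relatedness function (hypothetical stand-in measure).

    Each concept maps to the set of article ids that link to it; relatedness
    is the Jaccard overlap of the two sets.
    """
    def relatedness(a: str, b: str) -> float:
        set_a, set_b = links.get(a, set()), links.get(b, set())
        union = set_a | set_b
        return len(set_a & set_b) / len(union) if union else 0.0
    return relatedness


# Hypothetical toy data for a board-games domain.
toy_links = {"chess": {1, 2, 3}, "board game": {2, 3, 4}, "pawn": {1, 2}}
toy_relatedness = make_jaccard_relatedness(toy_links)

pawn_link_probability = hyperlink_probability(link_count=30, total_count=120)
pawn_domain_relatedness = relatedness_to_key_concepts(
    "pawn", ["chess", "board game"], toy_relatedness
)
```

In a pipeline like the one the abstract outlines, such feature values would be computed for every candidate term and passed to the positive-unlabeled classifier; that step is not sketched here because the abstract does not specify the algorithm.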

Published in Труды Института системного программирования РАН

ISSN: 2079-8156 (Print); 2220-6426 (Online)
Publisher: Ivannikov Institute for System Programming of the Russian Academy of Sciences
Country of publisher: Russian Federation
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://ispranproceedings.elpub.ru/jour/index
