Similar Text Fragments Extraction for Identifying Common Wikipedia Communities

Svitlana Petrasova; Nina Khairova; Włodzimierz Lewoniewski; Orken Mamyrbayev; Kuralay Mukhsina

doi:10.3390/data3040066

Data (Dec 2018)

Affiliations

Svitlana Petrasova: Department of Intelligent Computer Systems, National Technical University “Kharkiv Polytechnic Institute”, 61002 Kharkiv, Ukraine
Nina Khairova: Department of Intelligent Computer Systems, National Technical University “Kharkiv Polytechnic Institute”, 61002 Kharkiv, Ukraine
Włodzimierz Lewoniewski: Department of Information Systems, Poznan University of Economics and Business, 61-875 Poznan, Poland
Orken Mamyrbayev: Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan
Kuralay Mukhsina: Department of Informatics, Al-Farabi Kazakh National University, Almaty 050040, Kazakhstan

DOI: https://doi.org/10.3390/data3040066
Journal volume & issue: Vol. 3, no. 4, p. 66

Abstract

Extracting similar text fragments from weakly formalized data is a task in natural language processing and intelligent data analysis, and it is used to solve the problem of automatically identifying connected knowledge fields. To search for such common communities in Wikipedia, we propose using a logical-algebraic model for extracting similar collocations as an additional stage. With the Stanford Part-Of-Speech tagger and the Stanford Universal Dependencies parser, we identify the grammatical characteristics of collocation words. With WordNet synsets, we choose their synonyms. Our dataset includes Wikipedia articles from different portals and projects. The experimental results show the frequencies of synonymous text fragments in Wikipedia articles that form common information spaces. The number of high-frequency synonymous collocations can indicate the key common up-to-date Wikipedia communities.
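
The idea of counting synonymous collocations across articles can be illustrated with a minimal Python sketch. It is not the authors' logical-algebraic model: the collocations are given as hand-prepared (adjective, noun) pairs rather than produced by the Stanford tagger and parser, and a small hypothetical table `TOY_SYNSETS` stands in for WordNet synsets. All function names and sample data are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy synonym sets standing in for WordNet synsets.
TOY_SYNSETS = [
    {"big", "large"},
    {"city", "town"},
    {"quick", "fast"},
]


def canonical(word):
    """Map a word to a fixed representative of its toy synonym set."""
    w = word.lower()
    for synset in TOY_SYNSETS:
        if w in synset:
            return min(synset)  # alphabetically first member as the key
    return w


def collocation_key(pair):
    """Normalise a collocation so that synonymous pairs share one key."""
    return tuple(canonical(w) for w in pair)


def common_collocations(articles):
    """Count normalised collocations that occur in at least two articles."""
    counts = Counter()
    for pairs in articles.values():
        # Each collocation is counted at most once per article.
        counts.update({collocation_key(p) for p in pairs})
    return {key: n for key, n in counts.items() if n >= 2}


# Illustrative, made-up input: article title -> extracted collocations.
articles = {
    "Article A": [("big", "city"), ("quick", "train")],
    "Article B": [("large", "town"), ("old", "bridge")],
    "Article C": [("fast", "train")],
}

shared = common_collocations(articles)
print(shared)
```

In this toy input, "big city" and "large town" collapse to one key, as do "quick train" and "fast train", so each is reported as shared by two articles. A real pipeline would replace the hand-made pairs and synonym table with tagger, parser, and WordNet output.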

Published in Data

ISSN: 2306-5729 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Bibliography. Library science. Information resources
Website: http://www.mdpi.com/journal/data
