Features for Forming Text Corpus of Kazakhstan Electronic News

Ulzhan Ospanova; Mukhit Baimakhanbetov; Inessa Akoyeva; Timur Buldybayev; Miraim Atanayeva

doi:10.25559/SITITO.16.202001.90-98

Современные информационные технологии и IT-образование (May 2020)

Affiliations

Ulzhan Ospanova: "Information-Analytical Center", JSC
Mukhit Baimakhanbetov: "Information-Analytical Center", JSC
Inessa Akoyeva: "Information-Analytical Center", JSC
Timur Buldybayev: "Information-Analytical Center", JSC
Miraim Atanayeva: "Information-Analytical Center", JSC

DOI: https://doi.org/10.25559/SITITO.16.202001.90-98
Journal volume & issue: Vol. 16, no. 1
pp. 90 – 98

Abstract

The culture of online news consumption continues to take shape and gain popularity, and its readership is growing. At the same time, the number of people who fall under the negative influence of false news is also growing. Researchers therefore face the task of analyzing mass media. Areas of news content analysis include topic modelling, fake news recognition, and sentiment analysis. Research in these areas, however, requires a labelled corpus. This paper presents the methodological foundations of corpus formation. It describes the data collection process and the selection of sources for the corpus. It also describes the theoretical foundations of representativeness and balance and explains how the corpus complies with these requirements. In the course of this work, the authors compiled a corpus of 1.9 million news texts from 22 news sources. They marked up the corpus and analyzed its thematic structure using the LDA model. The resulting corpus will allow testing of machine learning algorithms aimed at recognizing individual informative features and identifying patterns present in the array of news publications. The corpus will also be useful to machine learning and NLP researchers who wish to test algorithms for their own goals.

Published in Современные информационные технологии и IT-образование

ISSN: 2411-1473 (Print)
Publisher: The Fund for Promotion of Internet media, IT education, human development «League Internet Media»
Country of publisher: Russian Federation
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://sitito.cs.msu.ru
