Semantic Non-Negative Matrix Factorization for Term Extraction

Aliya Nugumanova; Almas Alzhanov; Aiganym Mansurova; Kamilla Rakhymbek; Yerzhan Baiburin

doi:10.3390/bdcc8070072

Big Data and Cognitive Computing (Jun 2024)

Affiliations

Aliya Nugumanova: Big Data and Blockchain Technologies Research Innovation Center, Astana IT University, Astana 010000, Kazakhstan
Almas Alzhanov: Big Data and Blockchain Technologies Research Innovation Center, Astana IT University, Astana 010000, Kazakhstan
Aiganym Mansurova: Big Data and Blockchain Technologies Research Innovation Center, Astana IT University, Astana 010000, Kazakhstan
Kamilla Rakhymbek: Laboratory of Digital Technologies and Modeling, Sarsen Amanzholov East Kazakhstan University, Ust-Kamenogorsk 070000, Kazakhstan
Yerzhan Baiburin: Laboratory of Digital Technologies and Modeling, Sarsen Amanzholov East Kazakhstan University, Ust-Kamenogorsk 070000, Kazakhstan

DOI: https://doi.org/10.3390/bdcc8070072
Journal volume & issue: Vol. 8, no. 7, p. 72

Abstract

This study introduces an unsupervised term extraction approach that combines non-negative matrix factorization (NMF) with word embeddings. Inspired by a pioneering semantic NMF method that employs regularization to jointly optimize document–word and word–word matrix factorizations for document clustering, we adapt this strategy for term extraction. Typically, a word–word matrix representing semantic relationships between words is constructed using cosine similarities between word embeddings. However, it has been established that transformer encoder embeddings tend to reside within a narrow cone, leading to consistently high cosine similarities between words. To address this issue, we replace the conventional word–word matrix with a word–seed submatrix, restricting columns to ‘domain seeds’—specific words that encapsulate the essential semantic features of the domain. Therefore, we propose a modified NMF framework that jointly factorizes the document–word and word–seed matrices, producing more precise encoding vectors for words, which we utilize to extract high-relevancy topic-related terms. Our modification significantly improves term extraction effectiveness, marking the first implementation of semantically enhanced NMF, designed specifically for the task of term extraction. Comparative experiments demonstrate that our method outperforms both traditional NMF and advanced transformer-based methods such as KeyBERT and BERTopic. To support further research and application, we compile and manually annotate two new datasets, each containing 1000 sentences, from the ‘Geography and History’ and ‘National Heroes’ domains. These datasets are useful for both term extraction and document classification tasks. All related code and datasets are freely available.
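The joint factorization described in the abstract can be illustrated with a minimal sketch. The abstract does not give the objective, the regularization weight, or the update rules, so everything below is an assumption made for illustration: a squared-error objective ||X - WH||^2 + lam * ||S - H^T Q||^2 with standard multiplicative updates, random toy data in place of a real corpus and real embeddings, and hypothetical names (joint_nmf, lam, Q, seed_idx). It is not the authors' implementation.

```python
import numpy as np

def joint_nmf(X, S, k, lam=1.0, n_iter=200, eps=1e-9, seed=0):
    """Jointly factorize X (docs x words) ~ W @ H and S (words x seeds) ~ H.T @ Q.

    Assumed objective: ||X - W H||_F^2 + lam * ||S - H^T Q||_F^2,
    minimized with Lee-Seung style multiplicative updates.
    Columns of H serve as the word encoding vectors.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape
    n_seeds = S.shape[1]
    W = rng.random((n_docs, k))   # document-topic factors
    H = rng.random((k, n_words))  # topic-word factors (shared by both terms)
    Q = rng.random((k, n_seeds))  # topic-seed factors
    losses = []
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ X + lam * Q @ S.T) / (W.T @ W @ H + lam * Q @ Q.T @ H + eps)
        Q *= (H @ S) / (H @ H.T @ Q + eps)
        loss = np.linalg.norm(X - W @ H) ** 2 + lam * np.linalg.norm(S - H.T @ Q) ** 2
        losses.append(loss)
    return W, H, Q, losses

# Toy data: 6 documents, 10 words, 8-dimensional random "embeddings".
rng = np.random.default_rng(42)
X = rng.random((6, 10))                          # stand-in document-word matrix
emb = rng.normal(size=(10, 8))                   # stand-in word embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
seed_idx = [0, 3, 7]                             # hypothetical domain-seed words
# Word-seed cosine similarities, clipped at zero to keep the matrix non-negative.
S = np.clip(emb @ emb[seed_idx].T, 0.0, None)

W, H, Q, losses = joint_nmf(X, S, k=3)

# Rank words within each topic by their weight in H; top-ranked words are term candidates.
top_words = np.argsort(H, axis=1)[:, ::-1][:, :3]
print(top_words)
```

Restricting the columns of S to a few seed words, rather than using the full word-word similarity matrix, is the design choice the abstract motivates: when embeddings sit in a narrow cone, similarities to arbitrary words carry little signal, whereas similarities to domain seeds are intended to be more discriminative.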

Published in Big Data and Cognitive Computing

ISSN: 2504-2289 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology
Website: http://www.mdpi.com/journal/BDCC
