Applied Sciences (Aug 2025)

Information Extraction from Multi-Domain Scientific Documents: Methods and Insights

  • Tatiana Batura,
  • Aigerim Yerimbetova,
  • Nurzhan Mukazhanov,
  • Nikita Shvarts,
  • Bakzhan Sakenov,
  • Mussa Turdalyuly

DOI: https://doi.org/10.3390/app15169086
Journal volume & issue: Vol. 15, no. 16, p. 9086

Abstract


The rapid growth of scientific literature necessitates effective information extraction. Existing methods, however, face significant challenges when applied to multi-domain documents and low-resource languages. For Kazakh and Russian in particular, annotated corpora and dedicated tools for scientific information extraction are scarce. To address this gap, we introduce SciMDIX (Scientific Multi-Domain Information Extraction), a novel multi-domain dataset of scientific documents in Russian and Kazakh annotated with entities and relations. Our study includes a comprehensive evaluation of entity recognition performance, comparing state-of-the-art models (BERT, LLaMA, GLiNER, and spaCy) across four diverse domains (IT, Linguistics, Medicine, and Psychology) in both languages. The findings highlight the promise of spaCy and GLiNER for practical deployment in under-resourced language settings. Furthermore, we propose a new zero-shot relation extraction model that leverages a multimodal representation integrating sentence context, entity mentions, and textual definitions of relation classes. The model can predict semantic relations between entities in new documents, even in languages not encountered during training, a capability that is especially valuable in low-resource scenarios.
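To make the definition-based zero-shot setup concrete, the sketch below illustrates the general idea of scoring a (sentence, entity pair) representation against textual definitions of relation classes. It is not the authors' SciMDIX model: the encoder checkpoint, the relation inventory and definitions, and the input format are all illustrative assumptions.

```python
# Minimal sketch of definition-based zero-shot relation classification,
# assuming a multilingual sentence encoder from sentence-transformers.
# Not the SciMDIX architecture; it only shows how relation definitions
# can stand in for labeled training data.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Hypothetical relation classes with natural-language definitions.
relations = {
    "USED_FOR": "the first entity is a method or tool applied to the second",
    "PART_OF": "the first entity is a component of the second",
    "EVALUATED_ON": "the first entity is assessed on the second dataset or task",
}

def classify_relation(sentence: str, head: str, tail: str):
    """Score the entity pair against every relation definition and
    return the best-matching class with its cosine similarity."""
    # Combine sentence context and entity mentions into one query string.
    query = f"{sentence} [SEP] {head} [SEP] {tail}"
    q_emb = encoder.encode(query, convert_to_tensor=True)
    d_emb = encoder.encode(list(relations.values()), convert_to_tensor=True)
    scores = util.cos_sim(q_emb, d_emb)[0]
    best = int(scores.argmax())
    return list(relations)[best], float(scores[best])

label, score = classify_relation(
    "BERT was fine-tuned on the annotated Kazakh corpus.",
    "BERT", "Kazakh corpus",
)
print(label, round(score, 3))
```

Because both the query and the definitions live in the same multilingual embedding space, the same scorer can be applied to sentences in languages absent from training, which is the property the abstract highlights for low-resource settings.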

Keywords