Applied Sciences (Aug 2025)

Information Extraction from Multi-Domain Scientific Documents: Methods and Insights

  • Tatiana Batura,
  • Aigerim Yerimbetova,
  • Nurzhan Mukazhanov,
  • Nikita Shvarts,
  • Bakzhan Sakenov,
  • Mussa Turdalyuly

DOI: https://doi.org/10.3390/app15169086
Journal volume & issue: Vol. 15, no. 16, p. 9086

Abstract


The rapid growth of scientific literature necessitates effective information extraction. Existing methods, however, face significant challenges when applied to multi-domain documents and low-resource languages. For Kazakh and Russian in particular, annotated corpora and dedicated tools for scientific information extraction are scarce. To address this gap, we introduce SciMDIX (Scientific Multi-Domain Information Extraction), a novel multi-domain dataset of scientific documents in Russian and Kazakh annotated with entities and relations. Our study includes a comprehensive evaluation of entity recognition performance, comparing state-of-the-art models (BERT, LLaMA, GLiNER, and spaCy) across four diverse domains (IT, Linguistics, Medicine, and Psychology) in both languages. The findings highlight the promise of spaCy and GLiNER for practical deployment in under-resourced language settings. Furthermore, we propose a new zero-shot relation extraction model that leverages a multimodal representation integrating sentence context, entity mentions, and textual definitions of relation classes. The model can predict semantic relations between entities in new documents, even in languages not encountered during training, a capability that is especially valuable in low-resource scenarios.
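To make the definition-based zero-shot setup concrete, the sketch below illustrates the general idea of scoring a (sentence, entity pair) representation against textual definitions of relation classes. It is not the authors' SciMDIX model: the encoder checkpoint, the relation inventory and definitions, and the input format are all illustrative assumptions.

```python
# Minimal sketch of definition-based zero-shot relation classification,
# assuming a multilingual sentence encoder from sentence-transformers.
# Not the SciMDIX architecture; it only shows how relation definitions
# can stand in for labeled training data.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Hypothetical relation classes with natural-language definitions.
relations = {
    "USED_FOR": "the first entity is a method or tool applied to the second",
    "PART_OF": "the first entity is a component of the second",
    "EVALUATED_ON": "the first entity is assessed on the second dataset or task",
}

def classify_relation(sentence: str, head: str, tail: str):
    """Score the entity pair against every relation definition and
    return the best-matching class with its cosine similarity."""
    # Combine sentence context and entity mentions into one query string.
    query = f"{sentence} [SEP] {head} [SEP] {tail}"
    q_emb = encoder.encode(query, convert_to_tensor=True)
    d_emb = encoder.encode(list(relations.values()), convert_to_tensor=True)
    scores = util.cos_sim(q_emb, d_emb)[0]
    best = int(scores.argmax())
    return list(relations)[best], float(scores[best])

label, score = classify_relation(
    "BERT was fine-tuned on the annotated Kazakh corpus.",
    "BERT", "Kazakh corpus",
)
print(label, round(score, 3))
```

Because both the query and the definitions live in the same multilingual embedding space, the same scorer can be applied to sentences in languages absent from training, which is the property the abstract highlights for low-resource settings.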

Keywords