Proceedings of the XXth Conference of Open Innovations Association FRUCT (Nov 2024)
Automated Speech Act Annotation in a Russian Spoken Corpus Using Large Language Models: A Comparative Study
Abstract
This research addresses the automatic annotation of a linguistic corpus using large language models (LLMs). Annotation is a crucial step in corpus creation, as it determines the practical scope and applications of the resulting resource. The study explores pragmatic-level annotation of oral speech transcripts with speech acts, which reflect the speaker's intent and purpose. This task is typically performed manually by experts, which greatly limits the volume of annotated data that can be produced. In this work, an attempt was made to annotate speech acts automatically using five LLMs commonly used for processing Russian texts: ChatGPT, GigaCHAT, YandexGPT, Mistral, and Gemini. A comparative analysis of the automatic annotation results highlights the strengths and weaknesses of each model. The findings suggest that employing LLMs for corpus annotation is a promising approach, with ChatGPT and Gemini proving particularly effective at speech act categorization. However, for Russian, language-specific models such as GigaCHAT and YandexGPT are preferable when language-specific information is needed.
Keywords