Automated Speech Act Annotation in a Russian Spoken Corpus Using Large Language Models: A Comparative Study

Tatiana Sherstinova; Viktoria Firsanova; Alena Novoseltseva; Mariya Megre; Egor Savchenko

doi:10.5281/zenodo.14166352

Proceedings of the XXth Conference of Open Innovations Association FRUCT (Nov 2024)

Affiliations

Tatiana Sherstinova: HSE University
Viktoria Firsanova: HSE University
Alena Novoseltseva: HSE University
Mariya Megre: HSE University
Egor Savchenko: HSE University

Journal volume & issue: Vol. 36, no. 2
pp. 912 – 920

Abstract

This research addresses the automatic annotation of a linguistic corpus using large language models (LLMs). Annotation is a crucial step in corpus creation, as it determines the practical scope and applications of the resulting resource. The study explores the annotation of spoken-language transcripts at the pragmatic level using speech acts, which reflect the speaker's intent and purpose. This task is typically performed manually by experts, which greatly limits the volume of annotated data that can be produced. In this work, speech acts were annotated automatically using five LLMs commonly used for processing Russian texts: ChatGPT, GigaCHAT, YandexGPT, Mistral, and Gemini. A comparative analysis of the automatic annotation results was conducted, highlighting the strengths and weaknesses of each model. The findings suggest that employing LLMs for corpus annotation is a promising approach, with ChatGPT and Gemini proving particularly effective at speech act categorization. For Russian, however, language-specific models such as GigaCHAT and YandexGPT are preferable when language-specific information is needed.

Keywords: spoken speech; pragmatics; corpus linguistics; speech acts; pragmatic annotation; LLMs

Published in Proceedings of the XXth Conference of Open Innovations Association FRUCT

ISSN: 2305-7254 (Print); 2343-0737 (Online)
Publisher: FRUCT
Country of publisher: Finland
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Telecommunication
Website: http://fruct.org/publication
