Scientific Reports (Jun 2024)

Multilingual end-to-end ASR for low-resource Turkic languages with common alphabets

  • Akbayan Bekarystankyzy,
  • Orken Mamyrbayev,
  • Mateus Mendes,
  • Anar Fazylzhanova,
  • Muhammad Assam

DOI
https://doi.org/10.1038/s41598-024-64848-1
Journal volume & issue
Vol. 14, no. 1
pp. 1–10

Abstract

To obtain a reliable and accurate automatic speech recognition (ASR) machine learning model, a sufficient amount of transcribed audio data is necessary for training. Many languages in the world, especially the agglutinative languages of the Turkic family, lack this type of data. Many studies have pursued better models for low-resource languages using different approaches, the most popular being multilingual training and transfer learning. In this study, we combined five agglutinative languages of the Turkic family (Kazakh, Bashkir, Kyrgyz, Sakha, and Tatar) for multilingual training using connectionist temporal classification and an attention mechanism, together with a language model, because these languages share cognate words, sentence-formation rules, and a common alphabet (Cyrillic). Data from the open-source Common Voice database was used to make the experiments reproducible. The results show that multilingual training improved ASR performance for all languages in the experiment except Bashkir. A dramatic improvement was achieved for Kyrgyz: the word error rate decreased to nearly one-fifth of its original value and the character error rate to one-fourth, which demonstrates that this approach can help critically low-resource languages.
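The reported gains are expressed as word error rate (WER) and character error rate (CER). As a reference for readers, below is a minimal sketch of how these metrics are conventionally computed from the Levenshtein edit distance; the implementation and example strings are illustrative and not taken from the paper.

```python
# Illustrative WER/CER computation based on Levenshtein edit distance.
# WER counts word-level edits; CER counts character-level edits.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion from reference
                cur[j - 1] + 1,          # insertion into hypothesis
                prev[j - 1] + (r != h),  # substitution (0 if tokens match)
            ))
        prev = cur
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: char-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

A WER falling to one-fifth thus means, for example, a drop from 50% to roughly 10% word-level edits per reference word.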

Keywords