Pilar Nusa Mandiri (Mar 2024)
ANALYSIS OF WHISPER AUTOMATIC SPEECH RECOGNITION PERFORMANCE ON LOW RESOURCE LANGUAGE
Abstract
Automatic speech recognition (ASR) technology can make everyday tasks more convenient for its users. However, current ASR models recognize speech accurately mainly in high-resource languages such as English. Previous research has applied ASR to a few regional languages, such as Javanese, Sundanese, Balinese, and Bataknese. This research aims to improve the speech recognition performance of an ASR model on a low-resource language. Javanese is used specifically because a high-quality Javanese speech dataset is available from previous research. The method is to fine-tune the Whisper model, which was pretrained on 680,000 hours of multilingual speech data, on the Javanese speech dataset. To reduce computational requirements, parameter-efficient fine-tuning (PEFT) is applied in the fine-tuning process. With PEFT, fewer than 1% of the model's parameters are trainable, which reduces the computation required for fine-tuning. The best word error rate (WER) is 13.77%, achieved by the fine-tuned Whisper large-v2 model, compared with 89.40% for the original Whisper large-v2 model without fine-tuning. This improvement in WER shows that fine-tuning effectively improves the Whisper model's recognition of speech in low-resource languages such as Javanese, at minimal computational cost for fine-tuning a large model.
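For readers unfamiliar with the metric, the WER figures above can be read as the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in this research; the function name word_error_rate and the example strings are hypothetical, and no text normalization (casing, punctuation) is applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting, which is
    an assumption and may differ from the paper's evaluation setup.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word and one dropped word
# against a four-word reference.
example_wer = word_error_rate("a b c d", "a x c")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference; a value such as 89.40% therefore means that nearly as many word edits as reference words are needed to correct the output.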
Keywords