Real-time multilingual speech recognition and speaker diarization system based on Whisper segmentation

Ke-Ming Lyu; Ren-yuan Lyu; Hsien-Tsung Chang

doi:10.7717/peerj-cs.1973

PeerJ Computer Science (Mar 2024)

Affiliations

Ke-Ming Lyu: Computer Science and Information Engineering, Chang Gung University, Taoyuan, Taiwan
Ren-yuan Lyu: Computer Science and Information Engineering, Chang Gung University, Taoyuan, Taiwan
Hsien-Tsung Chang: Computer Science and Information Engineering, Chang Gung University, Taoyuan, Taiwan

DOI: https://doi.org/10.7717/peerj-cs.1973
Journal volume & issue: Vol. 10, p. e1973

Abstract

This research presents a real-time multilingual speech recognition and speaker diarization system built on OpenAI’s Whisper model. The system addresses the challenges of automatic speech recognition (ASR) and speaker diarization (SD) in dynamic, multispeaker environments, with a focus on accurately processing Mandarin speech with Taiwanese accents and handling frequent speaker switches. Traditional speech recognition systems often fall short in such multilingual, multispeaker contexts, particularly in SD. This study therefore integrates speech recognition with speaker diarization techniques optimized for real-time use, including efficient handling of model outputs and the incorporation of speaker embeddings. The system was evaluated on data from Taiwanese talk shows and political commentary programs featuring 46 diverse speakers. It achieved a word diarization error rate (WDER) of 2.68% in two-speaker scenarios and 11.65% in three-speaker scenarios, with an overall WDER of 6.96%. This performance is comparable to that of non-real-time baseline models, indicating that the system adapts to a range of complex conversational dynamics in real-time multilingual speech processing.
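
The WDER figures above can be read as the share of recognized words that carry the wrong speaker label. The following is a minimal sketch of that calculation, not the authors' evaluation code. It assumes the commonly used definition of WDER (words with an incorrect speaker label, counted over correctly recognized and substituted words only) and assumes the reference and hypothesis transcripts have already been aligned word by word. The `AlignedWord` type, the `wder` function, and the sample data are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedWord:
    """One aligned reference/hypothesis word pair (hypothetical type).

    Only correct and substituted words appear here; insertions and
    deletions have no counterpart to compare speaker labels against.
    """
    ref_word: str
    hyp_word: str
    ref_speaker: str
    hyp_speaker: str


def wder(aligned: list[AlignedWord]) -> float:
    """Return the fraction of aligned words whose speaker label is wrong."""
    if not aligned:
        raise ValueError("need at least one aligned word")
    wrong = sum(1 for w in aligned if w.ref_speaker != w.hyp_speaker)
    return wrong / len(aligned)


# Illustrative data only: four aligned words, one with the wrong speaker.
example = [
    AlignedWord("good", "good", "A", "A"),
    AlignedWord("evening", "evening", "A", "A"),
    AlignedWord("thank", "thank", "B", "A"),  # speaker switch detected one word late
    AlignedWord("you", "you", "B", "B"),
]

print(f"WDER = {wder(example):.2%}")  # prints "WDER = 25.00%"
```

Scoring at the word level, rather than by time, means an ASR substitution does not by itself count as a diarization error: a misrecognized word attributed to the right speaker still scores as correct here.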

Published in PeerJ Computer Science

ISSN: 2376-5992 (Online)
Publisher: PeerJ Inc.
Country of publisher: United States
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://peerj.com/computer-science/
