KMSAV: Korean multi-speaker spontaneous audiovisual dataset

Kiyoung Park; Changhan Oh; Sunghee Dong

doi:10.4218/etrij.2023-0352

ETRI Journal (Feb 2024)

DOI: https://doi.org/10.4218/etrij.2023-0352
Journal volume & issue: Vol. 46, no. 1
pp. 71 – 81

Abstract

Recent advances in deep learning for speech and visual recognition have accelerated the development of multimodal speech recognition, yielding many innovative results. We introduce a Korean audiovisual speech recognition corpus. This dataset comprises approximately 150 h of manually transcribed and annotated audiovisual data, supplemented with an additional 2000 h of untranscribed videos collected from YouTube under the Creative Commons License. The dataset is intended to be freely accessible for unrestricted research purposes. Along with the corpus, we propose an open-source framework for automatic speech recognition (ASR) and audiovisual speech recognition (AVSR). We validate the effectiveness of the corpus with evaluations using state-of-the-art ASR and AVSR techniques, capitalizing on both pretrained models and fine-tuning processes. After fine-tuning, ASR and AVSR achieve character error rates of 11.1% and 18.9%, respectively. This error difference highlights the need for improvement in AVSR techniques. We expect that our corpus will be an instrumental resource to support improvements in AVSR.
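
For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how a character error rate (CER) is commonly computed: the total character-level edit distance between reference and hypothesis transcripts, divided by the total number of reference characters. It is not the authors' scoring script. The function names, the choice to ignore spaces, and the example sentences are all assumptions made for illustration, and the paper's own text normalization may differ.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(refs, hyps, strip_spaces=True):
    """Corpus-level CER: summed edits over summed reference characters.

    strip_spaces=True is an assumption of this sketch; whether spaces
    count as characters is a scoring convention that varies by setup.
    """
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(refs, hyps):
        if strip_spaces:
            ref = ref.replace(" ", "")
            hyp = hyp.replace(" ", "")
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("reference transcripts contain no characters")
    return total_edits / total_chars


# Hypothetical example: one substituted syllable out of five.
print(cer(["안녕하세요"], ["안녕하세오"]))
```

Here each precomposed Hangul syllable counts as one character, so the result depends on the Unicode normalization form of the transcripts. Scores such as the 11.1% and 18.9% reported above are only comparable when both systems are evaluated under the same normalization and spacing convention.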

Published in ETRI Journal

ISSN: 1225-6463 (Print); 2233-7326 (Online)
Publisher: Electronics and Telecommunications Research Institute (ETRI)
Country of publisher: Korea, Republic of
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Telecommunication; Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics
Website: https://onlinelibrary.wiley.com/journal/22337326
