A comprehensive multimodal dataset for contactless lip reading and acoustic analysis

Yao Ge; Chong Tang; Haobo Li; Zikang Chen; Jingyan Wang; Wenda Li; Jonathan Cooper; Kevin Chetty; Daniele Faccio; Muhammad Imran; Qammer H. Abbasi

doi:10.1038/s41597-023-02793-w

Scientific Data (Dec 2023)

Affiliations

Yao Ge: James Watt School of Engineering, University of Glasgow
Chong Tang: James Watt School of Engineering, University of Glasgow
Haobo Li: School of Physics & Astronomy, University of Glasgow
Zikang Chen: James Watt School of Engineering, University of Glasgow
Jingyan Wang: James Watt School of Engineering, University of Glasgow
Wenda Li: School of Science and Engineering, University of Dundee
Jonathan Cooper: James Watt School of Engineering, University of Glasgow
Kevin Chetty: Department of Security and Crime Science, University College London
Daniele Faccio: School of Physics & Astronomy, University of Glasgow
Muhammad Imran: James Watt School of Engineering, University of Glasgow
Qammer H. Abbasi: James Watt School of Engineering, University of Glasgow

DOI: https://doi.org/10.1038/s41597-023-02793-w
Journal volume & issue: Vol. 10, no. 1
pp. 1 – 17

Abstract

Small-scale motion detection using non-invasive remote sensing techniques has recently garnered significant interest in the field of speech recognition. This dataset paper aims to support the enhancement and restoration of speech information from diverse data sources. We introduce a novel multimodal dataset based on Radio Frequency, visual, text, audio, laser and lip landmark information, called RVTALL. Specifically, the dataset consists of 7.5 GHz channel impulse response (CIR) data from ultra-wideband (UWB) radars, 77 GHz frequency-modulated continuous-wave (FMCW) data from a millimeter-wave (mmWave) radar, visual and audio information, lip landmarks and laser data, offering a multimodal approach to speech recognition research. A depth camera is used to record the subject's lip landmarks and voice. Approximately 400 minutes of annotated speech profiles are provided, collected from 20 participants speaking 5 vowels, 15 words, and 16 sentences. The dataset has been validated and has potential for the investigation of lip reading and multimodal speech recognition.

Published in Scientific Data

ISSN: 2052-4463 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Science
Website: https://www.nature.com/sdata/
