CAAI Transactions on Intelligence Technology (Feb 2024)
Audio–visual keyword transformer for unconstrained sentence-level keyword spotting
Abstract
As one of the most effective ways to improve the accuracy and robustness of speech tasks, audio–visual fusion has recently been introduced into the field of Keyword Spotting (KWS). However, existing audio–visual keyword spotting models are limited to detecting isolated words, and keyword spotting in unconstrained speech remains a challenging problem. To this end, an Audio–Visual Keyword Transformer (AVKT) network is proposed to spot keywords in unconstrained video clips. The authors present a transformer classifier with learnable CLS tokens to extract distinctive keyword features from variable-length audio and visual inputs. The outputs of the audio and visual branches are combined in a decision fusion module. Just as humans can readily notice whether a keyword appears in a sentence, the AVKT network detects whether a video clip of a spoken sentence contains a pre-specified keyword. Moreover, the position of the keyword is localised from the attention map without additional position labels. Experimental results on the LRS2-KWS dataset and the authors' newly collected PKU-KWS dataset show that the accuracy of AVKT exceeds 99% in clean scenes and 85% in extremely noisy conditions. The code is available at https://github.com/jialeren/AVKT.
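To make the two mechanisms named in the abstract concrete, the sketch below shows, in assumed PyTorch code, how a learnable CLS token can summarise a variable-length feature sequence for binary keyword classification, and how decision-level fusion can combine the audio and visual branch scores. All module names, dimensions, and the equal-weight fusion rule are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch (assumed PyTorch design, not the authors' code) of a
# CLS-token transformer branch plus decision-level audio-visual fusion.
import torch
import torch.nn as nn

class CLSTransformerBranch(nn.Module):
    """One modality branch: prepend a learnable CLS token, encode, classify."""
    def __init__(self, feat_dim=256, num_heads=4, num_layers=4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(feat_dim, 1)  # keyword present / absent

    def forward(self, feats, pad_mask=None):
        # feats: (B, T, feat_dim), where T varies across clips (padded per batch)
        cls = self.cls_token.expand(feats.size(0), -1, -1)
        x = torch.cat([cls, feats], dim=1)
        if pad_mask is not None:  # (B, T), True at padded steps; CLS never masked
            keep_cls = torch.zeros(feats.size(0), 1, dtype=torch.bool,
                                   device=feats.device)
            pad_mask = torch.cat([keep_cls, pad_mask], dim=1)
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        # Classify from the CLS position only, so sequence length is arbitrary.
        return torch.sigmoid(self.head(x[:, 0]))

class DecisionFusionKWS(nn.Module):
    """Late (decision-level) fusion of the audio and visual branch scores."""
    def __init__(self):
        super().__init__()
        self.audio = CLSTransformerBranch()
        self.visual = CLSTransformerBranch()

    def forward(self, audio_feats, visual_feats):
        p_a = self.audio(audio_feats)
        p_v = self.visual(visual_feats)
        return 0.5 * (p_a + p_v)  # assumed equal-weight average of branch scores

# Example: an 80-frame audio sequence fused with a 25-frame lip sequence.
model = DecisionFusionKWS()
p = model(torch.randn(1, 80, 256), torch.randn(1, 25, 256))
print(p.shape)  # torch.Size([1, 1]) -- probability the keyword is present
```

Because the classifier reads only the CLS output, each branch accepts inputs of any length; the CLS token's attention weights over the sequence are also the natural place to read off the keyword position the abstract mentions, though that localisation step is omitted here.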
Keywords