Segment boundary detection directed attention for online end-to-end speech recognition

Junfeng Hou; Wu Guo; Yan Song; Li-Rong Dai

doi:10.1186/s13636-020-0170-z

EURASIP Journal on Audio, Speech, and Music Processing (Jan 2020)

Affiliations

Junfeng Hou: National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China
Wu Guo: National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China
Yan Song: National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China
Li-Rong Dai: National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China

DOI: https://doi.org/10.1186/s13636-020-0170-z
Journal volume & issue: Vol. 2020, no. 1
pp. 1 – 16

Abstract

Attention-based encoder-decoder models have recently shown competitive performance for automatic speech recognition (ASR) compared to conventional ASR systems. However, how to employ attention models for online speech recognition still needs to be explored. Unlike conventional attention models, in which the soft alignment is obtained by a pass over the entire input sequence, attention models for online recognition must learn an online alignment that attends to part of the input sequence monotonically when generating output symbols. Based on the fact that every output symbol corresponds to a segment of the input sequence, we propose a new attention mechanism for learning online alignment by decomposing the conventional alignment into two parts: segmentation (segment boundary detection with a hard decision) and segment-directed attention (information aggregation within the segment with soft attention). The boundary detection is conducted along the time axis from left to right, and a decision is made for each input frame about whether it is a segment boundary. When a boundary is detected, the decoder generates an output symbol by attending to the inputs within the corresponding segment. With the proposed attention mechanism, online speech recognition can be realized. Experimental results on the TIMIT and WSJ datasets show that the proposed attention mechanism achieves online performance comparable to that of state-of-the-art models.
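
The two-part decomposition described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the per-frame boundary probabilities are already available (in the paper they would come from a learned detector), uses a hypothetical threshold of 0.5 for the hard decision, scores frames with a plain dot product, and keeps the decoder query fixed across segments for brevity. The function name `segment_directed_attention` and all parameter names are illustrative.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def segment_directed_attention(enc, boundary_prob, query, threshold=0.5):
    """Scan frames left to right; emit one context vector per detected segment.

    enc:           (T, D) encoder outputs, one row per input frame
    boundary_prob: (T,) assumed boundary probabilities (hypothetical input)
    query:         (D,) decoder query, held fixed here as a simplification
    threshold:     assumed cut-off for the hard boundary decision
    """
    contexts = []
    start = 0
    for t in range(enc.shape[0]):
        # Hard decision: is frame t a segment boundary?
        if boundary_prob[t] >= threshold:
            segment = enc[start:t + 1]
            # Soft attention restricted to the frames of this segment.
            weights = softmax(segment @ query)
            contexts.append((start, t, weights @ segment))
            start = t + 1
    # Frames after the last boundary are left pending, as in a streaming setting.
    return contexts
```

Because each context vector depends only on frames up to the detected boundary, the loop never looks ahead, which is the property that makes the alignment usable online.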

Published in EURASIP Journal on Audio, Speech, and Music Processing

ISSN: 1687-4722 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Science: Physics: Acoustics. Sound; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://asmp-eurasipjournals.springeropen.com
