Split-Attention CNN and Self-Attention With RoPE and GCN for Voice Activity Detection

Yingwei Tan; Xuefeng Ding

doi:10.1109/ACCESS.2024.3486003

IEEE Access (Jan 2024)


Affiliations

Yingwei Tan: ORCiD; Volkswagen-Mobvoi (Beijing) Information Technology Company Ltd., Beijing, China
Xuefeng Ding: ORCiD; Volkswagen-Mobvoi (Beijing) Information Technology Company Ltd., Beijing, China

Journal volume & issue: Vol. 12
pp. 156673 – 156682

Abstract

In recent years, attention-based voice activity detection systems have become popular because they can capture a wide range of contextual information. Multi-head attention and position embedding are central to the attention architecture: multiple attention heads allow distinct segments of the sequence to be emphasized differently, whereas position embedding guides the modeling of dependencies among elements at different positions in the input sequence. In this work, we propose a new hybrid architecture for voice activity detection that combines a split-attention convolutional neural network with self-attention layers using rotary position embedding and graph convolutional networks, trained end to end. First, to enhance the learning of local features, we introduce channel-wise attention across the branches of a convolutional neural network to capitalize on their ability to capture cross-feature interactions and learn diverse representations. Second, to better learn global features, we treat each attention head as a node, which allows graph convolutional networks to identify correlations among the attention heads. Finally, to learn relative position information, we employ rotary position embedding, which encodes absolute positional information into the input sequence via a rotation matrix and thereby incorporates explicit relative position information into the self-attention module. To assess the effectiveness of our method, we conduct experiments on synthetic voice activity detection datasets, AVA-Speech datasets, and Kaggle voice activity detection datasets. The results show that our method outperforms the baseline systems across various noise conditions.
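The rotary position embedding described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `rope` and `attention_score`, the frequency base of 10000, and the pairing of adjacent feature dimensions are assumptions that follow the commonly used formulation of rotary embeddings. The sketch rotates each pair of features by an angle proportional to the absolute position, so that the dot product between a rotated query and a rotated key depends only on the difference between their positions.

```python
import numpy as np


def rope(x, positions, base=10000.0):
    """Apply a rotary position embedding to x (illustrative sketch).

    x:         array of shape (seq_len, dim); dim must be even.
    positions: array of shape (seq_len,) holding absolute positions.
    base:      frequency base (assumed value, not taken from the paper).
    """
    x = np.asarray(x, dtype=float)
    seq_len, dim = x.shape
    if dim % 2 != 0:
        raise ValueError("feature dimension must be even")

    # One rotation frequency per pair of adjacent features.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # shape (dim/2,)
    angles = np.outer(positions, inv_freq)             # shape (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)

    # Rotate each (even, odd) feature pair by its position-dependent angle.
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


def attention_score(q, k, q_pos, k_pos):
    """Unnormalized attention score between one query and one key after rotation."""
    q_rot = rope(q[None, :], np.array([q_pos]))
    k_rot = rope(k[None, :], np.array([k_pos]))
    return float(q_rot @ k_rot.T)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal(8)
    k = rng.standard_normal(8)
    # Both pairs of positions have the same offset (4), so the scores match.
    print(np.isclose(attention_score(q, k, 3, 7), attention_score(q, k, 13, 17)))
```

Because each rotation is orthogonal, the vector norms are unchanged, and shifting the query and key positions by the same amount leaves the score unchanged. This is the sense in which absolute rotations yield relative position information inside self-attention.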

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
