IEEE Access (Jan 2024)

Split-Attention CNN and Self-Attention With RoPE and GCN for Voice Activity Detection

  • Yingwei Tan
  • Xuefeng Ding

DOI
https://doi.org/10.1109/ACCESS.2024.3486003
Journal volume & issue
Vol. 12
pp. 156673–156682

Abstract

In recent years, attention-based voice activity detection systems have become popular owing to their ability to capture a diverse array of contextual information. The integration of multi-head attention and position embedding within the attention architecture is of pivotal importance: multiple attention heads allow the model to emphasize distinct segments of the sequence differently, while position embedding provides crucial guidance for modeling dependencies among elements at different positions in the input sequence. In this work, we propose a new hybrid architecture for voice activity detection that combines a split-attention convolutional neural network with self-attention layers equipped with rotary position embedding and graph convolutional networks, trained in an end-to-end manner. First, to enhance the learning of local features, we introduce channel-wise attention across the branches of a convolutional neural network, exploiting its ability to capture cross-feature interactions and learn diverse representations. Second, to better learn global features, we present a novel approach that treats each attention head as a node, enabling graph convolutional networks to identify correlations among the attention heads. Finally, to learn relative position information, we employ rotary position embedding, which encodes absolute positional information into the input sequence via a rotation matrix and thereby integrates explicit relative position information into the self-attention module. To assess the effectiveness of our method, we conduct experiments on synthetic voice activity detection datasets, the AVA-Speech dataset, and the Kaggle voice activity detection dataset. The results highlight the superiority of our method over baseline systems across various noise conditions.
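To make the channel-wise attention across branches concrete, the following minimal PyTorch sketch implements ResNeSt-style split attention over a set of parallel convolutional branches. The module name SplitAttention and the hyperparameters (radix=2, reduction=4) are illustrative assumptions; the abstract does not specify the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    """Channel-wise attention over parallel branches (ResNeSt-style sketch).

    Hyperparameters here are illustrative, not the paper's settings.
    """
    def __init__(self, channels, radix=2, reduction=4):
        super().__init__()
        self.radix = radix
        inter = max(channels // reduction, 8)
        self.fc1 = nn.Conv1d(channels, inter, kernel_size=1)
        self.fc2 = nn.Conv1d(inter, channels * radix, kernel_size=1)

    def forward(self, branches):
        # branches: list of `radix` feature maps, each (batch, channels, time)
        stacked = torch.stack(branches, dim=1)              # (B, R, C, T)
        gap = stacked.sum(dim=1).mean(dim=2, keepdim=True)  # global context (B, C, 1)
        attn = self.fc2(F.relu(self.fc1(gap)))              # (B, R*C, 1)
        attn = attn.view(attn.size(0), self.radix, -1, 1)   # (B, R, C, 1)
        attn = F.softmax(attn, dim=1)                       # compete across branches
        return (attn * stacked).sum(dim=1)                  # fused map (B, C, T)
```

The softmax across the branch dimension lets each channel weight the branches differently, which is the kind of cross-feature interaction the abstract refers to.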
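The idea of treating each attention head as a graph node can likewise be sketched as a single graph-convolution layer that mixes the head outputs through a learnable adjacency matrix. The class name HeadGCN and the softmax-normalized learnable adjacency are assumptions; the paper may construct the head graph differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadGCN(nn.Module):
    """One GCN layer over attention heads: each head's output is a node.

    The learnable adjacency `self.adj` is an illustrative choice; the
    paper's actual graph construction may differ.
    """
    def __init__(self, num_heads, head_dim):
        super().__init__()
        self.adj = nn.Parameter(torch.eye(num_heads))  # initialized to identity
        self.lin = nn.Linear(head_dim, head_dim)       # shared node transform

    def forward(self, heads):
        # heads: (batch, num_heads, seq_len, head_dim), i.e. multi-head outputs
        A = F.softmax(self.adj, dim=-1)                  # row-normalized adjacency
        mixed = torch.einsum('ij,bjtd->bitd', A, heads)  # aggregate neighbor heads
        return F.relu(self.lin(mixed))                   # transformed node features
```

Mixing head outputs through the adjacency matrix lets the network learn correlations among heads rather than simply concatenating them.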
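Rotary position embedding itself is well documented (Su et al., RoFormer), so the sketch below shows the standard interleaved formulation: each even/odd pair of query and key features is rotated by an angle proportional to its position, so that the attention dot product depends only on relative offsets.

```python
import torch

def apply_rope(x):
    """Rotary position embedding (standard interleaved form).

    x: (batch, heads, seq_len, head_dim), head_dim must be even.
    Applied to queries and keys before the attention dot product.
    """
    _, _, T, D = x.shape
    inv_freq = 1.0 / (10000 ** (torch.arange(0, D, 2, dtype=torch.float32) / D))
    angles = torch.outer(torch.arange(T, dtype=torch.float32), inv_freq)  # (T, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]      # even/odd feature pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin     # 2-D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because a rotation by angle m·θ composed with one by −n·θ depends only on (m − n)·θ, the query-key product after RoPE is a function of the relative position m − n, which is how encoding absolute positions via a rotation matrix yields explicit relative position information.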

Keywords