Tongxin xuebao (Jul 2022)

Self-supervised speech representation learning based on positive sample comparison and masking reconstruction

  • Wenlin ZHANG,
  • Xuepeng LIU,
  • Tong NIU,
  • Qi CHEN,
  • Dan QU

Journal volume & issue
Vol. 43
pp. 163–171

Abstract


To address the problem that existing contrastive-prediction-based self-supervised speech representation learning methods must construct a large number of negative samples, and that their performance depends on large training batches requiring substantial computing resources, a new speech representation learning method based on contrastive learning with only positive samples was proposed. Combined with a reconstruction loss, the proposed method obtains better representations at a lower training cost. The method was inspired by the SimSiam approach to self-supervised image representation learning. Using a siamese network architecture, two random augmentations of the input speech signal were processed by the same encoder network; a feed-forward prediction network was then applied on one side, and a stop-gradient operation was applied on the other. The model was trained to maximize the similarity between the two sides. Since negative samples were not required during training, a small batch size could be used and training efficiency was improved. Experimental results show that the representation model obtained by the new method matches or exceeds the performance of existing mainstream speech representation learning models on multiple downstream tasks.
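The abstract describes a SimSiam-style objective with a shared encoder, a predictor on one branch, a stop-gradient on the other, and an added masking-reconstruction term. The sketch below (not the authors' code) illustrates that training step in PyTorch; the encoder, predictor, decoder sizes, augmentations, masking scheme, and loss weighting are illustrative assumptions, not details taken from the paper.

    # Minimal sketch of a SimSiam-style speech objective with a reconstruction term.
    # All architectural choices here are assumptions for illustration only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SiameseSpeechModel(nn.Module):
        def __init__(self, feat_dim=80, hidden_dim=256):
            super().__init__()
            # placeholder frame-level encoder (e.g., over log-mel features)
            self.encoder = nn.Sequential(
                nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
            )
            # feed-forward predictor applied on one branch only
            self.predictor = nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim // 2), nn.ReLU(),
                nn.Linear(hidden_dim // 2, hidden_dim),
            )
            # decoder for the masking-reconstruction term
            self.decoder = nn.Linear(hidden_dim, feat_dim)

        def branch_loss(self, p, z):
            # negative cosine similarity; stop-gradient on the target branch
            return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

        def forward(self, view1, view2, masked, target, recon_weight=1.0):
            z1, z2 = self.encoder(view1), self.encoder(view2)
            p1, p2 = self.predictor(z1), self.predictor(z2)
            # symmetric similarity loss; no negative samples are needed
            sim_loss = 0.5 * (self.branch_loss(p1, z2) + self.branch_loss(p2, z1))
            # reconstruct the original frames from the masked input's encoding
            recon = self.decoder(self.encoder(masked))
            recon_loss = F.l1_loss(recon, target)
            return sim_loss + recon_weight * recon_loss

    # toy usage: batch of 4 utterances, 100 frames, 80-dim features
    model = SiameseSpeechModel()
    x = torch.randn(4, 100, 80)
    view1 = x + 0.01 * torch.randn_like(x)          # random augmentation 1
    view2 = x + 0.01 * torch.randn_like(x)          # random augmentation 2
    masked = x * (torch.rand(4, 100, 1) > 0.2)      # randomly masked frames
    loss = model(view1, view2, masked, x)
    loss.backward()

Because the similarity target branch is detached, gradients flow only through the predictor side, which is the mechanism that lets this objective avoid negative samples and train with small batches.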

Keywords