IEEE Access (Jan 2024)

Unveiling the Linguistic Capabilities of a Self-Supervised Speech Model Through Cross-Lingual Benchmark and Layer-Wise Similarity Analysis

  • Takanori Ashihara,
  • Marc Delcroix,
  • Yusuke Ijima,
  • Makio Kashino

DOI
https://doi.org/10.1109/ACCESS.2024.3428364
Journal volume & issue
Vol. 12
pp. 98835–98855

Abstract

Self-supervised learning (SSL), an unsupervised representation learning technique, has received widespread attention across various modalities. Speech, with its inherent complexity encompassing acoustic (e.g., speaker, phoneme, and paralinguistic cues) and linguistic (e.g., words, semantics, and syntax) aspects, prompts a fundamental question: how well can speech SSL models capture linguistic knowledge solely from speech data? This study comprehensively analyzes off-the-shelf SSL models using three methods: probing tasks, layer contribution examinations, and layer-wise similarity analysis. For the probing tasks, to examine cross-lingual conditions, we introduce SpeechGLUE and SpeechJGLUE, speech versions of the General Language Understanding Evaluation (GLUE) benchmark and its Japanese variant (JGLUE), both of which comprise diverse natural language understanding tasks. The probing system feeds a weighted sum, with trainable weights, of all SSL layers’ outputs into downstream models, offering insight into which layers predominantly contribute to solving the tasks. The results reveal that speech SSL models can encode linguistic information, albeit less sophisticated information than text SSL models encode. Moreover, the later layers are mainly used to tackle the benchmark tasks; to highlight their primary linguistic encoding role, we call them linguistic encoding layers (LELs). However, in cross-lingual scenarios, e.g., assessing English SSL models on SpeechJGLUE, the layer contributions become more uniform, suggesting difficulty in identifying suitable layers or a reliance on diverse cues spread across layers. Nevertheless, some English SSL models outperform Japanese models on SpeechJGLUE, implying robustness to language variation. The similarity analysis reveals a block structure within the LELs, particularly evident in the English WavLM model, and this structure becomes unclear with non-English or noise input, reaffirming the presence of LELs.
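
The trainable weighted sum described in the abstract can be sketched as follows. This is a minimal illustration under assumptions, not the authors' exact implementation: it presumes a frozen SSL encoder that exposes per-layer hidden states, and the class and variable names are hypothetical.

    import torch
    import torch.nn as nn

    class WeightedLayerSum(nn.Module):
        """Softmax-normalized scalar weights over all SSL layer outputs.

        After probe training, the learned weights indicate which layers
        the downstream model relies on most.
        """
        def __init__(self, num_layers: int):
            super().__init__()
            # One trainable logit per SSL layer, initialized uniformly.
            self.layer_logits = nn.Parameter(torch.zeros(num_layers))

        def forward(self, hidden_states):
            # hidden_states: list of L tensors, each (batch, time, dim),
            # e.g. the layer outputs of a frozen wav2vec 2.0 / WavLM model.
            w = torch.softmax(self.layer_logits, dim=0)        # (L,)
            stacked = torch.stack(hidden_states, dim=0)        # (L, B, T, D)
            return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (B, T, D)

Inspecting the softmax weights after training is what reveals the concentration on later layers that the abstract reports.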
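For the layer-wise similarity analysis, linear centered kernel alignment (CKA) is one common metric for comparing layer representations; the paper's exact measure may differ, so treat this as an assumed stand-in that shows the general shape of such an analysis.

    import numpy as np

    def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
        """Linear CKA between two layers' representations.

        x: (n_frames, d1), y: (n_frames, d2), features extracted from
        the same inputs at two different layers.
        """
        # Center each feature dimension before comparison.
        x = x - x.mean(axis=0, keepdims=True)
        y = y - y.mean(axis=0, keepdims=True)
        num = np.linalg.norm(y.T @ x, ord="fro") ** 2
        den = (np.linalg.norm(x.T @ x, ord="fro")
               * np.linalg.norm(y.T @ y, ord="fro"))
        return float(num / den)

    # Computing sim[i, j] = linear_cka(feats[i], feats[j]) over all layer
    # pairs yields a similarity matrix; a high-similarity block among the
    # upper layers would correspond to the LEL block structure.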

Keywords