IEEE Access (Jan 2024)

Unveiling the Linguistic Capabilities of a Self-Supervised Speech Model Through Cross-Lingual Benchmark and Layer-Wise Similarity Analysis

  • Takanori Ashihara,
  • Marc Delcroix,
  • Yusuke Ijima,
  • Makio Kashino

DOI
https://doi.org/10.1109/ACCESS.2024.3428364
Journal volume & issue
Vol. 12
pp. 98835–98855

Abstract

Self-supervised learning (SSL), an unsupervised representation learning technique, has received widespread attention across various modalities. Speech, with its inherent complexity encompassing acoustic (e.g., speaker, phoneme, and paralinguistic cues) and linguistic (e.g., words, semantics, and syntax) aspects, prompts a fundamental question: how well can speech SSL models capture linguistic knowledge solely from speech data? This study comprehensively analyzes off-the-shelf SSL models using three methods: probing tasks, layer contribution examinations, and layer-wise similarity analysis. For the probing tasks, to examine cross-lingual conditions, we introduce SpeechGLUE and SpeechJGLUE, speech versions of the General Language Understanding Evaluation (GLUE) benchmark and its Japanese variant (JGLUE), both of which comprise diverse natural language understanding tasks. The probing system feeds a weighted sum, with trainable weights, of all SSL layers’ outputs into downstream models, offering insight into which layers predominantly contribute to solving the tasks. The results reveal that speech SSL models can encode linguistic information, albeit less sophisticated information than text SSL models encode. Moreover, the later layers are mainly used to tackle the benchmark tasks; to highlight their primary linguistic encoding role, we call them linguistic encoding layers (LELs). However, in cross-lingual scenarios, e.g., assessing English SSL models on SpeechJGLUE, the layer contributions become more uniform, suggesting difficulty in identifying suitable layers or a reliance on diverse cues spread across layers. Nevertheless, some English SSL models outperform Japanese models on SpeechJGLUE, implying robustness to language variation. The similarity analysis reveals a block structure within the LELs, particularly evident in the English WavLM model, and this structure becomes unclear with non-English or noise input, reaffirming the presence of LELs.
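
The trainable weighted sum described in the abstract can be sketched as follows. This is a minimal illustration under assumptions, not the authors' exact implementation: it presumes a frozen SSL encoder that exposes per-layer hidden states, and the class and variable names are hypothetical.

    import torch
    import torch.nn as nn

    class WeightedLayerSum(nn.Module):
        """Softmax-normalized scalar weights over all SSL layer outputs.

        After probe training, the learned weights indicate which layers
        the downstream model relies on most.
        """
        def __init__(self, num_layers: int):
            super().__init__()
            # One trainable logit per SSL layer, initialized uniformly.
            self.layer_logits = nn.Parameter(torch.zeros(num_layers))

        def forward(self, hidden_states):
            # hidden_states: list of L tensors, each (batch, time, dim),
            # e.g. the layer outputs of a frozen wav2vec 2.0 / WavLM model.
            w = torch.softmax(self.layer_logits, dim=0)        # (L,)
            stacked = torch.stack(hidden_states, dim=0)        # (L, B, T, D)
            return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (B, T, D)

Inspecting the softmax weights after training is what reveals the concentration on later layers that the abstract reports.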
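For the layer-wise similarity analysis, linear centered kernel alignment (CKA) is one common metric for comparing layer representations; the paper's exact measure may differ, so treat this as an assumed stand-in that shows the general shape of such an analysis.

    import numpy as np

    def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
        """Linear CKA between two layers' representations.

        x: (n_frames, d1), y: (n_frames, d2), features extracted from
        the same inputs at two different layers.
        """
        # Center each feature dimension before comparison.
        x = x - x.mean(axis=0, keepdims=True)
        y = y - y.mean(axis=0, keepdims=True)
        num = np.linalg.norm(y.T @ x, ord="fro") ** 2
        den = (np.linalg.norm(x.T @ x, ord="fro")
               * np.linalg.norm(y.T @ y, ord="fro"))
        return float(num / den)

    # Computing sim[i, j] = linear_cka(feats[i], feats[j]) over all layer
    # pairs yields a similarity matrix; a high-similarity block among the
    # upper layers would correspond to the LEL block structure.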

Keywords