End‐to‐end deep learning classification of vocal pathology using stacked vowels

George S. Liu; Jordan M. Hodges; Jingzhi Yu; C. Kwang Sung; Elizabeth Erickson‐DiRenzo; Philip C. Doyle

doi:10.1002/lio2.1144

Laryngoscope Investigative Otolaryngology (Oct 2023)

Affiliations

George S. Liu: Department of Otolaryngology Head and Neck Surgery, Stanford University School of Medicine, Stanford University, Stanford, California, USA
Jordan M. Hodges: Computer Science Department, School of Engineering, Stanford University, Stanford, California, USA
Jingzhi Yu: Biomedical Informatics, Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, California, USA
C. Kwang Sung: Department of Otolaryngology Head and Neck Surgery, Stanford University School of Medicine, Stanford University, Stanford, California, USA
Elizabeth Erickson‐DiRenzo: Department of Otolaryngology Head and Neck Surgery, Stanford University School of Medicine, Stanford University, Stanford, California, USA
Philip C. Doyle: Department of Otolaryngology Head and Neck Surgery, Stanford University School of Medicine, Stanford University, Stanford, California, USA

DOI: https://doi.org/10.1002/lio2.1144
Journal volume & issue: Vol. 8, no. 5
pp. 1312 – 1318

Abstract

Objectives: Advances in artificial intelligence (AI) technology have increased the feasibility of classifying voice disorders using voice recordings as a screening tool. This work develops upon previous models that take in single vowel recordings by analyzing multiple vowel recordings simultaneously to enhance prediction of vocal pathology.

Methods: Voice samples from the Saarbruecken Voice Database, including three sustained vowels (/a/, /i/, /u/) from 687 healthy human participants and 334 dysphonic patients, were used to train 1‐dimensional convolutional neural network models for multiclass classification of healthy, hyperfunctional dysphonia, and laryngitis voice recordings. Three models were trained: (1) a baseline model that analyzed individual vowels in isolation, (2) a stacked vowel model that analyzed three vowels (/a/, /i/, /u/) in the neutral pitch simultaneously, and (3) a stacked pitch model that analyzed the /a/ vowel in three pitches (low, neutral, and high) simultaneously.

Results: For multiclass classification of healthy, hyperfunctional dysphonia, and laryngitis voice recordings, the stacked vowel model demonstrated higher performance compared with the baseline and stacked pitch models (F1 score 0.81 vs. 0.77 and 0.78, respectively). Specifically, the stacked vowel model achieved higher performance for class‐specific classification of hyperfunctional dysphonia voice samples compared with the baseline and stacked pitch models (F1 score 0.56 vs. 0.49 and 0.50, respectively).

Conclusions: This study demonstrates the feasibility and potential of analyzing multiple sustained vowel recordings simultaneously to improve AI‐driven screening and classification of vocal pathology. The stacked vowel model architecture in particular offers promise to enhance such an approach.

Lay Summary: AI analysis of multiple vowel recordings can improve classification of voice pathologies compared with models using a single sustained vowel and offer a strategy to enhance AI‐driven screening of voice disorders.

Level of Evidence: 3
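
One way to picture the difference between the baseline and stacked vowel inputs is as a difference in the number of input channels seen by a 1-dimensional convolution. The sketch below is illustrative only and is not the authors' implementation: it assumes the three vowels are stacked as channels of a single input, and the helper names (`stack_recordings`, `conv1d`, `synthetic_vowel`), the 8000 Hz sample rate, the kernel width, the stride, and the synthetic sine tones standing in for real recordings are all hypothetical choices made for the example.

```python
import math
import random


def stack_recordings(recordings, length):
    """Crop or zero-pad each mono recording to `length` samples and stack
    the results as channels, giving a (channels, length) nested list."""
    stacked = []
    for samples in recordings:
        clipped = list(samples[:length])
        clipped += [0.0] * (length - len(clipped))  # zero-pad short clips
        stacked.append(clipped)
    return stacked


def conv1d(x, kernels, stride=1):
    """Unpadded 1-D convolution followed by ReLU.

    x: (in_channels, length); kernels: (out_channels, in_channels, width).
    Each output value sums over every input channel, which is how a
    stacked input lets one filter combine evidence from several vowels.
    """
    length = len(x[0])
    out = []
    for kernel in kernels:
        width = len(kernel[0])
        row = []
        for start in range(0, length - width + 1, stride):
            total = 0.0
            for c, channel in enumerate(x):
                for k in range(width):
                    total += kernel[c][k] * channel[start + k]
            row.append(max(total, 0.0))  # ReLU activation
        out.append(row)
    return out


def synthetic_vowel(freq_hz, n_samples, sample_rate=8000):
    """Placeholder signal: a pure tone standing in for a sustained vowel."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]


random.seed(0)
vowels = {
    "a": synthetic_vowel(220, 900),
    "i": synthetic_vowel(240, 1000),
    "u": synthetic_vowel(200, 1100),
}

# Baseline-style input: one vowel, one channel.
baseline_input = stack_recordings([vowels["a"]], length=1000)
# Stacked-vowel-style input: /a/, /i/, /u/ as three channels.
stacked_input = stack_recordings(
    [vowels["a"], vowels["i"], vowels["u"]], length=1000
)

# Four random filters of width 9 spanning all three channels (assumed sizes).
kernels = [
    [[random.uniform(-0.1, 0.1) for _ in range(9)] for _ in range(3)]
    for _ in range(4)
]
features = conv1d(stacked_input, kernels, stride=4)
```

Under these assumptions, the stacked pitch variant would differ only in which recordings are passed to `stack_recordings` (the /a/ vowel at low, neutral, and high pitch instead of three different vowels). A real model would add further convolutional layers, pooling, and a three-class output layer.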

Published in Laryngoscope Investigative Otolaryngology

ISSN: 2378-8038 (Online)
Publisher: Wiley
Country of publisher: United States
LCC subjects: Medicine: Otorhinolaryngology; Medicine: Surgery
Website: https://onlinelibrary.wiley.com/journal/23788038
