Frontiers in Physics (Jul 2024)

IntervoxNet: a novel dual-modal audio-text fusion network for automatic and efficient depression detection from interviews

  • Huijun Ding,
  • Zhou Du,
  • Ziwei Wang,
  • Junqi Xue,
  • Zhaoguo Wei,
  • Kongjun Yang,
  • Shan Jin,
  • Zhiguo Zhang,
  • Jianhong Wang

DOI
https://doi.org/10.3389/fphy.2024.1430035
Journal volume & issue
Vol. 12

Abstract
Depression is a prevalent mental health problem across the globe, presenting significant social and economic challenges. Early detection and treatment are pivotal in reducing these impacts and improving patient outcomes. Traditional diagnostic methods rely largely on subjective assessments by psychiatrists, underscoring the importance of developing automated and objective diagnostic tools. This paper presents IntervoxNet, a novel computer-aided detection system designed specifically for analyzing interview audio. IntervoxNet incorporates a dual-modal approach, utilizing both the Audio Mel-Spectrogram Transformer (AMST) for audio processing and a hybrid model combining Bidirectional Encoder Representations from Transformers with a Convolutional Neural Network (BERT-CNN) for text analysis. Evaluated on the DAIC-WOZ database, IntervoxNet demonstrates excellent performance, achieving an F1 score, recall, precision, and accuracy of 0.90, 0.92, 0.88, and 0.86, respectively, thereby surpassing existing state-of-the-art methods. These results demonstrate IntervoxNet’s potential as a highly effective and efficient tool for rapid depression screening in interview settings.
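The dual-modal design described above pairs an audio branch (AMST on Mel spectrograms) with a text branch (BERT-CNN on transcripts) and combines their outputs for a single prediction. A minimal, hypothetical sketch of this kind of late feature fusion is shown below; the function names and embedding dimensions are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of dual-modal late fusion: each branch produces a
# fixed-size embedding, and the two are concatenated before classification.
# Dimensions and scoring are illustrative only, not from the paper.

def audio_branch(mel_spectrogram):
    # Placeholder for an AMST-style encoder: returns a pooled embedding.
    return [sum(frame) / len(frame) for frame in mel_spectrogram]

def text_branch(token_embeddings):
    # Placeholder for a BERT-CNN-style encoder: returns a pooled embedding.
    return [sum(dim) / len(dim) for dim in zip(*token_embeddings)]

def fuse(audio_emb, text_emb):
    # Late fusion by concatenation; a classifier head would follow.
    return audio_emb + text_emb

mel = [[0.2, 0.4], [0.6, 0.8]]          # 2 frames x 2 Mel bins
tokens = [[0.1, 0.3, 0.5], [0.7, 0.9, 0.1]]  # 2 tokens x 3 dims
fused = fuse(audio_branch(mel), text_branch(tokens))
print(len(fused))  # fused vector has audio dims + text dims = 5
```

The concatenated vector would then feed a classifier that outputs the depression-screening decision; the paper evaluates this overall pipeline on DAIC-WOZ.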

Keywords