IEEE Access (Jan 2024)

Efficient Real-Time Smart Keyword Spotting Using Spectrogram-Based Hybrid CNN-LSTM for Edge System

  • Infall Syafalni,
  • Clarence Amadeus,
  • Nana Sutisna,
  • Trio Adiono

DOI
https://doi.org/10.1109/ACCESS.2024.3380350
Journal volume & issue
Vol. 12
pp. 43109 – 43125

Abstract

Read online

Keyword Spotting (KWS) is the task of recognizing spoken command words from a database. With recent application human-machine interactions, KWS systems require real-time performance, where edge computing is a preferable option. To allow KWS systems to work on fast and real-time implementation, a low-complexity yet high-accurate AI model is mandatory. In this paper, we propose a comprehensive voice command recognition system design and its hardware implementation. The proposed AI model considered in this system is SpectroNet-based and an efficient hybrid CNN-LSTM architecture with low complexity. Jetson Xavier NX is an edge device because of its strong computational power as an embedded device. The implementation result shows the proposed method offers quite good in terms of accuracy, indicated by no accuracy drop between the model implemented in PC and Jetson Xavier. However, the inference time is quite high, which is 180 ms/step. To improve the speed of the system, the TensorRT library is used to further optimize the model. Optimization of the model is found effective, reducing 59.35% of the total operation performed in SpectroNet when FP32 precision is used, and 59.63% when FP16 precision is used. The model is also sped up by 45% if FP32 precision mode is used and 62% if FP16 precision mode is used. However, there is a slight accuracy drop of 2.68% if FP32 precision mode is used and 4.84% if FP16 precision mode is used. This slight drop in accuracy is considered negligible compared to the performance boost that TensorRT gives. The work is useful for intelligent control systems such as smart vehicles, smartphones, computers, and smart communications.

Keywords