IEEE Access (Jan 2024)
Priority-Encoder Ensemble for Speech Recognition
Abstract
Advances in computational capabilities and the availability of vast datasets have propelled the performance of Automatic Speech Recognition (ASR) systems. However, ASR remains a complex task that must account for diverse factors such as spoken tone, intonation, accent, and pitch modulation. To tackle these challenges, ensembles of Large Language Models (LLMs) have emerged as a promising approach, harnessing the strengths of multiple models to improve recognition accuracy. Whatever combination strategy they employ, these ensembles often incur substantial inference time, which limits their applicability in real-life scenarios. In this study, we introduce a novel ensemble strategy for ASR systems, the Priority-Encoder Ensemble (PE-Ensemble). The PE-Ensemble employs a meta-learning-based Decider model to dynamically select the optimal model from the ensemble for each inference, significantly reducing the computational load and memory requirements. Unlike traditional ensembles, in which all models are loaded into memory, our approach requires only a single model to be loaded, enhancing efficiency in real-world applications such as unmanned kiosks. We evaluate the PE-Ensemble against the commonly used average ensemble strategy and against the individual base models. The results demonstrate that the PE-Ensemble outperforms both the average ensemble and the individual base models in prediction accuracy as well as in computational time during inference. This gain in accuracy, coupled with the substantial reduction in computational load, highlights the efficacy and practical applicability of the proposed PE-Ensemble approach.
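As a rough illustration of the selection idea described above, the following Python sketch contrasts an average ensemble, which runs every model, with a selector that loads and runs only the model chosen for each input. It is not the authors' implementation: the names `average_ensemble`, `SelectiveEnsemble`, `toy_decider`, `load_model_a`, and `load_model_b` are hypothetical, the "models" are toy scoring functions, and a hand-written rule stands in for the learned meta-learning-based Decider.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical stand-in: a "model" maps a feature vector to per-class scores.
Model = Callable[[Sequence[float]], List[float]]


def average_ensemble(models: List[Model], features: Sequence[float]) -> int:
    """Baseline: run every model, average the scores, return the best class."""
    all_scores = [model(features) for model in models]
    n_classes = len(all_scores[0])
    mean = [sum(s[c] for s in all_scores) / len(all_scores) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


class SelectiveEnsemble:
    """Keeps at most one base model in memory, chosen per input by a decider."""

    def __init__(
        self,
        loaders: Dict[str, Callable[[], Model]],
        decider: Callable[[Sequence[float]], str],
    ) -> None:
        self.loaders = loaders          # name -> function that loads the model
        self.decider = decider          # features -> name of the model to use
        self.loaded_name: Optional[str] = None
        self.loaded_model: Optional[Model] = None
        self.load_count = 0             # how many times a model was (re)loaded

    def predict(self, features: Sequence[float]) -> int:
        name = self.decider(features)
        if name != self.loaded_name:
            # Replace the resident model; the previous one is dropped.
            self.loaded_model = self.loaders[name]()
            self.loaded_name = name
            self.load_count += 1
        scores = self.loaded_model(features)
        return max(range(len(scores)), key=scores.__getitem__)


def load_model_a() -> Model:
    # Toy model: scores class 0 by the first feature.
    return lambda x: [x[0], 1.0 - x[0]]


def load_model_b() -> Model:
    # Toy model: scores class 1 by the second feature.
    return lambda x: [1.0 - x[1], x[1]]


def toy_decider(features: Sequence[float]) -> str:
    # Assumption: a fixed rule replaces the trained Decider for illustration.
    return "a" if features[0] >= 0.5 else "b"


ensemble = SelectiveEnsemble({"a": load_model_a, "b": load_model_b}, toy_decider)
prediction = ensemble.predict([0.9, 0.2])
```

In this sketch the cost per input is one decider call plus one base-model call, rather than one call per ensemble member. Consecutive inputs routed to the same model reuse the resident copy, while a change of route triggers a reload, so how often the decider switches models would matter in practice.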
Keywords