IEEE Access (Jan 2024)

Analysis and Enhancement of Resilience for LSTM Accelerators Using Residue-Based CEDs

  • Nooshin Nosrati,
  • Zainalabedin Navabi

DOI
https://doi.org/10.1109/ACCESS.2024.3386431
Journal volume & issue
Vol. 12
pp. 52851 – 52866

Abstract

As Long Short-Term Memory (LSTM) accelerators are increasingly employed in safety-critical applications with high reliability demands, protecting them against errors becomes imperative. Traditional protection techniques for LSTMs are costly, conflict with the strict area and power constraints of neural network accelerators, or introduce performance overhead that is untenable for real-time and latency-critical accelerators. In this paper, we propose residue-based Concurrent Error Detection (CED) schemes to detect transient faults and alleviate their impact during LSTM computations. CED units are employed in a coarse-grain or fine-grain fashion, depending on the granularity of the components being protected, which ranges from the whole LSTM computation down to individual processing elements. In pursuit of a more cost-effective strategy, we apply fine-grain CEDs selectively, based on the resilience characteristics of the LSTM. The selections are made either spatially, over LSTM synaptic weights, or temporally, over LSTM time steps. The experimental results show that the proposed residue-based CEDs (i) achieve nearly complete fault coverage even under extremely large bit error rates, (ii) significantly decrease misprediction rates compared to the unprotected LSTM, and (iii) incur low overhead without compromising performance. Our method is also compared with modular redundancy techniques, namely Double and Triple Modular Redundancy (DMR and TMR).
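To illustrate the general idea behind a residue-based check, the following is a minimal sketch, not the authors' design. It assumes integer (fixed-point) operands, a check modulus of 3, and a single multiply-accumulate standing in for one processing element. The abstract specifies none of these, and the names `mac`, `predicted_residue`, `residue_check`, and `flip_bit` are hypothetical.

```python
MODULUS = 3  # assumed check modulus; the abstract does not state the one used


def mac(weights, inputs):
    """Integer multiply-accumulate, standing in for one processing element."""
    return sum(w * x for w, x in zip(weights, inputs))


def predicted_residue(weights, inputs, modulus=MODULUS):
    """Residue of the MAC result, computed only from the operand residues."""
    acc = 0
    for w, x in zip(weights, inputs):
        acc = (acc + (w % modulus) * (x % modulus)) % modulus
    return acc


def residue_check(result, weights, inputs, modulus=MODULUS):
    """Return True when the result agrees with the independently predicted residue."""
    return result % modulus == predicted_residue(weights, inputs, modulus)


def flip_bit(value, bit):
    """Model a transient single-bit fault in the computed result."""
    return value ^ (1 << bit)


weights = [3, -2, 5]
inputs = [4, 7, -1]
result = mac(weights, inputs)

fault_free_ok = residue_check(result, weights, inputs)
faulty_ok = residue_check(flip_bit(result, 4), weights, inputs)
```

In this sketch a single-bit flip changes the result by a power of two, which is never a multiple of 3, so the check flags it. An error whose magnitude is a multiple of the modulus would go undetected, which is one reason the choice of modulus matters in practice.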

Keywords