Heliyon (Jan 2024)
DPro-SM – A distributed framework for proactive straggler mitigation using LSTM
Abstract
Distributed Deep Learning (DDL) is a recent advancement in deep learning driven by the growth of big data and high-performance computing. The rapid rise in data volume and network complexity has accelerated the adoption of DDL. Distributing a network across nodes, however, introduces heavy communication and computation overhead, which increases training time and lowers accuracy. The primary cause of communication delay is the presence of straggler nodes, which create a communication bottleneck. Data parallelism in DDL incurs substantial communication costs due to the enormous volume of parameter transfers. Newly developed model-parallel methods may reduce this communication effort, but they introduce load imbalance and severe straggler issues. To address these problems, we propose DPro-SM, a distributed framework for proactive straggler mitigation using LSTM in distributed deep learning. DPro-SM uses an LSTM to predict the completion time of each worker and proactively allocates resources to reduce overall training time. The results show that DPro-SM significantly reduces training time and improves the scalability and efficiency of large-scale machine learning tasks.
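The abstract's core mechanism, forecasting each worker's completion time with an LSTM so that stragglers can be handled before they stall synchronization, can be sketched as follows. This is a minimal, hypothetical PyTorch illustration assuming per-worker timing histories as input and a simple median-based flagging rule; the names CompletionTimeLSTM and flag_stragglers, the window size, and the threshold factor are all assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of LSTM-based straggler prediction: an LSTM consumes
# each worker's recent per-iteration completion times and forecasts the next
# one; workers whose forecast exceeds a threshold are flagged as likely
# stragglers. The flagging rule below is an assumption, not DPro-SM's actual
# resource-allocation policy.
import torch
import torch.nn as nn

class CompletionTimeLSTM(nn.Module):
    def __init__(self, hidden_size: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size,
                            batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, times: torch.Tensor) -> torch.Tensor:
        # times: (num_workers, window, 1), recent completion times per worker
        out, _ = self.lstm(times)
        # Predict the next completion time from the final hidden state.
        return self.head(out[:, -1, :])

def flag_stragglers(model: CompletionTimeLSTM, history: torch.Tensor,
                    threshold_factor: float = 1.5) -> list:
    """Flag workers whose predicted next completion time exceeds
    threshold_factor times the cluster median (an assumed rule)."""
    with torch.no_grad():
        preds = model(history).squeeze(-1)  # (num_workers,)
    threshold = threshold_factor * preds.median()
    return (preds > threshold).nonzero(as_tuple=True)[0].tolist()

if __name__ == "__main__":
    torch.manual_seed(0)
    num_workers, window = 8, 12
    # Synthetic per-iteration timings; worker 3 is consistently slow.
    history = torch.rand(num_workers, window, 1) * 0.2 + 1.0
    history[3] += 2.0
    model = CompletionTimeLSTM()
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    # Fit the LSTM to forecast each worker's next timing from its history.
    for _ in range(200):
        pred = model(history[:, :-1, :])
        loss = loss_fn(pred, history[:, -1, :])
        opt.zero_grad()
        loss.backward()
        opt.step()
    print("likely stragglers:", flag_stragglers(model, history))
```

In a real deployment the model would be trained online on observed iteration timings, and a flagged prediction would trigger proactive resource reallocation rather than a simple report; the synthetic data here only demonstrates the predict-then-flag flow.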