Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers

In modern machine learning, parallelization of training is an important strategy for increasing scale. Asynchronous stochastic gradient descent (ASGD), which maximizes the utilization of available hardware by avoiding waiting for slow workers. However, with constant step sizes, the convergence of AS...

Read Original Article →

Source

http://arxiv.org/abs/2606.13287v1