Introduction
Distributed Training
The goal of distributed training is to significantly reduce the training time of deep learning models without degrading their performance.
Pipelines
Motivation
We consider distributed optimization under communication constraints for training deep learning models. Our method differs from the state-of-the-art parameter-averaging scheme EASGD in a number of ways (a minimal sketch contrasting the two update rules follows this list):
- objective formulation that does not change the location of stationary points compared to the original optimization problem
- avoiding the convergence slowdowns caused by pulling local workers that are descending toward different local minima to the average of their parameters
- breaking the curse of symmetry - the phenomenon of being trapped in poorly generalizing, sub-optimal solutions in a symmetric non-convex landscape
- communication efficiency and alignment with current hardware architecture
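To make the contrast concrete, here is a minimal sketch of the two coupling rules on flat parameter tensors. The function names, the lr/pull values, and the sequential center update are illustrative simplifications, not the repository's API: the point is only that an elastic-averaging scheme pulls every worker toward a shared center (roughly the average), whereas a leader scheme pulls every worker toward the current best worker.

```python
import torch

def easgd_style_step(workers, grads, center, lr=0.1, pull=0.01):
    """Simplified elastic-averaging update: each worker is pulled toward a
    shared center variable, and the center drifts toward the workers."""
    for w, g in zip(workers, grads):
        w -= lr * g + pull * (w - center)   # gradient step + elastic pull to the center
        center += pull * (w - center)       # center moves toward the workers

def lsgd_style_step(workers, grads, losses, lr=0.1, pull=0.01):
    """Simplified leader-style update: each worker is pulled toward the
    current best (lowest-loss) worker instead of toward an average."""
    best = min(range(len(workers)), key=lambda i: losses[i])
    leader = workers[best].clone()          # snapshot the leader's parameters
    for w, g in zip(workers, grads):
        w -= lr * g + pull * (w - leader)   # gradient step + pull toward the leader
```

Pulling toward the leader rather than the mean avoids dragging workers toward a point that none of them is actually descending to.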
Dealing with the curse of symmetry
- From EA(S)GD to L(S)GD
- An illustration of the "curse of symmetry"
- An example of a highly non-convex problem (a toy sketch of such a landscape follows this list)
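The illustrations themselves live in the repository; as a stand-in, the toy double-well function below shows the failure mode these figures point at. It is an assumed example for exposition, not one taken from the paper: in a symmetric non-convex landscape, two workers can sit in symmetric minima whose average is a poor point between them, which is exactly the situation a parameter-averaging pull makes worse.

```python
def double_well(x):
    """Symmetric, non-convex toy landscape with minima at x = +1 and x = -1
    and a local maximum exactly at their average, x = 0."""
    return (x ** 2 - 1) ** 2

w1, w2 = 1.0, -1.0                       # two workers, each in one of the symmetric minima
avg = (w1 + w2) / 2.0                    # parameter averaging lands exactly between them
print(double_well(w1), double_well(w2))  # 0.0 0.0 -> both workers are at minima
print(double_well(avg))                  # 1.0     -> their average sits on the bump
```

A leader-based pull moves both workers toward whichever of them currently has the lower loss, instead of toward the bump between them.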
Benchmarks
4 workers (1 server with 4 GTX-1080 GPUs)
12 workers (3 servers, each with 4 GTX-1080 GPUs)
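For readers mapping these worker counts onto processes, the following is a generic sketch of how such a topology could be initialized with torch.distributed, one process per GPU. The helper name, master address, and port are placeholders, and this is not the repository's launch code; the scripts in the GitHub repository are the authoritative reference.

```python
import os
import torch
import torch.distributed as dist

def init_worker(node_rank, local_rank, gpus_per_node=4, num_nodes=3,
                master_addr="10.0.0.1", master_port="29500"):
    """Hypothetical setup for a 12-worker run (3 nodes x 4 GPUs each);
    the 4-worker configuration corresponds to num_nodes=1."""
    os.environ["MASTER_ADDR"] = master_addr          # placeholder address of node 0
    os.environ["MASTER_PORT"] = master_port          # placeholder rendezvous port
    world_size = num_nodes * gpus_per_node           # total number of workers (12 or 4)
    rank = node_rank * gpus_per_node + local_rank    # globally unique worker id
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)                # bind this process to one GPU
    return rank, world_size
```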
LSGD Paper
Paper: Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models
Authors: Yunfei Teng*, Wenbo Gao*, Francois Chalus, Anna Choromanska, Donald Goldfarb, Adrian Weller
Arxiv: https://arxiv.org/abs/1905.10395
Github: https://github.com/yunfei-teng/LSGD
Poster: https://github.com/yunfei-teng/LSGD/blob/master/docs/LSGD_Poster_NeurIPS2019.pdf