
Introduction

Distributed Training

The goal of distributed training is to significantly reduce the training time of deep learning models without degrading their performance.

Pipelines

(Figure: overview of the distributed training pipeline)

Motivation

We consider distributed optimization under communication constraints for training deep learning models. Our method differs from the state-of-the-art parameter-averaging scheme EASGD in several ways:

  • an objective formulation that does not change the location of stationary points compared to the original optimization problem
  • avoiding the convergence slowdown caused by pulling local workers, which may be descending toward different local minima, to the average of their parameters (see the sketch after this list)
  • breaking the curse of symmetry, i.e. the phenomenon of being trapped in poorly generalizing sub-optimal solutions in symmetric non-convex landscapes
  • communication efficiency and alignment with current hardware architectures
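
The core idea behind LSGD is to pull each worker toward the current leader, i.e. the worker with the lowest loss, rather than toward the average of all workers as EASGD does. The snippet below is a minimal single-process sketch of that leader-pulling update under a synchronous assumption; the function names and the parameters `lr` and `pull` are illustrative placeholders, not the paper's actual multi-GPU implementation.

```python
import numpy as np

def lsgd_step(params, loss, grad, lr=0.1, pull=0.1):
    """One synchronous round over a list of per-worker parameter vectors (sketch)."""
    losses = [loss(p) for p in params]
    leader = params[int(np.argmin(losses))]  # the best current worker is the leader
    new_params = []
    for p in params:
        g = grad(p)
        # gradient step plus a pull toward the leader's parameters
        new_params.append(p - lr * g - lr * pull * (p - leader))
    return new_params

# Toy usage on a quadratic bowl, f(x) = ||x||^2 / 2.
loss = lambda x: 0.5 * float(x @ x)
grad = lambda x: x
workers = [np.random.randn(10) for _ in range(4)]
for _ in range(100):
    workers = lsgd_step(workers, loss, grad)
print([round(loss(w), 6) for w in workers])
```

Because every worker is pulled toward the same best-performing point rather than a consensus average, the stationary points of the original objective are preserved and no worker is dragged toward a poor "in-between" region.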

Dealing with the curse of symmetry

  • From EA(S)GD to L(S)GD (figure)
  • An illustration of the "curse of symmetry" (figure; a toy numeric sketch follows this list)
  • An example of a highly non-convex problem (figure)
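
To make the symmetry issue concrete, here is a toy numeric sketch (not from the paper): on the symmetric double-well f(x) = (x² − 1)², two workers sitting at the two global minima average to a point that is a local maximum, whereas the leader of the two remains at a minimum.

```python
import numpy as np

# Toy illustration of the "curse of symmetry": for f(x) = (x^2 - 1)^2,
# averaging two workers at the minima x = -1 and x = +1 gives x = 0,
# a local maximum; selecting the leader keeps a worker at a minimum.
f = lambda x: (x**2 - 1.0)**2

workers = np.array([-1.0, 1.0])          # each worker at a different minimum
center = workers.mean()                  # EASGD-style consensus point
leader = workers[np.argmin(f(workers))]  # LSGD-style leader (ties broken arbitrarily)

print(f(workers))  # [0. 0.]  both workers are at global minima
print(f(center))   # 1.0      the average is a local maximum of f
print(f(leader))   # 0.0      the leader stays at a minimum
```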

Benchmarks

4 workers (a single server with 4 GTX-1080 GPUs)

(Benchmark results figure)

12 workers (3 servers, each with 4 GTX-1080 GPUs)

(Benchmark results figure)

LSGD Paper

Paper: Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models

Authors: Yunfei Teng*, Wenbo Gao*, Francois Chalus, Anna Choromanska, Donald Goldfarb, Adrian Weller

Arxiv: https://arxiv.org/abs/1905.10395

Github: https://github.com/yunfei-teng/LSGD

Poster: https://github.com/yunfei-teng/LSGD/blob/master/docs/LSGD_Poster_NeurIPS2019.pdf

Talk: 14th Annual Machine Learning Symposium Spotlight Talk