A simple convergence proof of Adam and Adagrad

Abstract: We provide a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms when applied to smooth (possibly non-convex) objective functions with bounded gradients. We show that in expectation, the squared norm of the objective gradient averaged over the trajectory has an upper bound which is explicit in the constants of the problem, the parameters of the optimizer and the total number of iterations $N$. This bound can be made arbitrarily small: Adam with a learning rate $\alpha=1/\sqrt{N}$ and a momentum parameter on squared gradients $\beta_2 = 1 - 1/N$ achieves the same rate of convergence $O(\ln(N)/\sqrt{N})$ as Adagrad. Finally, we obtain the tightest dependency on the heavy ball momentum among all previous convergence bounds for non-convex Adam and Adagrad, improving from $O((1 - \beta_1)^{-3})$ to $O((1 - \beta_1)^{-1})$. Our technique also improves the best known dependency for standard SGD by a factor $1-\beta_1$.
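
The hyperparameter choice quoted in the abstract ($\alpha=1/\sqrt{N}$, $\beta_2 = 1 - 1/N$) can be illustrated with a minimal sketch. It assumes the textbook Adam update with bias correction, which is not necessarily the exact variant analyzed in the paper. The names `adam_average_sq_grad` and `toy_grad` are hypothetical, and the toy objective $f(x) = \sum_i \ln(1 + x_i^2)$ is chosen here only because it is smooth, non-convex and has bounded gradients.

```python
import math


def toy_grad(x):
    """Gradient of the toy objective f(x) = sum_i ln(1 + x_i^2).

    Assumption for illustration only: this objective is smooth, non-convex,
    and each gradient coordinate is bounded by 1 in absolute value.
    """
    return [2.0 * xi / (1.0 + xi * xi) for xi in x]


def adam_average_sq_grad(grad, x0, num_steps, beta1=0.9, eps=1e-8):
    """Run textbook Adam for num_steps iterations.

    Uses alpha = 1/sqrt(N) and beta2 = 1 - 1/N as quoted in the abstract.
    Returns the final iterate and the squared gradient norm averaged over
    the trajectory, which is the quantity the abstract's bound refers to.
    """
    alpha = 1.0 / math.sqrt(num_steps)  # learning rate alpha = 1/sqrt(N)
    beta2 = 1.0 - 1.0 / num_steps       # beta_2 = 1 - 1/N
    x = list(x0)
    m = [0.0] * len(x)  # running average of gradients (heavy-ball momentum)
    v = [0.0] * len(x)  # running average of squared gradients
    total_sq_norm = 0.0
    for t in range(1, num_steps + 1):
        g = grad(x)
        total_sq_norm += sum(gi * gi for gi in g)
        for i, gi in enumerate(g):
            m[i] = beta1 * m[i] + (1.0 - beta1) * gi
            v[i] = beta2 * v[i] + (1.0 - beta2) * gi * gi
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias-corrected first moment
            v_hat = v[i] / (1.0 - beta2 ** t)  # bias-corrected second moment
            x[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return x, total_sq_norm / num_steps


if __name__ == "__main__":
    for n in (100, 10_000):
        _, avg = adam_average_sq_grad(toy_grad, [2.0, -1.5], n)
        print(n, avg)
```

On this deterministic toy problem the averaged squared gradient norm should shrink as $N$ grows, in line with the qualitative statement of the abstract. The sketch does not attempt to verify the $O(\ln(N)/\sqrt{N})$ rate or the dependency on $\beta_1$.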

Alexandre Défossez, Léon Bottou, Francis Bach and Nicolas Usunier: A simple convergence proof of Adam and Adagrad, arXiv preprint arXiv:2003.02395, 2020.

@article{defossez-2020,
  title = {A simple convergence proof of {Adam} and {Adagrad}},
  author = {D{\'e}fossez, Alexandre and Bottou, L{\'e}on and Bach, Francis and Usunier, Nicolas},
  journal = {arXiv preprint arXiv:2003.02395},
  year = {2020},
  url = {http://leon.bottou.org/papers/defossez-2020},
}