===== A simple convergence proof of Adam and Adagrad =====

//Abstract//: We provide a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms when applied to smooth (possibly non-convex) objective functions with bounded gradients. We show that in expectation, the squared norm of the objective gradient averaged over the trajectory has an upper bound which is explicit in the constants of the problem, the parameters of the optimizer, and the total number of iterations $N$. This bound can be made arbitrarily small: Adam with a learning rate $\alpha = 1/\sqrt{N}$ and a momentum parameter on squared gradients $\beta_2 = 1 - 1/N$ achieves the same rate of convergence $O(\ln(N)/\sqrt{N})$ as Adagrad. Finally, we obtain the tightest dependency on the heavy-ball momentum among all previous convergence bounds for non-convex Adam and Adagrad, improving from $O((1-\beta_1)^{-3})$ to $O((1-\beta_1)^{-1})$. Our technique also improves the best known dependency for standard SGD by a factor $1-\beta_1$.

Alexandre Défossez, Léon Bottou, Francis Bach and Nicolas Usunier: **A simple convergence proof of Adam and Adagrad**, //arXiv preprint arXiv:2003.02395//, 2020.

[[http://leon.bottou.org/publications/djvu/defossez-2020.djvu|defossez-2020.djvu]]
[[http://leon.bottou.org/publications/pdf/defossez-2020.pdf|defossez-2020.pdf]]
[[http://leon.bottou.org/publications/psgz/defossez-2020.ps.gz|defossez-2020.ps.gz]]

@article{defossez-2020,
  title = {A simple convergence proof of {Adam} and {Adagrad}},
  author = {D{\'e}fossez, Alexandre and Bottou, L{\'e}on and Bach, Francis and Usunier, Nicolas},
  journal = {arXiv preprint arXiv:2003.02395},
  year = {2020},
  url = {http://leon.bottou.org/papers/defossez-2020},
}
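
As a concrete illustration of the hyperparameter scaling stated in the abstract, the following minimal sketch (not from the paper's code; it assumes PyTorch and a hypothetical helper name) configures Adam for a fixed horizon of $N$ iterations with $\alpha = 1/\sqrt{N}$ and $\beta_2 = 1 - 1/N$.

<code python>
import math
import torch

def make_adam_for_horizon(params, N, beta1=0.9):
    """Configure torch.optim.Adam with the scaling suggested by the abstract:
    learning rate alpha = 1/sqrt(N) and beta_2 = 1 - 1/N for a horizon of N steps.
    The helper name and default beta1 are illustrative choices, not the paper's."""
    return torch.optim.Adam(
        params,
        lr=1.0 / math.sqrt(N),          # alpha = 1/sqrt(N)
        betas=(beta1, 1.0 - 1.0 / N),   # beta_2 = 1 - 1/N
    )

# Example usage with a placeholder model and a horizon of N = 10_000 iterations:
# optimizer = make_adam_for_horizon(model.parameters(), N=10_000)
</code>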