*Abstract*:
We provide a simple proof of convergence
covering both the Adam and Adagrad adaptive optimization algorithms when applied
to smooth (possibly non-convex) objective
functions with bounded gradients. We show
that in expectation, the squared norm of the
objective gradient averaged over the trajectory has an upper-bound which is explicit
in the constants of the problem, parameters
of the optimizer and the total number of iterations N. This bound can be made arbitrarily small: Adam with a learning rate $\alpha=1/\sqrt{N}$
and a momentum parameter on
squared gradients $\beta_2 = 1 - 1/N$ achieves
the same rate of convergence $O(\ln(N)/\sqrt{N})$
as Adagrad. Finally, we obtain the tightest dependency on the heavy ball momentum
among all previous convergence bounds for
non-convex Adam and Adagrad, improving
from $O((1 - \beta_1)^{-3})$ to $O((1 - \beta_1)^{-1})$.
Our technique also improves the best known dependency for standard SGD by a factor $1-β_1$.

Alexandre Défossez, Leon Bottou, Francis Bach and Nicolas Usunier: **A Simple Convergence Proof of Adam and Adagrad**, *Transactions on Machine Learning Research*, 2022.

tmlr-defossez-2022.djvu tmlr-defossez-2022.pdf tmlr-defossez-2022.ps.gz

@article{defossez-2022, title = {A Simple Convergence Proof of Adam and Adagrad}, author = {D{\'e}fossez, Alexandre and Bottou, Leon and Bach, Francis and Usunier, Nicolas}, journal = {Transactions on Machine Learning Research}, issn = {2835-8856}, year = {2022}, url = {http://leon.bottou.org/papers/defossez-2022}, }

papers/defossez-2022.txt · Last modified: 2023/08/29 05:54 by leonb