Geometrical Insights for Implicit Generative Modeling

Abstract: Learning algorithms for implicit generative models can optimize a variety of criteria that measure how the data distribution differs from the implicit model distribution, including the Wasserstein distance, the Energy distance, and the Maximum Mean Discrepancy criterion. A careful look at the geometries induced by these distances on the space of probability measures reveals interesting differences. In particular, we can establish surprising approximate global convergence guarantees for the 1-Wasserstein distance, even when the parametric generator has a nonconvex parametrization.

Léon Bottou, Martin Arjovsky, David Lopez-Paz and Maxime Oquab: Geometrical Insights for Implicit Generative Modeling, Braverman Readings in Machine Learning: Key Ideas from Inception to Current State, 229–268, edited by Lev Rozonoer, Boris Mirkin and Ilya Muchnik, LNAI Vol. 11100, Springer, 2018.

geometry-2018.djvu geometry-2018.pdf geometry-2018.ps.gz

@incollection{bottou-geometry-2018,
  author = {Bottou, L{\'e}on and Arjovsky, Martin and Lopez-Paz, David and Oquab, Maxime},
  title = {Geometrical Insights for Implicit Generative Modeling},
  booktitle = {Braverman Readings in Machine Learning: Key Ideas from Inception to Current State},
  editor = {Rozonoer, Lev and Mirkin, Boris and Muchnik, Ilya},
  series = {LNAI},
  volume = {11100},
  publisher = {Springer},
  year = {2018},
  pages = {229--268},
  url = {http://leon.bottou.org/papers/bottou-geometry-2018},
}

Erratum

Just before Section 6.2, the paper claims:

One particularly striking aspect of this result is that it does not depend on the parametrization of the family $F$. Whether the cost function $C(\theta) = f(G_\theta\#\mu_z)$ is convex or not is irrelevant: as long as the family $F$ and the cost function $f$ are convex with respect to a well-chosen set of curves, the level sets of the cost function $C(\theta)$ will be connected, and there will be a nonincreasing path connecting any starting point $\theta_0$ to a global optimum $\theta^*$.

This claim holds only when the parametrization is itself continuous with respect to the distance between the induced distributions, a property that is not necessarily easy to establish.
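
A toy example, which is not taken from the paper, may help show why some condition on the parametrization is needed. Assume a hypothetical one-parameter generator that always outputs the constant $g(\theta) = (\theta^2 - 1)^2$, so that $G_\theta\#\mu_z$ is the Dirac mass $\delta_{g(\theta)}$, and take $f(P) = W_1(P, \delta_0)$, which gives $f(\delta_x) = x$ for $x \geq 0$. The Dirac masses $\delta_x$ are joined by the curves $t \mapsto \delta_{(1-t)x + ty}$, along which $f$ is linear, so each sublevel set of $f$ on this family is a single segment. The sketch below counts connected pieces of the sublevel set at level 0.5 on a grid, first in the space of distributions and then in parameter space. The names `g`, `cost` and `count_components` are illustrative assumptions.

```python
# Toy illustration (hypothetical, not from the paper).
# Generator: G_theta(z) = (theta^2 - 1)^2 for every z, so the model
# distribution is the Dirac mass at g(theta).

def g(theta):
    """Location of the Dirac mass produced by parameter theta."""
    return (theta * theta - 1.0) ** 2

def cost(theta):
    """C(theta) = W1(delta_{g(theta)}, delta_0) = |g(theta) - 0|."""
    return abs(g(theta))

def count_components(flags):
    """Count maximal runs of True values along a one-dimensional grid."""
    runs, previous = 0, False
    for flag in flags:
        if flag and not previous:
            runs += 1
        previous = flag
    return runs

level = 0.5
n = 4000
xs = [4.0 * i / n for i in range(n + 1)]             # Dirac locations in [0, 4]
thetas = [-2.0 + 4.0 * i / n for i in range(n + 1)]  # parameters in [-2, 2]

# Sublevel set among the distributions: {delta_x : f(delta_x) = x <= level}.
components_in_distribution_space = count_components([x <= level for x in xs])

# Sublevel set among the parameters: {theta : C(theta) <= level}.
components_in_parameter_space = count_components(
    [cost(t) <= level for t in thetas]
)

print(components_in_distribution_space, components_in_parameter_space)
# prints "1 2"
```

In this sketch the sublevel set is a single piece among the distributions but splits into two intervals, one around $\theta = -1$ and one around $\theta = 1$, in parameter space: distributions arbitrarily close to $\delta_0$ are produced by parameters that are far apart. Both pieces still contain a global optimum here, so the example only illustrates that connectedness of level sets does not transfer automatically from the distributions to the parameters.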