Geometrical Insights for Implicit Generative Modeling
Abstract: Learning algorithms for implicit generative models can optimize a variety of criteria that measure how the data distribution differs from the implicit model distribution, including the Wasserstein distance, the Energy distance, and the Maximum Mean Discrepancy criterion. A careful look at the geometries induced by these distances on the space of probability measures reveals interesting differences. In particular, we can establish surprising approximate global convergence guarantees for the 1-Wasserstein distance, even when the parametric generator has a nonconvex parametrization.
@incollection{bottou-geometry-2018,
author = {Bottou, L{\'e}on and Arjovsky, Martin and Lopez-Paz, David and Oquab, Maxime},
title = {Geometrical Insights for Implicit Generative Modeling},
booktitle = {Braverman Readings in Machine Learning: Key Ideas from Inception to Current State},
editor = {Lev Rozonoer and Boris Mirkin and Ilya Muchnik},
series = {LNAI Vol. 11100},
publisher = {Springer},
year = {2018},
pages = {229--268},
url = {http://leon.bottou.org/papers/bottou-geometry-2018},
}
Erratum 1
Just before section 6.2, the paper claims:
One particularly striking aspect of this result is that it does not depend on the parametrization of the family $\mathcal{F}$. Whether the cost function $C(\theta) = f(G_\theta\#\mu_z)$ is convex or not is irrelevant: as long as the family $\mathcal{F}$ and the cost function $f$ are convex with respect to a well-chosen set of curves, the level sets of the cost function $C(\theta)$ will be connected, and there will be a nonincreasing path connecting any starting point $\theta_0$ to a global optimum $\theta^*$.
Just like the curves that define our notion of convexity, this nonincreasing path is a continuous curve in the space of distributions $\mathcal{F}\subset\mathcal{P}_{\mathcal{X}}$ equipped with its own topology. However, this does not mean that this path corresponds to a continuous curve in the parameter space, for instance because the parametrization is non-injective. Ruling out such a scenario is far from simple.
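As a purely illustrative sketch, not taken from the paper, consider a parameter $\theta\in[0,1)$ and a generator $G_\theta$ that maps a fixed latent point to $(\cos 2\pi\theta,\,\sin 2\pi\theta)$, so that
\[ G_\theta\#\mu_z \;=\; \delta_{(\cos 2\pi\theta,\;\sin 2\pi\theta)}~. \]
The map $\theta\mapsto G_\theta\#\mu_z$ is a continuous bijection onto a circle of Dirac measures, but its inverse is discontinuous at the seam $\theta=0$: a path of distributions that moves the Dirac mass through the point $(1,0)$ the short way, say from angle $2\pi\times 0.9$ to angle $2\pi\times 0.1$, is continuous for the Wasserstein distance, yet any parameter curve realizing it must jump from near $1$ to near $0$. This toy obstruction comes from a discontinuous inverse rather than from non-injectivity, but it shows why lifting the nonincreasing path back to parameter space requires additional assumptions.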
Erratum 2
The constant factors relating the negative definite kernel $d(x,y)$ to the positive definite kernel $K(x,y)$ are mixed up. The simplest fix consists of eliminating the $1/2$ factor in equation (19):
\[ K_d(x,y) \stackrel{\Delta}{=} d(x,x_0) + d(y,y_0) - d(x,y)~. \]
This change makes theorem 2.17 work as written. As a consequence, the factor $2$ in the proof of proposition 2.16 goes away and one must insert a $1/2$ factor in the definition
\[ d_K(x,y) \stackrel{\Delta}{=} \frac12 \| \Phi_x-\Phi_y \|^2_{\mathcal{H}} = \frac12 K(x,x) + \frac12 K(y,y) - K(x,y) \]
to ensure that $d_{K_d}=d$ and make equation (22) work. None of this changes the conclusion.
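For the reader's convenience, here is the one-line check that the corrected constants indeed give $d_{K_d}=d$; it is only a verification of the formulas above, using the fact that a negative definite kernel satisfies $d(x,x)=0$. The corrected equation (19) then gives $K_d(x,x)=2\,d(x,x_0)$ and $K_d(y,y)=2\,d(y,y_0)$, hence
\[ d_{K_d}(x,y) \;=\; \tfrac12 K_d(x,x) + \tfrac12 K_d(y,y) - K_d(x,y) \;=\; d(x,x_0) + d(y,y_0) - \bigl( d(x,x_0) + d(y,y_0) - d(x,y) \bigr) \;=\; d(x,y)~. \]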
