*Ann. Math. Statistics*25(3):463–483, 1954

*Excerpt*:
The main point of this paper is to show that, in situations where the supply
of training samples is essentially unlimited, a well designed on-line
algorithm converges toward the minimum of the *expected cost* just as fast
as any batch algorithm. In those situations, the convergence speed is mainly
limited by the fact that some informative examples have not yet been seen
rather than by the fact that the examples already seen have not been fully
exploited by the minimization process.

Léon Bottou and Yann LeCun: **On-line Learning for Very Large Datasets**, *Applied Stochastic Models in Business and Industry*, 21(2):137-151, 2005.

@article{bottou-lecun-2004a, author = {Bottou, L\'{e}on and {LeCun}, Yann}, title = {On-line Learning for Very Large Datasets}, journal = {Applied Stochastic Models in Business and Industry}, year = {2005}, volume = {21}, number = {2}, pages = {137-151}, url = {http://leon.bottou.org/papers/bottou-lecun-2004a}, }

The ASMB version of this work contains the complete proof. The NIPS version (Bottou and LeCun, 2004) was written several months after the ASMB. It contains only a proof sketch but reports experimental results. It also offers a better discussion of the previous results obtained by Murata and Amari (1998) which are in fact much more general than I initially realized. Relative to these results, this work contains three contributions: (a) spelling out the computational consequences of the efficiency results, (b) expressing the solution of the batch learning problems as a stochastic process that is fundamentally similar to a second order stochastic gradient algorithm, and (c) providing a more rigorous proof, taking into account, for instance, how the gain matrix approximates the inverse Hessian.

The core of the proof is a very interesting lemma with a rather complex proof. I did now know at the time that this lemma is in fact an extension of Chung's lemma^{1)}: whereas Chung's lemma gives asymptotic bounds for sequences defined by the recursion

\[ u_{t+1} = (1-\frac{\alpha}{t}) u_t + \frac{\beta}{t^{p+1}} \,,\]

our lemma gives an asymptotic equivalent for sequences defined by a recursion

\[ u_{t+1} = \left(1-\frac{\alpha}{t}+ o\left(\frac{1}{t}\right)\right) u_t + \frac{\beta}{t^2} + o\left(\frac{1}{t^2}\right) \]

that contains unspecified lower order terms. We only give the result for $\alpha>1$ (which in fact is the more complex case.)

Chung K. L. : On Stochastic Approximations, *Ann. Math. Statistics* 25(3):463–483, 1954