# Learning with Stochastic Gradient Descent

## Basics

Many numerical learning algorithms amount
to optimizing a cost function that can be
expressed as an average over the training examples.
The *loss function* measures how well (or how poorly)
the learning system performs on each example.
The *cost function* is then the average of the
loss function measured over all training examples,
possibly augmented with capacity control terms.
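In symbols (the notation below is illustrative, not taken from the original text): if $\ell(f_w(x_i), y_i)$ denotes the loss of a model $f_w$ on training example $(x_i, y_i)$, the cost function is

```latex
C(w) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_w(x_i),\, y_i\big) \;+\; \lambda\,\Omega(w)
```

where the optional term $\lambda\,\Omega(w)$ stands for capacity control, for example a norm penalty on the parameters.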

Computing such an average takes a time
proportional to the number of examples *n*.
This factor appears in the running time
of every optimization algorithm that considers the
complete cost function.
*Stochastic gradient descent* instead
updates the learning system on the basis
of the loss function measured for a single example.
Such an algorithm works because, on average,
these updates have the same effect as an update
computed on the full cost.
Although convergence is much noisier,
the elimination of the factor *n* in the
cost of each update can be a huge advantage for
large-scale problems.
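As a sketch of the idea (not any particular published implementation; the squared-loss model, learning rate, and toy data below are illustrative assumptions), each update touches a single example, so its cost is independent of *n*:

```python
import random

random.seed(0)  # reproducible shuffling

def sgd(examples, grad_loss, w, lr=0.05, epochs=200):
    """Plain stochastic gradient descent: one parameter update per example."""
    for _ in range(epochs):
        random.shuffle(examples)            # visit examples in random order
        for x, y in examples:
            g = grad_loss(w, x, y)          # gradient of the loss on ONE example
            # update cost is O(dim), independent of the number of examples n
            w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def grad_sq(w, x, y):
    """Gradient of the squared loss for a 1-D linear model y ~ w[0]*x + w[1]."""
    err = w[0] * x + w[1] - y
    return [2 * err * x, 2 * err]

# Noiseless toy data generated by y = 3x + 1
data = [(x, 3.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
w = sgd(data, grad_sq, [0.0, 0.0])
```

On this noiseless problem the per-example gradients all vanish at the minimum, so even a constant step size drives the parameters toward `[3.0, 1.0]`; with noisy data, a decreasing step size would be needed to damp the update noise.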

Stochastic Gradient Descent has been historically associated with back-propagation algorithms in multilayer neural networks. These nonlinear nonconvex problems can be very difficult. Therefore it is useful to see how Stochastic Gradient Descent performs on simple linear and convex problems such as linear Support Vector Machines (SVMs) or Conditional Random Fields (CRFs). Source code and benchmarks are available on the page about Stochastic Gradient Examples. The benchmarks show very clearly the superiority of Stochastic Gradient Descent on large learning problems.
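For the linear SVM case, a minimal Pegasos-style SGD on the regularized hinge loss can be sketched as follows; this is an illustrative sketch, not the benchmarked source code from the examples page, and the toy data, the `1/(lam*t)` step-size schedule, and the omission of a bias term are simplifying assumptions:

```python
import random

def svm_sgd(data, lam=0.1, epochs=50, dim=2):
    """SGD on lam/2*||w||^2 + hinge loss, one example per update (Pegasos-style)."""
    random.seed(0)
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:                   # labels y in {-1, +1}
            t += 1
            lr = 1.0 / (lam * t)            # decreasing step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # subgradient step: shrink from the L2 penalty ...
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:                # ... plus hinge term when margin violated
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

# Tiny linearly separable toy set (separable through the origin)
data = [([2.0, 2.0], 1), ([1.5, 2.5], 1), ([-2.0, -1.0], -1), ([-1.0, -2.0], -1)]
w = svm_sgd(data[:])
```

Because the hinge loss is convex, this noisy single-example procedure still converges to the minimizer of the regularized objective; that convexity is precisely what makes linear SVMs a clean testbed for studying SGD.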

## Research

Many learning algorithms of the early nineties can be expressed as stochastic gradient descent, including perceptrons, adalines, multilayer networks, k-means, learning vector quantization, etc. Studying stochastic optimization allowed me to characterize their convergence and their speed in a unified framework. This theoretical background became a solid basis to study the practical issues surrounding stochastic gradient descent, and for designing learning algorithms for new categories of problems.

## Lectures

- The BackPropagation CookBook, NIPS workshops, Denver, 1996

- Online Learning and Stochastic Approximations, Machine Learning Summer School, Tübingen, 2003

- The Tradeoffs of Large Scale Learning, NIPS tutorials, Vancouver, 2007

## Publications

**Stochastic Gradient Learning in Neural Networks**,

*Proceedings of Neuro-Nîmes 91*, EC2, Nîmes, France, 1991.

**Une Approche théorique de l'Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole**,

*Orsay, France, 1991.*

**Online Algorithms and Stochastic Approximations**,

*Online Learning and Neural Networks*, Edited by David Saad, Cambridge University Press, Cambridge, UK, 1998.

**Efficient Backprop**,

*Neural Networks, Tricks of the Trade*, Lecture Notes in Computer Science LNCS 1524, Springer Verlag, 1998.

**Gradient Based Learning Applied to Document Recognition**,

*Proceedings of IEEE*, 86(11):2278-2324, 1998.

**Stochastic Approximations and Efficient Learning**,

*The Handbook of Brain Theory and Neural Networks*, Second edition, Edited by M. A. Arbib, The MIT Press, Cambridge, MA, 2002.

**Stochastic Learning**,

*Advanced Lectures on Machine Learning*, (LNAI 3176):146-168, Edited by Olivier Bousquet and Ulrike von Luxburg, Lecture Notes in Artificial Intelligence, Springer Verlag, Berlin, 2004.

**Large Scale Online Learning**,

*Advances in Neural Information Processing Systems 16*, Edited by Sebastian Thrun, Lawrence Saul and Bernhard Schölkopf, MIT Press, Cambridge, MA, 2004.

**On-line Learning for Very Large Datasets**,

*Applied Stochastic Models in Business and Industry*, 21(2):137-151, 2005.

**The Tradeoffs of Large Scale Learning**,

*Advances in Neural Information Processing Systems*, 20, MIT Press, Cambridge, MA, 2008.

**Large-Scale Machine Learning with Stochastic Gradient Descent**,

*Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'2010)*, 177–187, Edited by Yves Lechevallier and Gilbert Saporta, Paris, France, August 2010, Springer.