Learning with Stochastic Gradient Descent
Many numerical learning algorithms amount
to optimizing a cost function that can be
expressed as an average over the training examples.
The loss function measures how well (or how poorly)
the learning system performs on each example.
The cost function is then the average of the
loss function measures on all training examples,
possibly augmented with capacity control terms.
Computing such an average takes a time
proportional to the number of examples n.
This constant always appears in the running time
of all optimization algorithm that considers the
complete cost function.
Stochastic gradient descent instead
updates the learning system on the basis
of the loss function measured for a single example.
Such an algorithm works because the averaged
effect of these updates is the same.
Although the convergence is much more noisy,
the elimination of the constant n in the
computing cost can be a huge advantage for
Stochastic Gradient Descent has been historically associated
with back-propagation algorithms in multilayer neural networks.
These nonlinear nonconvex problems can be very difficult.
Therefore it is useful to see how Stochastic Gradient Descent
performs on simple linear and convex problems such
as linear Support Vector Machines (SVMs)
or Conditional Random Fields (CRFs).
Source code and benchmarks are available on
the page about Stochastic Gradient Examples.
The benchmarks show very clearly the superiority
of Stochastic Gradient Descent on large learning problems.
Many learning algorithms of the early nineties can be expressed as stochastic gradient descent,
including perceptrons, adalines, multilayer networks, k-means, learning vector quantization, etc.
Studying stochastic optimization allowed me to characterize their convergence
and their speed
in a unified framework.
This theoretical background became a solid basis to study
the practical issues surrounding stochastic gradient descent,
and for designing learning algorithms for new categories of problems.
Léon Bottou: Stochastic Gradient Learning in Neural Networks
, Proceedings of Neuro-Nîmes 91
, EC2, Nimes, France, 1991.
Léon Bottou: Une Approche théorique de l'Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole
, Orsay, France, 1991.
Léon Bottou: Online Algorithms and Stochastic Approximations
, Online Learning and Neural Networks
, Edited by David Saad, Cambridge University Press, Cambridge, UK, 1998.
Yann Le Cun, Léon Bottou, Genevieve B. Orr and Klaus-Robert Müller: Efficient Backprop
, Neural Networks, Tricks of the Trade
, Lecture Notes in Computer Science LNCS 1524, Springer Verlag, 1998.
Yann Le Cun, Léon Bottou, Yoshua Bengio and Patrick Haffner: Gradient Based Learning Applied to Document Recognition
, Proceedings of IEEE
, 86(11):2278-2324, 1998.
Léon Bottou and Noboru Murata: Stochastic Approximations and Efficient Learning
, The Handbook of Brain Theory and Neural Networks, Second edition,
, Edited by M. A. Arbib, The MIT Press, Cambridge, MA, 2002.
Léon Bottou: Stochastic Learning
, Advanced Lectures on Machine Learning
, (LNAI 3176):146-168, Edited by Olivier Bousquet and Ulrike von Luxburg, Lecture Notes in Artificial Intelligence, Springer Verlag, Berlin, 2004.
Léon Bottou and Yann LeCun: Large Scale Online Learning
, Advances in Neural Information Processing Systems 16
, Edited by Sebastian Thrun, Lawrence Saul and Bernhard Schölkopf, MIT Press, Cambridge, MA, 2004.
Léon Bottou and Yann LeCun: On-line Learning for Very Large Datasets
, Applied Stochastic Models in Business and Industry
, 21(2):137-151, 2005.
Léon Bottou and Olivier Bousquet: The Tradeoffs of Large Scale Learning
Advances in Neural Information Processing Systems
MIT Press, Cambridge, MA, 2008.
Léon Bottou: Large-Scale Machine Learning with Stochastic Gradient Descent
, Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'2010)
, 177–187, Edited by Yves Lechevallier and Gilbert Saporta, Paris, France, August 2010, Springer.