Many numerical learning algorithms amount to optimizing a cost function that can be expressed as an average over the training examples. The loss function measures how well (or how poorly) the learning system performs on each example. The cost function is then the average of the loss function measured on all training examples, possibly augmented with capacity control terms.
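In the notation assumed here (not taken from this page), with training examples $z_1, \dots, z_n$, parameters $w$, and per-example loss $\ell$, this cost can be written as

$$ C(w) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell(z_i, w) \;+\; \lambda\,\Omega(w), $$

where $\Omega(w)$ is an optional capacity control (regularization) term with weight $\lambda$.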
Computing such an average takes time proportional to the number of examples n. This constant appears in the running time of any optimization algorithm that considers the complete cost function. Stochastic gradient descent instead updates the learning system on the basis of the loss function measured for a single example. Such an algorithm works because, averaged over the random choice of example, the effect of these updates is the same as that of a full gradient step. Although the convergence is much noisier, the elimination of the constant n from the computing cost can be a huge advantage for large-scale problems.
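A minimal sketch of the contrast, assuming a linear least-squares model and NumPy (neither is specified on this page): the batch step averages gradients over all n examples, while the stochastic step uses a single example whose gradient equals the full gradient in expectation.

```python
import numpy as np

def batch_gradient_step(w, X, y, lr):
    # One step of ordinary gradient descent on the squared loss:
    # the gradient averages over all n examples, so each step costs O(n).
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def sgd_step(w, x_i, y_i, lr):
    # One stochastic step: the gradient of the loss on a single example.
    # Its expectation over a uniformly random example equals the full
    # gradient, which is why the averaged effect of the updates matches.
    grad = (x_i @ w - y_i) * x_i
    return w - lr * grad

# Usage: one SGD pass over a random permutation of a synthetic training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)
w = np.zeros(5)
for i in rng.permutation(len(y)):
    w = sgd_step(w, X[i], y[i], lr=0.01)
```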
Stochastic Gradient Descent has historically been associated with back-propagation algorithms in multilayer neural networks, which pose very difficult nonlinear nonconvex problems. It is therefore useful to see how Stochastic Gradient Descent performs on simple linear and convex problems such as linear Support Vector Machines (SVMs) or Conditional Random Fields (CRFs); a sketch of the linear SVM case follows below. Source code and benchmarks are available on the page about Stochastic Gradient Examples. The benchmarks clearly show the superiority of Stochastic Gradient Descent on large learning problems.
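For concreteness, here is a minimal sketch of SGD applied to a linear SVM. The hinge-loss objective is standard, but the 1/(lam*t) learning-rate schedule and all names are assumptions made here, not the benchmarked implementation from the examples page.

```python
import numpy as np

def svm_sgd(X, y, lam=1e-4, epochs=5, seed=0):
    """SGD for a linear SVM: minimize lam/2 * ||w||^2 + average hinge loss.

    Labels y must be in {-1, +1}. The 1/(lam*t) step size is a common
    choice for this strongly convex objective (an assumption here).
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    t = 1
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            lr = 1.0 / (lam * t)
            if y[i] * (X[i] @ w) < 1:
                # Example violates the margin: shrink w and move toward it.
                w = (1 - lr * lam) * w + lr * y[i] * X[i]
            else:
                # Margin satisfied: only the regularizer contributes.
                w = (1 - lr * lam) * w
            t += 1
    return w
```

Because the objective is convex, each cheap single-example step still drives the parameters toward the unique minimum, which is what makes this setting a clean test bed for SGD.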
Many learning algorithms of the early nineties can be expressed as stochastic gradient descent, including perceptrons, adalines, multilayer networks, k-means, learning vector quantization, etc. Studying stochastic optimization allowed me to characterize their convergence and their speed in a unified framework. This theoretical background became a solid basis for studying the practical issues surrounding stochastic gradient descent and for designing learning algorithms for new categories of problems.