Stochastic Gradient Descent (SGD) has been historically associated with back-propagation algorithms in multilayer neural networks. These nonlinear nonconvex problems can be very difficult. Therefore it is useful to see how Stochastic Gradient Descent performs on simple linear and convex problems such as linear Support Vector Machines (SVMs) or Conditional Random Fields (CRFs).
Disclaimer 1: These are just toy examples for illustrating the properties of SGD. This code is designed for clarity rather than speed. I do not advocate using CRFs when simpler solutions exist.
Disclaimer 2: The following results have been obtained using plain stochastic gradient. This page does not advocate fancy modifications of an old algorithm, it just shows that the old algorithm itself is amazingly fast!
“sgd/data/rcv1”.“sgd/data/conll2000”.“make” in the toplevel directory. The compilation requires the ZLib library. Most Linux distributions provide this by default. Things should be very similar under other Unix operating systems.
The SVM programs are located in directory “sgd/svm”.
See the file “sgd/svm/README” for more details.
Our SVM benchmark is the recognition of
the RCV1 documents belonging to the class CCAT.
In order to have a large training set we swap the official
RCV1-V2
training and test sets.
The features are the 47152 tokens
belonging to both the training and test sets.
| RCV1-V2 (class “ccat” versus rest) | |
|---|---|
| Training examples | 781265 |
| Testing examples | 23149 |
| Parameters | 47152 |
Program “preprocess” reads the RCV1-V2 token files from directory “data/rcv1” and computes the TF/IDF features on the basis of the our training set. This programs produces four files. The gzipped files “train.dat.gz” and “test.dat.gz” contain the training and test data in SVMLight format. The files “train.bin.gz” and “test.bin.gz” contain the same information under a binary format that loads faster.
Program “svmsgd” implements a straightforward stochastic gradient descent algorithm.
Points (1) and (4) were inspired by (Shalev-Schwartz et al., 2007).
Program “svmsgd2” illustrates another way to address some of these problems.
The following tables reports the essential results. Programs SVMLight and SVMPerf are well known SVM solvers written by Thorsten Joachims. SVMLight is suitable for SVMs with arbitrary kernels. Similar results could be achieved using Chih-Jen Lin's LibSVM software. SVMPerf is a specialized solver for linear SVMs. It is considered to be one of the most efficient optimizer for this particular problem.
| Algorithm (hinge loss) | Training Time | Primal cost | Test Error |
|---|---|---|---|
| SVMLight | 23642 secs | 0.2275 | 6.02% |
| SVMPerf | 66 secs | 0.2278 | 6.03% |
Stochastic Gradient (svmsgd) | 1.4 secs | 0.2275 | 6.02% |
Stochastic Gradient (svmsgd2) | 1.4 secs | 0.2275 | 6.01% |
Chih-Jen Lin's LibLinear
software is a sophisticated solver for regularized linear systems using
differentiable losses such as the logarithmic loss.
The following results have been achieved by setting the following
constants in the “svmsgd2” source code.
#define LOSS LOGLOSS #define BIAS 1 #define REGULARIZEBIAS 1
It is in fact very easy to change the loss function or the regularizer in a Stochastic Gradient Descent code. We just change the functions that computes the gradients.
| Algorithm (log loss) | Training Time | Primal cost | Test Error |
|---|---|---|---|
| LibLinear (eps=0.01) | 30 secs | 0.18907 | 5.68% |
| LibLinear (eps=0.001) | 44 secs | 0.18890 | 5.70% |
Stochastic Gradient (svmsgd2) | 2.3 secs | 0.18893 | 5.66% |
The CRF program “crfsgd” is located in directory “sgd/svm”.
See the file “sgd/crf/README” for details.
Our CRF benchmark is the CONLL2000 Text Chunking task, which consists of dividing a text in syntactically correlated segments (noun phrases, verb phrases, etc.). We use the feature template that comes with CRF++.
| CONLL2000 Chunking Task | |
|---|---|
| Training sentences | 8936 |
| Training chunks | 106978 |
| Testing sentences | 2012 |
| Testing chunks | 23852 |
| Parameters | 1679700 |
The program “crfsgd” is a general purpose CRF solver.
It relies on a straightforward stochastic gradient.
The program interface is modelled after the CRF++
software by Taku Kudo.
It uses the same template files to specify the features,
takes the same data files after gzip compression, and
produces the same tagging files than can be
analyzed using the same perl script “conlleval”.
See the CRF++ documentation for details.
Usage (training): crfsgd [options] model template traindata [devdata] Usage (tagging): crfsgd -t model testdata Options for training: -c <num> : capacity control parameter (1.0) -f <num> : threshold on the occurences of each feature (3) -r <num> : total number of epochs (50) -h <num> : epochs between each testing phase (10) -e <cmd> : performance evaluation command (conlleval -q) -q : silent mode
We compare our results with those achieved by the CRF++ program using the same parametrization. CRF++ utilizes a Limited-storage BFGS optimizer which is considered to be one of the best optimizers for Conditional Random Fields. Yet our simple stochastic gradient descent runs faster.
| Algorithm | Training Time | Primal cost | Test F1 score |
|---|---|---|---|
| CRF++ (LBFGS optimizer) | 4335 secs | 9042 | 93.74% |
Stochastic Gradient (crfsgd) | 568 secs | 9098 | 93.75% |
Note: Graph Transformer Networks (1997) can be described as sophisticated nonlinear Conditional Random Fields trained with Stochastic Gradient Descent.
Léon Bottou: Stochastic Gradient Learning in Neural Networks, Proceedings of Neuro-Nîmes 91, EC2, Nimes, France, 1991.
Léon Bottou: Online Algorithms and Stochastic Approximations, Online Learning and Neural Networks, Edited by David Saad, Cambridge University Press, Cambridge, UK, 1998.
Léon Bottou and Yann LeCun: Large Scale Online Learning, Advances in Neural Information Processing Systems 16, Edited by Sebastian Thrun, Lawrence Saul and Bernhard Schölkopf, MIT Press, Cambridge, MA, 2004.
Léon Bottou and Olivier Bousquet: The Tradeoffs of Large Scale Learning, Advances in Neural Information Processing Systems, 20, MIT Press, Cambridge, MA, 2008.
A lot more material is available on my pages about Stochastic Gradient Learning.
Two recent papers explore variants of Stochastic Gradient descent for SVMs (Shalev-Schwartz et al., 2007) and CRFs (Vishwanathan et al., 2006). Both papers stress valuable improvements to the plain Stochastic Gradient Descent. However the above benchmarks show that a careful implementation plain Stochastic Gradient Descent is already very competitive.
| Version | Comments |
|---|---|
sgd-1.0.tar.gz | Initial release. |
| sgd-1.1.tar.gz | Relicensed under LGPL |
| sgd-1.2.tar.gz | Fixed bug in svm/reprocess.cpp. Thanks to Wei Xu. |