Stochastic Gradient Descent Examples on Toy Problems

Stochastic Gradient Descent (SGD) has been historically associated with back-propagation algorithms in multilayer neural networks. These nonlinear nonconvex problems can be very difficult. Therefore it is useful to see how Stochastic Gradient Descent performs on simple linear and convex problems such as linear Support Vector Machines (SVMs) or Conditional Random Fields (CRFs).

Disclaimer 1: These are just toy examples for illustrating the properties of SGD. This code is designed for clarity rather than speed. I do not advocate using CRFs when simpler solutions exist.

Disclaimer 2: The following results have been obtained using plain stochastic gradient. This page does not advocate fancy modifications of an old algorithm, it just shows that the old algorithm itself is amazingly fast!

Download and Compilation

  • To compile under Linux, simply type “make” in the toplevel directory. The compilation requires the ZLib library. Most Linux distributions provide this by default. Things should be very similar under other Unix operating systems.

Stochastic Gradient SVMs

The SVM programs are located in directory “sgd/svm”. See the file “sgd/svm/README” for more details.

Benchmark

Our SVM benchmark is the recognition of the RCV1 documents belonging to the class CCAT. In order to have a large training set we swap the official RCV1-V2 training and test sets. The features are the 47152 tokens belonging to both the training and test sets.

RCV1-V2 (class “ccat” versus rest)
Training examples 781265
Testing examples 23149
Parameters 47152


Preprocess

Program “preprocess” reads the RCV1-V2 token files from directory “data/rcv1” and computes the TF/IDF features on the basis of the our training set. This programs produces four files. The gzipped files “train.dat.gz” and “test.dat.gz” contain the training and test data in SVMLight format. The files “train.bin.gz” and “test.bin.gz” contain the same information under a binary format that loads faster.

SvmSgd

Program “svmsgd” implements a straightforward stochastic gradient descent algorithm.

  1. The learning rate has the form 1/λ(t+t0) where λ is the regularization constant.
  2. The constant t0 is determined heuristically to ensure that the initial update is comparable to the expected size of the weights.
  3. The learning rate for the biases is multiplied by 0.01 because the biases are updated at each iteration while most other weights are updated less frequently.
  4. The weights are represented as the product of a scalar and a vector in order to easily implement the weight decay operation that results from the regularization term.

Points (1) and (4) were inspired by (Shalev-Schwartz et al., 2007).

SvmSgd2

Program “svmsgd2” illustrates another way to address some of these problems.

  1. The weight vector is represented directly. The weight decay operation is costly because it affects all the weights coefficients. To compensate, the weight decay is performed stochastically with a low probability and a higher gain.
  2. A calibration function performs sparsity measurements on the training data in order to determine the proper schedule for the weight decay operation and also to estimate a multiplier for the bias learning rates.

SVM Results

The following tables reports the essential results. Programs SVMLight and SVMPerf are well known SVM solvers written by Thorsten Joachims. SVMLight is suitable for SVMs with arbitrary kernels. Similar results could be achieved using Chih-Jen Lin's LibSVM software. SVMPerf is a specialized solver for linear SVMs. It is considered to be one of the most efficient optimizer for this particular problem.

Algorithm (hinge loss) Training Time Primal cost Test Error
SVMLight 23642 secs 0.2275 6.02%
SVMPerf 66 secs 0.2278 6.03%
Stochastic Gradient (svmsgd) 1.4 secs 0.2275 6.02%
Stochastic Gradient (svmsgd2) 1.4 secs 0.2275 6.01%


Log-loss SVM Results

Chih-Jen Lin's LibLinear software is a sophisticated solver for regularized linear systems using differentiable losses such as the logarithmic loss. The following results have been achieved by setting the following constants in the “svmsgd2” source code.

#define LOSS LOGLOSS
#define BIAS 1
#define REGULARIZEBIAS 1

It is in fact very easy to change the loss function or the regularizer in a Stochastic Gradient Descent code. We just change the functions that computes the gradients.

Algorithm (log loss) Training Time Primal cost Test Error
LibLinear (eps=0.01) 30 secs 0.18907 5.68%
LibLinear (eps=0.001) 44 secs 0.18890 5.70%
Stochastic Gradient (svmsgd2) 2.3 secs 0.18893 5.66%


Stochastic Gradient CRFs

The CRF program “crfsgd” is located in directory “sgd/svm”. See the file “sgd/crf/README” for details.

Benchmark

Our CRF benchmark is the CONLL2000 Text Chunking task, which consists of dividing a text in syntactically correlated segments (noun phrases, verb phrases, etc.). We use the feature template that comes with CRF++.

CONLL2000 Chunking Task
Training sentences 8936
Training chunks 106978
Testing sentences 2012
Testing chunks 23852
Parameters 1679700


CrfSgd

The program “crfsgd” is a general purpose CRF solver. It relies on a straightforward stochastic gradient.

  1. The learning rate has again the form 1/λ(t+t0).
  2. The constant t0 is determined by a calibration function that tries a number of constant learning rates using a random sample of 1000 training examples. Using a small sample makes this process very fast. We select the learning rate that achieves the strongest reduction of the objective function, we conservatively divide it by two, and choose t0 to obtain that initial learning rate.

The program interface is modelled after the CRF++ software by Taku Kudo. It uses the same template files to specify the features, takes the same data files after gzip compression, and produces the same tagging files than can be analyzed using the same perl script “conlleval”. See the CRF++ documentation for details.

Usage (training): crfsgd [options] model template traindata [devdata]
Usage (tagging):  crfsgd -t model testdata
Options for training:
 -c <num> : capacity control parameter (1.0)
 -f <num> : threshold on the occurences of each feature (3)
 -r <num> : total number of epochs (50)
 -h <num> : epochs between each testing phase (10)
 -e <cmd> : performance evaluation command (conlleval -q)
 -q       : silent mode

Results

We compare our results with those achieved by the CRF++ program using the same parametrization. CRF++ utilizes a Limited-storage BFGS optimizer which is considered to be one of the best optimizers for Conditional Random Fields. Yet our simple stochastic gradient descent runs faster.

Algorithm Training Time Primal cost Test F1 score
CRF++ (LBFGS optimizer) 4335 secs 9042 93.74%
Stochastic Gradient (crfsgd) 568 secs 9098 93.75%


Note: Graph Transformer Networks (1997) can be described as sophisticated nonlinear Conditional Random Fields trained with Stochastic Gradient Descent.

Further Readings

Léon Bottou: Stochastic Gradient Learning in Neural Networks, Proceedings of Neuro-Nîmes 91, EC2, Nimes, France, 1991.

more...

Léon Bottou: Online Algorithms and Stochastic Approximations, Online Learning and Neural Networks, Edited by David Saad, Cambridge University Press, Cambridge, UK, 1998.

more...

Léon Bottou and Yann LeCun: Large Scale Online Learning, Advances in Neural Information Processing Systems 16, Edited by Sebastian Thrun, Lawrence Saul and Bernhard Schölkopf, MIT Press, Cambridge, MA, 2004.

more...

Léon Bottou and Olivier Bousquet: The Tradeoffs of Large Scale Learning, Advances in Neural Information Processing Systems, 20, MIT Press, Cambridge, MA, 2008.

more...

A lot more material is available on my pages about Stochastic Gradient Learning.

Two recent papers explore variants of Stochastic Gradient descent for SVMs (Shalev-Schwartz et al., 2007) and CRFs (Vishwanathan et al., 2006). Both papers stress valuable improvements to the plain Stochastic Gradient Descent. However the above benchmarks show that a careful implementation plain Stochastic Gradient Descent is already very competitive.

Changelog

Version Comments
sgd-1.0.tar.gz Initial release.
sgd-1.1.tar.gz Relicensed under LGPL
sgd-1.2.tar.gz Fixed bug in svm/reprocess.cpp. Thanks to Wei Xu.


 
projects/sgd.txt · Last modified: 2010/02/22 10:31 by leonb
Recent changes RSS feed Creative Commons License DjVu Enabled Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki