====== Stochastic Gradient Descent (v.2) ====== Learning algorithms based on [[:research:stochastic|Stochastic Gradient]] approximations are known for their poor performance on optimization tasks and their extremely good performance on machine learning tasks (Bottou and Bousquet, 2008). Despite these proven capabilities, there were lingering concerns about the difficulty of setting the adaptation gains and achieving robust performance. Stochastic gradient algorithms have been historically associated with back-propagation algorithms in multilayer neural networks, which can be very challenging non-convex problems. Stochastic gradient algorithms are also notoriously hard to debug because they often appear to somehow work despite the bugs. Experimenters are then led to believe, incorrectly, that the algorithm itself is flawed. Therefore it is useful to see how Stochastic Gradient Descent performs on simple linear and convex problems such as linear [[wp>Support_Vector_Machine|Support Vector Machines (SVMs)]] or [[wp>Conditional_Random_Field|Conditional Random Fields (CRFs)]]. This page proposes simple code examples illustrating the good properties of stochastic gradient descent algorithms. The provided source code values clarity over speed. * The first release of this code (2007) was written to accompany my [[:talks:largescale|2007 NIPS tutorial]] on [[:research:largescale|large scale learning]]. * The second major release of this code (2011) adds a robust implementation of the //averaged stochastic gradient descent// algorithm (Ruppert, 1988) which consists of performing stochastic gradient descent iterations and simultaneously averaging the parameter vectors over time. When the stochastic gradient gains decrease with an appropriately slow schedule, Polyak and Juditsky (1992) have shown that the convergence is asymptotically as efficient as second-order stochastic gradient descent with much smaller computational costs. One can therefore hope to match the batch optimization performance after a single pass on the randomly shuffled training set (Fabian, 1978; Bottou and LeCun, 2004). Achieving one-pass learning in practice remains difficult because one often needs more than one pass to simply reach this favorable asymptotic regime. The gain schedule has a deep impact on this convergence. Finer analyses (Xu, 2010; Bach and Moulines, 2011) reveal useful guidelines to set these learning rates. Xu (2010) also describe a wonderful way to efficiently perform the averaging operation when the training data is sparse. On problems with moderate dimension, the resulting algorithm reaches near-optimal test set performance in one or two passes. When the dimension increases, the advantage erodes, but the algorithm remains competitive. ===== Download and Compilation ===== === Downloading the programs === * Create a clone of the [[http://git-scm.org|git]] repository with command: git clone http://leon.bottou.org/git/sgd.git * As an alternative, you can still download the tarball {{sgd-2.1.tar.gz}} released in September 2012. This is no longer the recommended approach but should still work. * Make sure to read the files ''"README.txt"'' and ''"data/README.txt"''. === Downloading the datasets === * Download the following files from the [[http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm|RCV1-V2]] web site and install them in directory ''"data/rcv1"''. * [[http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt0.dat.gz|lyrl2004_tokens_test_pt0.dat.gz]], [[http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt1.dat.gz|lyrl2004_tokens_test_pt1.dat.gz]] * [[http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt2.dat.gz|lyrl2004_tokens_test_pt2.dat.gz]], [[http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_test_pt3.dat.gz|lyrl2004_tokens_test_pt3.dat.gz]] * [[http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a12-token-files/lyrl2004_tokens_train.dat.gz|lyrl2004_tokens_train.dat.gz]], [[http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a08-topic-qrels/rcv1-v2.topics.qrels.gz|rcv1-v2.topics.qrels.gz]] * Download the following files from the [[ftp://largescale.ml.tu-berlin.de/largescale/|Pascal Large-Scale Challenge]] server and install them in directory ''"data/pascal"''. * [[ftp://largescale.ml.tu-berlin.de/largescale/scripts/convert.py|convert.py]], [[ftp://largescale.ml.tu-berlin.de/largescale/alpha/alpha_train.lab.bz2|alpha_train.lab.bz2]], [[ftp://largescale.ml.tu-berlin.de/largescale/alpha/alpha_train.dat.bz2|alpha_train.dat.bz2]] (860MB) * Optionally, [[ftp://largescale.ml.tu-berlin.de/largescale/webspam/webspam_train.lab.bz2|webspam_train.lab.bz2]], [[ftp://largescale.ml.tu-berlin.de/largescale/webspam/webspam_train.dat.bz2|webspam_train.dat.bz2]] (1.5GB) * Download the following files from the [[http://www.cnts.ua.ac.be/conll2000/chunking|CONLL2000 Chunking]] web site and install them in directory ''"data/conll2000"''. * [[http://www.cnts.ua.ac.be/conll2000/chunking/train.txt.gz|train.txt.gz]], [[http://www.cnts.ua.ac.be/conll2000/chunking/test.txt.gz|test.txt.gz]] * Run the conversion script for the Pascal datasets. This requires Python and takes some time. $ cd data/pascal $ ./convert.py -o alpha.txt alpha train $ ./convert.py -o webspam.txt webspam train # (optional) === Compiling === * To compile under Linux, simply type ''"make"'' in the toplevel directory. The compilation requires the [[http://www.zlib.net|ZLib]] library. Most Linux distributions provide this by default. Things should be very similar under other Unix operating systems. * To compile under Windows, see the file ''"win/README.txt"''. ===== Stochastic Gradient SVM ===== The SVM programs are located in directory ''"sgd/svm"''. The file ''"svm/README.txt"'' provides the full details and describes how to run various experiments. === SvmSgd === Program ''"svmsgd"'' implements a straightforward stochastic gradient descent algorithm. - The learning rate has the form //η0 / (1 + λ η0 t)// where //λ// is the regularization constant. - The constant //η0// is determined by performing preliminary experiments on a data subsample. - The learning rate for the bias is multiplied by 0.01 because this frequently improves the condition number. - The weights are represented as the product of a scalar and a vector in order to easily implement the weight decay operation that results from the regularization term. Points (1) and (4) follow (Shalev-Schwartz et al., 2007). === SvmAsgd === Program ''"svmasgd"'' implements the averaged stochastic gradient descent algorithm. - The learning rate has the form //η0 / (1 + λ η0 t) 0.75// where //λ// is the regularization constant. - The constant //η0// is determined by performing preliminary experiments on a data subsample. - Averaging starts only at the second epoch. The first epoch is a simple stochastic gradient descent. This provides more robust convergence at the expense of one additional epoch. You can use option ''-avgstart'' to change this. - Both the stochastic gradient weights and the averaged weights are represented using a linear transformation that yields efficiency gains for sparse training data. Points (1) and (4) follow (Xu, 2010). === RCV1 Benchmark === The benchmark task is the recognition of RCV1 documents belonging to the class ''CCAT''. Program ''"prep_rcv1"'' reads the RCV1-V2 token files from directory ''"data/rcv1"'' and computes the TF/IDF features on the basis of the our training set. This programs produces the two files ''"rcv1.train.bin.gz"'' and ''"rcv1.test.bin.gz"'' containing our training and testing sets. This short program also generates files in SvmLight format when compiled with option ''-DPREP_SVMLIGHT''. ^ Benchmark ^ Features ^ Training examples ^ Testing examples ^ | **RCV1** | 47152 | 781265 | 23149 | \\ ^ Algorithm (hinge loss, λ=1e-4) ^ Training Time* ^ Primal cost ^ Test Error ^ | SMO ([[http://svmlight.joachims.org|SVMLight]]) | ≃ 16000 secs1 | 0.2275 | 6.02% | | Cutting Plane ([[http://www.cs.cornell.edu/People/tj/svm_light/svm_perf.html|SVMPerf]]) | ≃ 45 secs2 | 0.2278 | 6.03% | | Hinge Loss SDCA ([[http://www.csie.ntu.edu.tw/~cjlin/liblinear|LibLinear -s 3 -B 1]]) | 2.5 secs | - | 6.02% | | SGD (''svmsgd'') | < 1 sec. | 0.2275 | 6.02% | | ASGD (''svmasgd'') | < 1 sec. | 0.2275 | 6.02% | 1 Extrapolated from a 23642 seconds run on a 30% slower machine.\\ 2 Extrapolated from a 66 seconds run on a 30% slower machine.\\ * All timings exclude data loading time.\\ \\ ^ Algorithm (log loss, λ=5e-7) ^ Training Time ^ Primal cost ^ Test Error ^ | TRON ([[http://www.csie.ntu.edu.tw/~cjlin/liblinear|LibLinear -s 0 -B 1]]) | 33 secs | - | 5.14% | | SDCA ([[http://www.csie.ntu.edu.tw/~cjlin/liblinear|LibLinear -s 7 -B 1]]) | 15 secs | - | 5.13% | | SGD (''svmsgd'') | 4 secs | 0.1283 | 5.14% | | ASGD (''svmasgd'') | 5 secs | 0.1281 | 5.13% | \\ === Alpha Benchmark === You must first convert the original Pascal files using the Python script ''convert.py'' as explained above. Program ''"prep_alpha"'' then reads the converted Pascal file ''"data/pascal/alpha.txt"'' and produces the two files ''"rcv1.train.bin.gz"'' and ''"rcv1.test.bin.gz"'' containing our training and testing sets. The alpha dataset illustrates how ASGD sometimes vastly outperforms plain SGD. The plot below shows how ASGD reaches near-optimal test performance only one pass after the activation of the averaging code (epoch 2.) ^ Benchmark ^ Features ^ Training examples ^ Testing examples ^ | **Alpha** | 500 | 250000 | 250000 | \\ ^ Algorithm (log loss, λ=1e-6) ^ Training Time ^ Primal cost ^ Test Error ^ | TRON ([[http://www.csie.ntu.edu.tw/~cjlin/liblinear|LibLinear -s 0 -B 1]]) | 115 secs | - | 21.86% | | SDCA ([[http://www.csie.ntu.edu.tw/~cjlin/liblinear|LibLinear -s 7 -B 1]]) | 19 secs | - | 21.86% | | SGD (''svmsgd'') | 51 secs | 0.4717 | 21.95% | | ASGD (''svmasgd'') | 4 secs | 0.4713 | 21.86% | \\ {{alpha-cost.png}} ===== Stochastic Gradient CRFs ===== The CRF programs ''"crfsgd"'' and ''"crfasgd"'' respectively use the SGD and ASGD algorithms to train a [[wp>Conditional_random_field|Conditional Random Field]]. The algorithms are setup exactly as the SVM variants, but the implementation accounts for the greater structural complexity of conditional random fields. Both programs accept the same input files as the well known [[http://crfpp.sourceforge.net|CRF++]] software by [[http://www.chasen.org/~taku|Taku Kudo]]. They produce the same tagging files which can be analyzed using the CONLL perl script ''"conlleval"''. They use the same template files to specify the features and accept the same data files. For convenience, ''"crfsgd"'' and ''"crfasgd"'' can also directly read gzipped data files. See the CRF++ documentation for details. === CONLL 2000 Benchmark === Our CRF benchmark is the [[http://www.cnts.ua.ac.be/conll2000/chunking|CONLL2000 Text Chunking]] task, which consists of dividing a text in syntactically correlated segments (noun phrases, verb phrases, etc.). We use the feature template that comes with CRF++. ^ Benchmark ^ Parameters ^ Training sentences ^ Training chunks ^ Testing sentences ^ Testing chunks ^ | ** CONLL2000 ** | 1679700 | 8936 | 106978 | 2012 | 23852 | \\ ^ Algorithm (-f 3 -c 1) ^ Training Time ^ Primal cost ^ Test F1 score ^ | LBFGS ([[http://crfpp.sourceforge.net|CRF++]]) | ≃ 3000 secs3 | 9042 | 93.74% | | SGD (''crfsgd'') | 303 secs | 9096 | 93.73% | | ASGD (''crfasgd'') | 134 secs | 9204 | 93.79% | 3 Extrapolated from a 4335 seconds run on a 30% slower machine.\\ {{crf-f1.png}} ===== References ===== Francis Bach, Eric Moulines. **Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning**. Technical report, to appear in Advances in Neural Information Processing Systems (NIPS), 2011. [[http://hal.archives-ouvertes.fr/hal-00608041/en|HAL Link (hal-00608041).]] Léon Bottou: **Stochastic Gradient Learning in Neural Networks**, //Proceedings of Neuro-Nîmes 91//, EC2, Nimes, France, 1991. [[:papers/bottou-91c|more...]] Léon Bottou: **Online Algorithms and Stochastic Approximations**, //Online Learning and Neural Networks//, Edited by David Saad, Cambridge University Press, Cambridge, UK, 1998. [[:papers/bottou-98x|more...]] Léon Bottou and Yann LeCun: **Large Scale Online Learning**, //Advances in Neural Information Processing Systems 16//, Edited by Sebastian Thrun, Lawrence Saul and Bernhard Schölkopf, MIT Press, Cambridge, MA, 2004. [[:papers/bottou-lecun-2004|more...]] Léon Bottou and Olivier Bousquet: **The Tradeoffs of Large Scale Learning**, //Advances in Neural Information Processing Systems//, 20, MIT Press, Cambridge, MA, 2008. [[:papers/bottou-bousquet-2008|more...]] Léon Bottou: **Large-Scale Machine Learning with Stochastic Gradient Descent**, //Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT'2010)//, 177--187, Edited by Yves Lechevallier and Gilbert Saporta, Paris, France, August 2010, Springer. [[:papers/bottou-2010|more...]] Boris T. Polyak, Anatoly B. Juditsky: **Acceleration of stochastic approximation by averaging**, //SIAM Journal on Control and Optimization//, 30(4):838-855, 1992. [[http://dl.acm.org/citation.cfm?id=131098|ACM Link]] David Ruppert. **Efficient estimations from a slowly convergent Robbins-Monro process**. Technical Report 781, Cornell University Operations Research and Industrial Engineering, 1988. Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro: **Pegasos: Primal Estimated subGrAdient sOlver for Svm**, //Proceedings of the 24th International Conference on Machine Learning//, 2007. [[http://www.machinelearning.org/proceedings/icml2007/papers/587.pdf|pdf]]. S. V. N. Vishwanathan, Nicol N. Schraudolph, Mark Schmidt, Kevin Murphy: **Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods**. //Proceedings of the 23rd International Conference on Machine Learning//, 969–976, ACM Press, 2006. [[http://www.stat.purdue.edu/~vishy/papers/VisSchSchMur06.pdf|pdf]] Wei Xu: **Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent**, Technical report, 2010. [[http://arxiv.org/abs/1107.2490|ArXiv Link (arXiv:1107.2490v1)]] ===== Changelog ===== ^ Version ^ Date |Comments ^ | sgd-1.0.tar.gz | 2007-08-22 |Initial release. | | {{sgd-1.1.tar.gz}} | 2007-10-04 |Relicensed under LGPL | | {{sgd-1.2.tar.gz}} | 2008-04-07 |Bug fix in ''svm/reprocess.cpp''. Thanks to Wei Xu. | | {{sgd-1.3.tar.gz}} | 2009-01-31 |Added SSE2 code for dense vectors. | | {{sgd-2.0.tar.gz}} | 2011-10-12 |Major release featuring sgd and asgd. | | {{sgd-2.1.tar.gz}} | 2012-09-14 |Bug fix: istream::unget() does not clear eof bit | | Latest | ''git clone ht''''tp:leon.bottou.org/git/sgd.git'' || \\ ===== Related projects ===== Rather than emphasizing performance or generality, this code favors readability. I am therefore glad to see that many authors of machine learning projects have found it useful, sometimes directly, sometimes as a source of inspiration. Here are a few links: * [[http://faculty.chicagobooth.edu/Panagiotis.toulis/implicitSGD.html|Implicit SGD]] uses the SVMSGD code as a starting point. * [[http://scikit-learn.org|SciKit]] contains a [[http://scikit-learn.org/0.11/modules/sgd.html|SGD optimizer]] inspired by SVMSGD. * [[http://www.shogun-toolbox.org|Shogun]] contains a [[http://www.shogun-toolbox.org/doc/en/current/classshogun_1_1CSVMSGD.html|SGD module]] derived from SVMSGD. * [[http://hunch.net/~vw/|Vowpal Wabbit]] goes to [[http://hunch.net/?p=309|great]] [[https://github.com/JohnLangford/vowpal_wabbit/wiki/Rcv1-example|lengths]] to go faster than SVMSGD ;-) * [[http://nlp.stanford.edu/nlp/javadoc/javanlp/|Stanford's JavaNLP project]] [[http://www-nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/optimization/SGDMinimizer.html|found]] [[http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ie/crf/CRFFeatureExporter.html|inspiration]] in CRFSGD. * ...