====== The infinite MNIST dataset ======

**Formerly known as ''MNIST8M''.**

==== 1. Background ====

This code produces an infinite supply of digit images derived from the well-known MNIST dataset using pseudo-random deformations and translations. This is a streamlined version of the code used for the experiments reported in [[:papers:loosli-canu-bottou-2006|(Loosli, Canu, Bottou, 2007)]]. A subset of the examples generated by this code is known as MNIST8M. Unfortunately, the original MNIST8M files have been deleted from the NEC servers. However, you can use InfiMNIST to regenerate these files, or to generate much larger files if you prefer. You can even use this code to generate deformed MNIST examples on the fly.

Each InfiMNIST example is identified by a long integer index that determines the source of the example and the transformations applied to the pattern. The examples numbered 0 to 9999 are the standard MNIST testing examples. The examples numbered 10000 to 69999 are the standard MNIST training examples. Each example with index //i>=70000// is generated by applying a pseudo-random transformation to the MNIST training example numbered //10000+((i-10000)%60000)//. Because the pseudo-random transformations are deterministically derived from the example number, this is similar to having a file containing about one trillion distinct MNIST examples.

==== 2. Data Files ====

Six data files are located in directory ''data'' of the source archive.

  * Files ''{t10k,train}-images-idx3-ubyte'' and ''{t10k,train}-labels-idx1-ubyte'' are the [[http://yann.lecun.com/exdb/mnist|pristine MNIST data files]].
  * File ''tangVec_float_60000x28x28.bin'' contains precomputed tangent vectors for the MNIST training images.
  * File ''fields_float_1522x28x28.bin'' contains pseudo-random vector fields used to generate the character deformations.

All six files must be available at execution time and reside in the same directory.

==== 3. Download and Compilation ====

Download the source file using the following link:

^ File ^ Version ^ Size ^ Notes ^
| n/a | 1.1 | 349MB | initial release. |
| n/a | 1.2 | 349MB | added more output formats. |
| {{infimnist.tar.gz}} | 1.3 | 350MB | generated data exactly matches mnist8m (bug fix) |

The supplied makefiles are standard and should work on nearly all machines. Customizing the ''CFLAGS'' variable may yield better performance.

  * Linux/Unix/Cygwin: Unpack the archive and type ''make''.
  * Windows: Unpack the archive and type ''nmake /f NMakefile'' in an MSVC shell.

==== 4. Using the InfiMNIST executable ====

Synopsis:

  $ infimnist [-d <datadir>] <format> <first> <last>

Option ''-d <datadir>'' specifies the location of the six data files. The default data directory is ''data'' in the current directory. Arguments ''<first>'' and ''<last>'' define the first and last index of the range of examples written to the standard output. Argument ''<format>'' describes the format of the produced data. Any unambiguous prefix of the following formats is recognized:

  * ''patterns'' produces an image file using the standard MNIST binary format.
  * ''labels'' produces a label file using the standard MNIST binary format.
  * ''svmlight'' produces a file suitable for [[http://svmlight.joachims.org/|SVMLight]] or [[http://www.csie.ntu.edu.tw/~cjlin/libsvm/|LibSVM]].
  * ''vw'' produces a file suitable for [[http://hunch.net/~vw/|Vowpal Wabbit]].
  * ''arff'' produces a sparse file suitable for [[http://www.cs.waikato.ac.nz/ml/weka|Weka]].
  * ''display'' produces rudimentary ASCII art.
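
The index arguments follow the numbering described in the Background section. The following C sketch restates that numbering as code. The function names ''source_index'' and ''is_deformed'' are hypothetical and are not part of the InfiMNIST sources; only the arithmetic is taken from the text above.

```c
#include <assert.h>

/* Hypothetical helpers, not part of infimnist.h.
   They only restate the index numbering given in the Background section. */

/* Returns the index (0..69999) of the pristine MNIST example
   from which example i is derived. */
long source_index(long i)
{
    assert(i >= 0);                 /* indices are non-negative */
    if (i < 70000)
        return i;                   /* pristine testing or training example */
    return 10000 + (i - 10000) % 60000;
}

/* Returns nonzero when example i is a pseudo-randomly deformed example. */
int is_deformed(long i)
{
    return i >= 70000;
}
```

Under this numbering, index 70000 is derived from training example 10000, and the range 10000 to 8109999 visits each of the 60000 training examples 135 times: once pristine and 134 times deformed.
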
Examples:

  * Generating files containing the standard MNIST testing set:

  $ infimnist lab 0 9999 > test10k-labels
  $ infimnist pat 0 9999 > test10k-patterns

  * Generating files containing the standard MNIST training set:

  $ infimnist lab 10000 69999 > mnist60k-labels-idx1-ubyte
  $ infimnist pat 10000 69999 > mnist60k-patterns-idx3-ubyte

  * Generating an ASCII art version of the first ten deformed characters:

  $ infimnist disp 70000 70009

  * Generating files containing the MNIST8M training set with a format similar to the standard MNIST files. This is intended to provide //exactly the same data// as the original MNIST8M dataset. A bug in releases 1.1 and 1.2 prevented this on 64-bit machines. This bug was fixed in release 1.3.

  $ infimnist lab 10000 8109999 > mnist8m-labels-idx1-ubyte
  $ infimnist pat 10000 8109999 > mnist8m-patterns-idx3-ubyte

  * Generating a LibSVM-compatible MNIST8M file. This file is expected to be //identical// to the MNIST8M file saved on the [[https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html|libsvm]] web site.

  $ infimnist svm 10000 8109999 > mnist8m-libsvm.txt

==== 5. Using InfiMNIST as a library ====

Files ''infimnist.h'' and ''infimnist.c'' form a self-contained library that you can use to generate an infinite amount of MNIST-like examples on the fly. The interface is explained by the comments found in file ''infimnist.h'', reproduced below:

  /* Function <infimnist_create> creates the infimnist_t data structure
     that contains the digit data (about 450MB) and caches up to about
     1GB worth of deformed digit images. The argument <datadir> points to
     the directory containing the data files. Setting it to NULL
     implicitly selects the directory named "data" in the current
     directory. */
  infimnist_t *infimnist_create(const char *datadir);
  
  /* Function <infimnist_destroy> destroys the data structure
     and returns its memory to the heap. */
  void infimnist_destroy(infimnist_t*);
  
  /* Function <infimnist_get_label> returns the label (0 to 9)
     associated with example <index>. */
  int infimnist_get_label(infimnist_t*, long index);
  
  /* Function <infimnist_get_pattern> returns the image associated
     with the example numbered <index>. The image takes the form of
     a vector of 784 unsigned bytes organized in row major order.
     Each byte takes a value ranging from 0 (white) to 255 (black).
     There is no need to free the resulting pointer as it directly
     points into the pattern cache. These vectors may be automatically
     deallocated in the future. However, at any time, you can safely
     access the last ten vectors returned by this function. */
  const unsigned char *infimnist_get_pattern(infimnist_t*, long index);

==== 6. Credits ====

The original code was written by [[http://gaelle.loosli.fr|Gaëlle Loosli]] and [[:start|Léon Bottou]] in 2007. The generation of deformed digits makes heavy use of the techniques pioneered by [[http://research.microsoft.com/en-us/people/patrice/|Patrice Simard]] and his coauthors in their [[http://papers.nips.cc/paper/536-tangent-prop-a-formalism-for-specifying-selected-invariances-in-an-adaptive-network|tangent prop paper]].
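
As a closing illustration of the pattern layout documented in Section 5, the sketch below shows one way a 784-byte pattern could be traversed in row major order. It does not call the library. The names ''pixel_at'' and ''render_ascii'', the 28x28 constants, and the five-level shading are assumptions made for this example and may differ from what the ''display'' format actually prints.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed geometry: 784 bytes = 28 rows x 28 columns, row major order. */
enum { DIGIT_ROWS = 28, DIGIT_COLS = 28 };

/* Hypothetical helper: intensity at (row, col), 0 = white, 255 = black. */
unsigned char pixel_at(const unsigned char *pattern, int row, int col)
{
    assert(pattern != NULL);
    assert(row >= 0 && row < DIGIT_ROWS && col >= 0 && col < DIGIT_COLS);
    return pattern[row * DIGIT_COLS + col];
}

/* Hypothetical helper: writes 28 text lines of 28 characters each,
   every line terminated by '\n', followed by a final '\0'.
   The buffer out must hold at least 28 * 29 + 1 bytes. */
void render_ascii(const unsigned char *pattern, char *out)
{
    static const char shades[] = " .:*#";   /* five levels, light to dark */
    size_t k = 0;
    int row, col;

    assert(out != NULL);
    for (row = 0; row < DIGIT_ROWS; row++) {
        for (col = 0; col < DIGIT_COLS; col++) {
            /* Map 0..255 onto shade indices 0..4. */
            int level = pixel_at(pattern, row, col) * 5 / 256;
            out[k++] = shades[level];
        }
        out[k++] = '\n';
    }
    out[k] = '\0';
}
```

Because only the last ten vectors returned by ''infimnist_get_pattern'' are guaranteed to remain accessible, code that needs a pattern for longer would presumably copy the 784 bytes into its own buffer before requesting more examples.
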