Alex Peysakhovich and I represent Facebook on the organizing committee of the NYC Data Science Seminar Series. This rotating seminar, organized by Columbia, Cornell Tech, Facebook, Microsoft Research NYC, and New York University, has featured a number of prominent speakers. Although there is a nice website and a Facebook page, the website does not yet show prominently in search engines. Maybe because it uses the .nyc top-level domain? Anyway, this is why I am posting this little blurb with a lot of links. The URL is http://datascienceseminar.nyc.
I was scavenging my old emails a couple of weeks ago and found a copy of an early technical report that not only describes Graph Transformer Networks in a couple of pages but also explains why they are defined the way they are.
Although Graph Transformer Networks were introduced twenty years ago, they are considerably more powerful than most structured output machine learning methods. Not only do they handle the label bias problem as well as CRFs do, but their hierarchical and modular structure lends itself to many refinements: they can be trained with weak supervision; they can handle pruned search strategies and adapt training to make the pruning work better; and they provide a proven framework for reusing existing code and heuristics.
Graph Transformer Networks were also very successful in the real world. They have been used in commercially deployed check reading machines for more than a decade, processing about one billion checks per year. Unfortunately they are described in hard-to-read papers, such as (Bottou et al., CVPR 1997) or the rarely read second half of (LeCun et al., IEEE 1998).
This tech report is now available on my website.
Why settle for 60000 MNIST training examples when you can have one trillion?
The MNIST8M dataset was generated using the elastic deformation code originally written for (Loosli, Canu, and Bottou, 2007). Unfortunately the original MNIST8M files were accidentally deleted from the NEC servers a couple of weeks ago. Instead of regenerating the files, I have repackaged the generation code in a convenient form. You can now generate arbitrary amounts of pseudo-random MNIST training examples. You can even use this code to generate your training data on the fly. We call this the infinite MNIST dataset.
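For the impatient, here is a minimal Python sketch of the elastic deformation idea behind this kind of on-the-fly example generation. It assumes numpy and scipy and is only an illustration: the actual generation code is written in C and tunes its deformations much more carefully.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, alpha=8.0, sigma=4.0, rng=None):
    """Apply a random elastic deformation to a 28x28 digit image.

    A random displacement field is smoothed by a Gaussian filter of
    scale sigma (so that nearby pixels move together), scaled by
    alpha, and used to resample the image.
    """
    rng = np.random.default_rng(rng)
    h, w = image.shape
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return map_coordinates(image, [y + dy, x + dx], order=1)

def infinite_mnist(images, labels, seed=0):
    """Yield an endless stream of deformed (image, label) pairs."""
    rng = np.random.default_rng(seed)
    while True:
        i = rng.integers(len(images))
        yield elastic_deform(images[i], rng=rng), labels[i]
```

A training loop that consumes this generator never sees the same deformed digit twice, which is the whole point of an infinite dataset.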
Our paper “Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising” has appeared in JMLR. This paper takes the example of ad placement to illustrate how one can leverage causal inference to understand the behavior of complex learning systems interacting with their environment and to predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select changes that improve short-term and long-term performance alike. In particular, the paper demonstrates the connection between the classic explore–exploit and correlation–causation issues in machine learning and statistics.
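The workhorse behind such counterfactual estimates is importance sampling on the randomization already present in the logged data. Here is a toy sketch of a clipped inverse-propensity estimate; the function and variable names are mine, and this elides the confidence intervals and variance-control refinements that the paper actually develops.

```python
import numpy as np

def ips_estimate(rewards, logged_propensities, new_propensities, clip=10.0):
    """Estimate the expected reward of a proposed policy from logged data.

    Each logged event records the probability with which the logging
    policy chose its action.  Reweighting the observed rewards by the
    ratio new/old yields a counterfactual estimate of how the proposed
    policy would have performed; clipping the weights trades a little
    bias for a large reduction in variance.
    """
    w = np.minimum(new_propensities / logged_propensities, clip)
    return np.mean(w * rewards)
```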
NIPS just took place near Lake Tahoe. Many people have written about how things are changing in machine learning. There were also many interesting papers and invited talks. Thanks to the program chairs Max Welling and Zoubin Ghahramani for producing this exciting conference program. Thanks to the workshop chairs Rich Caruana and Gunnar Rätsch for the stimulating workshops. Thanks to Terry Sejnowski for creating NIPS, and special thanks to Mary Ellen Perry without whom nothing would happen.
The report “Counterfactual Reasoning and Learning Systems” shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and to predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select changes that improve short-term and long-term performance alike. This work is illustrated by experiments carried out on the ad placement system associated with the Bing search engine.
Announcing version 2.0 of my Stochastic Gradient Descent package. This release provides implementations of the Stochastic Gradient Descent (SGD) and Averaged Stochastic Gradient Descent (ASGD) algorithms for linear SVMs and CRFs. The latter sometimes shows vastly superior performance. See the SGD package pages for details.
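For the curious, here is a minimal Python sketch of what averaging adds to plain SGD on a hinge loss. This is an illustration only, not the C++ code of the package; the learning-rate schedule and names are my own choices, and practical ASGD implementations usually start averaging only after a warm-up phase.

```python
import numpy as np

def asgd_svm(X, y, lam=1e-4, eta0=0.1, epochs=1):
    """Train a linear SVM by SGD on the hinge loss and also return the
    average of all iterates (the ASGD estimate)."""
    n, d = X.shape
    w = np.zeros(d)      # last SGD iterate
    wavg = np.zeros(d)   # running average of the iterates
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            t += 1
            eta = eta0 / (1 + lam * eta0 * t)   # decreasing step size
            w *= 1 - eta * lam                  # shrink: gradient of the L2 term
            if y[i] * X[i].dot(w) < 1:          # hinge loss is active
                w += eta * y[i] * X[i]
            wavg += (w - wavg) / t              # incremental mean of iterates
    return w, wavg
```

On convex problems of this kind, the averaged iterate typically approaches the optimum much more smoothly than the last iterate.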
Ronan's masterpiece, "Natural Language Processing (Almost) from Scratch", has been published in JMLR. This paper describes how to use a unified neural network architecture to solve a collection of natural language processing tasks with near state-of-the-art accuracies and ridiculously fast processing speed. A couple thousand lines of C code process English sentences at more than 10000 words per second and output part-of-speech tags, named entity tags, chunk boundaries, semantic role labeling tags, and, in the latest version, syntactic parse trees. Download SENNA!
Learning Semantics, NIPS 2011 Workshop, Saturday December 17, 2011. Melia Sierra Nevada & Melia Sol y Nieve, Sierra Nevada, Spain.
This workshop is organized in collaboration with Antoine Bordes, Jason Weston, and Ronan Collobert. This event should be very interesting: I believe that recent machine learning advances indicate new connections between machine learning and machine reasoning, and lead to new opportunities for learning the semantics of the world.
Many machine learning authors write that a certain fundamental combinatorial result was independently established by Vapnik and Chervonenkis (1971), Sauer (1972), Shelah (1972), and sometimes Perles and Shelah (reference unknown). In fact, Vapnik and Chervonenkis published a version of their result in the Proceedings of the USSR Academy of Sciences four years earlier, in 1968. It also appears that Sauer and Shelah pursued this result for very different purposes.
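For reference, the result in question is what is now called the Sauer–Shelah lemma: a family $\mathcal{F}$ of sets with VC dimension $d$ induces at most

$$\bigl|\{\, F \cap S : F \in \mathcal{F} \,\}\bigr| \;\le\; \sum_{i=0}^{d} \binom{n}{i} \;=\; O(n^d)$$

distinct subsets on any set $S$ of $n$ points. This polynomial growth rate is the combinatorial engine behind the Vapnik–Chervonenkis uniform convergence bounds.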
Patrice Simard and I have been friends since the old AT&T Bell Labs times. He eventually convinced me to work for him at Microsoft. He told me to expect “interesting times”.
I can see several reasons for these interesting times.
Rob Schapire and David Blei gave me the opportunity to teach the COS 424 course at Princeton University for the Spring 2010 semester. In fact, Rob is on sabbatical leave at Yahoo! and David is parenting. Running the orphan course was a useful experience. One thousand slides later, I am really eager to see the student projects…
It is the nineties again. Ronan Collobert from NEC Labs just released a noncommercial version of his neural network system for semantic extraction. Given an input sentence in plain English, Senna outputs a host of Natural Language Processing (NLP) tags: part-of-speech (POS) tags, chunking (CHK), named entity recognition (NER), and semantic role labeling (SRL). Senna does this with state-of-the-art accuracies, roughly two hundred times faster than competing approaches.
The Senna source code comprises about 2000 lines of C. This is probably one thousand times smaller than your usual natural language processing program. In fact, all the Senna tagging tasks are performed using the same neural network simulation code.
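To make this concrete, here is a toy Python sketch of the window-approach tagging network described in the paper: an embedding lookup table, a linear layer, a pointwise nonlinearity, and a linear scoring layer. Names and dimensions are illustrative; Senna's actual C code is organized differently and uses a hard tanh.

```python
import numpy as np

def window_tagger(window_word_ids, E, W1, b1, W2, b2):
    """Score the possible tags of the center word of a small window.

    E  : (vocab, embed) word embedding lookup table
    W1 : (hidden, window * embed) and b1 : (hidden,)
    W2 : (ntags, hidden) and b2 : (ntags,)
    """
    x = E[window_word_ids].reshape(-1)   # concatenate window embeddings
    h = np.tanh(W1 @ x + b1)             # Senna uses a hard tanh here
    return W2 @ h + b2                   # one score per candidate tag
```

The same few lines, with different output layers, serve POS tagging, chunking, NER, and so on, which is how a single network simulator can cover every tagging task.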
OLaRank is an online solver of the dual formulation of support vector machines for structured output spaces. The algorithm can use exact or greedy inference. Its running time scales linearly with the data size and is competitive with that of a perceptron based on the same inference procedure. Its accuracy, however, is much better, as it replicates the accuracy of a structured SVM. See the ECML/PKDD paper "Sequence Labelling SVMs Trained in One Pass" for details.
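"Greedy inference" here means decoding a sequence left to right, committing to each tag given the previous one rather than running exact Viterbi decoding. A minimal sketch, with illustrative names and a simple additive transition score:

```python
import numpy as np

def greedy_decode(scores, trans):
    """Greedy left-to-right inference for sequence labeling.

    scores : (length, ntags) per-position tag scores
    trans  : (ntags, ntags) where trans[i, j] scores tag j after tag i
    """
    tags, prev = [], None
    for s in scores:
        t = s if prev is None else s + trans[prev]
        prev = int(np.argmax(t))   # commit to the best tag given the past
        tags.append(prev)
    return tags
```

Exact inference would replace this loop with Viterbi dynamic programming; the point above is that the solver can be trained with either.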
A page has been allocated for my segment of the NIPS 2007 Tutorials. The second part of the tutorial Learning with Large Datasets was given by Alex Gray. Alex had to replace Andrew Moore on short notice because airplane delays conspired against our initial plans. The page contains the slides and a video recording of the lecture I gave at Microsoft Research a few days after NIPS.
During the 4th Annual Gala of the New York Academy of Sciences, I became one of the happy winners of the first Blavatnik Award for Young Scientists. The other finalists were very impressive. Choosing the winners must have been difficult. Leonard Blavatnik told me he attended the Nobel ceremony a few years ago and thought that something similar should be done in New York for younger scientists. Apparently he plans to fund a similar award every year.
The talks page contains pointers to my most significant lectures. Slides are available in both PDF and DjVu formats.
You can now download fast stochastic gradient optimizers for linear Support Vector Machines (SVMs) and Conditional Random Fields (CRFs). Stochastic Gradient Descent has historically been associated with back-propagation algorithms in multilayer neural networks. These nonlinear, nonconvex problems can be very difficult. It is therefore useful to see how Stochastic Gradient Descent performs on simple linear and convex problems. The benchmarks are very clear!
MIT Press has announced the availability of the book Large-Scale Kernel Machines, edited by Léon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston. This book expands the theme of our NIPS 2005 workshop. The book homepage contains useful information. You can even find the complete BibTeX file that was used to generate the list of references.
Browsing http://leon.bottou.com now redirects you to this new website. There is still much work to be done.