I am a research scientist with broad interests in practical and theoretical machine learning. My work on large-scale learning and stochastic gradient algorithms has received attention in recent years. I am also known for the DjVu document compression system. I joined Facebook AI Research in March 2015.
Use the sidebar to navigate this site.
Alex Peysakhovich and I represent Facebook on the organizing committee of the NYC Data Science Seminar Series. This rotating seminar, organized by Columbia, CornellTech, Facebook, Microsoft Research NYC, and New York University, has featured a number of prominent speakers. Although there is a nice web site and a Facebook page, the web site does not yet show up prominently in search engines. Maybe because it uses the nyc top-level domain? Anyway, this is why I am posting this little blurb with lots of links. The URL is http://datascienceseminar.nyc.
I was scavenging my old emails a couple weeks ago and found a copy of an early technical report that not only describes Graph Transformer Networks in a couple pages but also explains why they are defined the way they are.
Although Graph Transformer Networks were introduced twenty years ago, they remain considerably more powerful than most structured output machine learning methods. Not only do they handle the label bias problem as well as CRFs, but their hierarchical and modular structure lends itself to many refinements: they can be trained with weak supervision; they can handle pruned search strategies and adapt training to make the pruning work better; and they also provide a proven framework to reuse existing code and heuristics.
Graph Transformer Networks were also very successful in the real world. They have been used in commercially deployed check reading machines for more than a decade, processing about one billion checks per year. Unfortunately they are described in hard-to-read papers, such as (Bottou et al., CVPR 1997) or the rarely read second half of (LeCun et al., IEEE 1998).
This tech report is now available on my web site.
Why settle for 60000 MNIST training examples when you can have one trillion?
The MNIST8M dataset was generated using the elastic deformation code originally written for (Loosli, Canu, and Bottou, 2007). Unfortunately the original MNIST8M files were accidentally deleted from the NEC servers a couple weeks ago. Instead of regenerating the files, I have repackaged the generation code in a convenient form. You can now generate arbitrary amounts of pseudo-random MNIST training examples. You can even use this code to generate your training data on the fly. We call this the infinite MNIST dataset.
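To illustrate what generating training data on the fly can look like, here is a minimal Python sketch. It is not the repackaged generation code: it replaces the elastic deformations with small random pixel shifts to stay short, and every name in it (`shift_image`, `deformed_example`, `stream`, `max_shift`) is hypothetical. The point it shows is that seeding the random generator with the example index makes an arbitrarily long stream reproducible without storing it.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Image = List[List[int]]          # rows of pixel intensities
Example = Tuple[Image, int]      # (image, label)


def shift_image(image: Image, dx: int, dy: int, fill: int = 0) -> Image:
    """Translate an image by (dx, dy) pixels, padding uncovered pixels with `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < height and 0 <= src_x < width:
                shifted[y][x] = image[src_y][src_x]
    return shifted


def deformed_example(dataset: Sequence[Example], index: int,
                     max_shift: int = 2) -> Example:
    """Return the example numbered `index`, as a deterministic function of the index.

    Assumption of this sketch: indices below len(dataset) return the original
    examples, larger indices return pseudo-randomly shifted copies.
    """
    image, label = dataset[index % len(dataset)]
    if index < len(dataset):
        return image, label
    rng = random.Random(index)   # seeding by index makes each example reproducible
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return shift_image(image, dx, dy), label


def stream(dataset: Sequence[Example], start: int = 0) -> Iterator[Example]:
    """Yield an endless sequence of examples, generated one at a time."""
    index = start
    while True:
        yield deformed_example(dataset, index)
        index += 1
```

A training loop could consume `stream(dataset)` directly, for instance through `itertools.islice`, and restart from any index to obtain exactly the same examples again.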
Our paper “Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising” has appeared in JMLR. This paper takes the example of ad placement to illustrate how one can leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select changes that improve both the short-term and long-term performance. In particular, the paper demonstrates the connection between the classic explore–exploit and correlation–causation issues in machine learning and statistics.
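One standard way to predict the consequence of a change from logged data is importance sampling: reweight each logged outcome by how much more (or less) likely the changed system would have been to take the logged action. The sketch below is a generic illustration of that idea under simplifying assumptions, not the estimator developed in the paper. The names `LoggedEvent`, `ips_estimate`, and `new_policy_prob` are hypothetical, and the logging policy is assumed to have given nonzero probability to every action the new policy might take.

```python
from typing import Callable, List, NamedTuple


class LoggedEvent(NamedTuple):
    context: int        # what the system observed (hypothetical encoding)
    action: int         # what the logging policy chose, e.g. which ad to show
    reward: float       # observed outcome, e.g. 1.0 for a click
    propensity: float   # probability with which the logging policy chose `action`


def ips_estimate(log: List[LoggedEvent],
                 new_policy_prob: Callable[[int, int], float]) -> float:
    """Estimate the average reward the new policy would have obtained.

    new_policy_prob(context, action) is the probability that the new policy
    picks `action` in `context`. Each logged reward is weighted by the ratio
    of new to logged action probabilities.
    """
    if not log:
        raise ValueError("cannot estimate from an empty log")
    total = 0.0
    for event in log:
        if event.propensity <= 0.0:
            raise ValueError("logged propensities must be positive")
        weight = new_policy_prob(event.context, event.action) / event.propensity
        total += weight * event.reward
    return total / len(log)


def always_action_zero(context: int, action: int) -> float:
    """A deterministic candidate policy that always picks action 0."""
    return 1.0 if action == 0 else 0.0


# Data logged by a policy that picks one of two actions uniformly at random.
# In this toy log, action 0 always earns reward 1 and action 1 earns nothing.
log = [
    LoggedEvent(context=0, action=0, reward=1.0, propensity=0.5),
    LoggedEvent(context=0, action=1, reward=0.0, propensity=0.5),
    LoggedEvent(context=0, action=0, reward=1.0, propensity=0.5),
    LoggedEvent(context=0, action=1, reward=0.0, propensity=0.5),
]

estimate = ips_estimate(log, always_action_zero)
```

The logged policy only earned an average reward of 0.5 here, yet the estimate for the always-action-0 policy is 1.0, obtained without ever deploying it. This only works because the logging policy explored both actions, which is where the explore–exploit and correlation–causation issues meet.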