I was digging through my old emails a couple of weeks ago and found a copy of an early technical report that not only describes Graph Transformer Networks in a couple of pages but also explains why they are defined the way they are.
Although Graph Transformer Networks were introduced twenty years ago, they remain considerably more powerful than most structured output machine learning methods. Not only do they handle the label bias problem as well as CRFs, but their hierarchical and modular structure lends itself to many refinements: they can be trained with weak supervision; they can handle pruned search strategies and adapt training to make the pruning work better; and they also provide a proven framework for reusing existing code and heuristics.
Graph Transformer Networks were also very successful in the real world. They have been used in commercially deployed check reading machines for more than a decade, processing about one billion checks per year. Unfortunately they are described in hard-to-read papers, such as (Bottou et al., CVPR 1997) or the rarely read second half of (LeCun et al., IEEE 1998).
This tech report is now available on my web site.