I started working on structured learning systems in the context of my Ph.D. thesis work on speech recognition. I first focused on the combination of time-delay neural networks and dynamic programming techniques. When Bourlard and Wellekens published their first paper combining HMMs and neural networks, I realized that the HMM framework provided much greater opportunities to approach the problem. This also led to the concept of “global training”, which I developed in my 1991 Ph.D. thesis, where I also identified a curious modeling problem that was later termed “the label bias problem”.
Attempts to solve the label bias problem led to the non-probabilistic (LVQ-based) approach described in the very last paragraph of the IJCNN 1991 paper. Meanwhile, Burges and Denker solved the probabilistic puzzle in 1994, leading to the document analysis systems and the graph transformer network work. To approach this work, I would recommend first reading the 1996 draft, which I find much clearer than the published papers (see “Graph Transducer Networks explained”).
My point of view evolved dramatically around 2010, when I started rethinking the connections between structured learning and the emerging deep learning methods. The continuation of this research work can be found in the section on Machine Reasoning and Machine Learning.