Natural Language Processing (Almost) from Scratch

Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu and Pavel Kuksa: Natural Language Processing (Almost) from Scratch, Journal of Machine Learning Research, 12:2493–2537, Aug 2011.

JMLR Link    jmlr-2011.djvu jmlr-2011.pdf jmlr-2011.ps.gz

@article{collobert-2011,
  author = {Collobert, Ronan and Weston, Jason and Bottou, L\'eon and Karlen, Michael and Kavukcuoglu, Koray and Kuksa, Pavel},
  title = {Natural Language Processing (Almost) from Scratch},
  journal = {Journal of Machine Learning Research},
  year = {2011},
  volume = {12},
  pages = {2493--2537},
  month = {Aug},
  url = {http://leon.bottou.org/papers/collobert-2011},
}

Senna

The universal natural language tagger described in this paper, Senna, is actively maintained by Ronan Collobert and can be downloaded from the Senna page. Besides part-of-speech tagging, chunking, named entity recognition, and semantic role labeling, the latest version also outputs syntactic parse trees, still using the neural network architecture described in this paper.

  • State-of-the-art or near-state-of-the-art tagging accuracies.
  • Exceptional tagging speed (POS+NER+Chunk+SRL+Parse at more than 10,000 words per second).
  • Small memory footprint (about 120 MB).
  • Compact source code (about 3,000 lines of C).
papers/collobert-2011.txt · Last modified: 2011/10/08 23:21 by leonb
