Melanie Tosik

How to get started in NLP

Somewhere I read that if you ever have to answer the same question twice, it’s probably a good idea to turn it into a blog post. In keeping with this rule, and to save my future self some time, here is my standard answer to the question: “My background is in * science, and I’m interested in learning NLP. Where do I start?”

Before you dive in, please note that the list below is really just a very general starting point (and likely incomplete). To help navigate the flood of information, I added short descriptions and difficulty estimates in brackets. Basic programming skills (e.g. in Python) are recommended.

Online courses

  • Dan Jurafsky & Chris Manning: Natural Language Processing [great intro video series]
  • Stanford CS224d: Deep Learning for Natural Language Processing [more advanced ML algorithms, deep learning, and NN architectures for NLP]
  • Coursera: Introduction to Natural Language Processing [intro NLP course offered by the University of Michigan]

Libraries and open source

  • spaCy (website, blog) [Python; emerging open-source library with fantastic usage examples, API documentation, and demo applications]
  • Natural Language Toolkit (NLTK) (website, book) [Python; practical intro to programming for NLP, mainly used for teaching]
  • Stanford CoreNLP (website) [Java; high-quality analysis toolkit]

Active blogs

  • natural language processing blog (Hal Daumé)
  • Google Research blog
  • Language Log (Mark Liberman)

Books

  • Speech and Language Processing (Daniel Jurafsky and James H. Martin) [classic NLP textbook that covers all the basics, 3rd edition coming soon]
  • Foundations of Statistical Natural Language Processing (Chris Manning and Hinrich Schütze) [more advanced, statistical NLP methods]
  • Introduction to Information Retrieval (Chris Manning, Prabhakar Raghavan and Hinrich Schütze) [excellent reference on ranking/search]
  • Neural Network Methods in Natural Language Processing (Yoav Goldberg) [deep intro to NN approaches to NLP, primer here]

Miscellaneous

  • How to build a word2vec model in TensorFlow [tutorial]
  • Deep Learning for NLP resources [overview of state-of-the-art resources for deep learning, organized by topic]
  • Last Words: Computational Linguistics and Deep Learning — A look at the importance of Natural Language Processing. (Chris Manning) [article]
  • Natural Language Understanding with Distributed Representation (Kyunghyun Cho) [self-contained lecture notes on ML/NN approaches to NLU]
  • Bayesian Inference with Tears (Kevin Knight) [tutorial workbook]
  • Association for Computational Linguistics (ACL) [journal anthology]
  • Quora: How do I learn Natural Language Processing?

DIY projects and data sets

A thorough list of publicly available NLP data sets has already been created by Nicolas Iderhoff. Beyond these, here are some projects I can recommend to any NLP novice wanting to get their hands dirty:

  • Implement a part-of-speech (POS) tagger based on a hidden Markov model (HMM)
  • Implement the CYK algorithm for parsing context-free grammars
  • Implement a measure of semantic similarity between two given words in a collection of text, e.g. pointwise mutual information (PMI)
  • Implement a Naive Bayes classifier to filter spam
  • Implement a spell checker based on edit distances between words
  • Implement a Markov chain text generator
  • Implement a topic model using latent Dirichlet allocation (LDA)
  • Use word2vec to generate word embeddings from a large text corpus, e.g. Wikipedia
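
To give a feel for the scale of these projects, here is a minimal sketch of the core of the spell checker idea: Levenshtein edit distance plus a nearest-word lookup. The function names (edit_distance, suggest) and the tiny vocabulary are made up for illustration, and a real spell checker would also weigh word frequencies and use a much larger dictionary.

```python
from typing import Iterable


def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn source into target."""
    # previous[j] is the distance between the source prefix processed so far
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into "" takes i deletions
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # delete source_char
                current[j - 1] + 1,                   # insert target_char
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def suggest(word: str, vocabulary: Iterable[str]) -> str:
    """Return the vocabulary entry with the smallest edit distance to word.
    Ties go to whichever entry comes first (hypothetical helper)."""
    return min(vocabulary, key=lambda candidate: edit_distance(word, candidate))


# Illustrative vocabulary, not a real dictionary.
print(edit_distance("kitten", "sitting"))
print(suggest("langauge", ["language", "luggage", "lounge"]))
```

The distance function only keeps two rows of the dynamic programming table in memory, which is enough as long as you need the distance itself and not the sequence of edits.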

NLP on social media

  • Twitter: #nlproc, list of NLPers (by Jason Baldridge)
  • Reddit: /r/LanguageTechnology
  • Medium: Nlp

Reach out on Twitter @meltomene