CS886 Fall07 - Syllabus: Statistical Techniques for Natural Language Processing and Information Retrieval


Objectives

Given the abundance of text on the web, there is an increasing need for automated techniques to process, understand and retrieve information from text.  Unlike traditional databases, natural text is fairly unstructured.  In addition, the meanings of words, phrases and sentences, and the relations among them, are inherently ambiguous.  Probabilities and statistics therefore provide natural tools to cope with this uncertainty.  Furthermore, statistical techniques can take advantage of the abundance of text to verify various hypotheses.

This course focuses on the algorithmic and theoretical foundations of machine learning and computational statistics that are relevant to natural language processing (NLP) and information retrieval (IR).  The techniques and algorithms covered are of general interest and therefore applicable to domains beyond NLP and IR.  Special attention will be paid to providing a unifying view of the machinery offered by machine learning and computational statistics, a view that is not always obvious when focusing on application domains such as NLP and IR.  This course should interest researchers in natural language processing, information retrieval, data mining, machine learning, computational statistics and reasoning under uncertainty, as well as in any other field where statistical techniques may be used (e.g., computational vision, bioinformatics, health informatics, self-management systems, computational finance).  No prior knowledge of machine learning, computational statistics, natural language processing or information retrieval is required.



References

We will make use of three textbooks with additional readings from selected research papers.  The first and third textbooks are available online.

Outline

The statistical techniques we will cover include:

  1. Basics of probability theory, machine learning and statistics
  2. Vector space model
  3. Latent semantic analysis
  4. Classification, regression and feature selection
  5. Markov models
  6. Maximum entropy models
  7. Statistical relational learning

Natural language processing and information retrieval applications:

  1. Part-of-speech tagging
  2. Parsing
  3. Word sense disambiguation
  4. Text categorization/clustering
  5. Topic modeling
  6. Named entity recognition
  7. Coreference resolution
  8. Relation detection
  9. Event detection
  10. Question answering