CS886 Fall07 - Syllabus: Statistical Techniques for Natural Language Processing and Information Retrieval

Objectives

Given the abundance of text on the web, there is an increasing need for automated techniques to process, understand and retrieve information from text. Unlike traditional databases, natural text is fairly unstructured. In addition, the meanings of words, phrases and sentences, and the relations among them, are inherently ambiguous. Probabilities and statistics provide natural tools for coping with this uncertainty. Furthermore, statistical techniques can take advantage of the abundance of text to verify various hypotheses.

This course focuses on the algorithmic and theoretical foundations of machine learning and computational statistics that are relevant to natural language processing (NLP) and information retrieval (IR). The techniques and algorithms covered are of general interest and therefore applicable to domains beyond NLP and IR. Special attention will be paid to providing a unifying view of the machinery offered by machine learning and computational statistics, which is not always obvious when focusing on application domains such as NLP and IR. This course should interest researchers in natural language processing, information retrieval, data mining, machine learning, computational statistics and reasoning under uncertainty, as well as any other field where statistical techniques may be used (e.g., computational vision, bioinformatics, health informatics, self-management systems, computational finance). No prior knowledge of machine learning, computational statistics, natural language processing or information retrieval is required.

References

We will make use of three textbooks with additional readings from selected research papers. The first and third textbooks are available online.

Jurafsky and Martin (2008), Speech and Language Processing, 2nd edition
Manning and Schütze (1999), Foundations of Statistical Natural Language Processing
Manning, Raghavan and Schütze (2008), Introduction to Information Retrieval

Outline

The statistical techniques we will cover include:

Basics of probability theory, machine learning and statistics

Inference
Generalization

Vector space model

Naive Bayes model
TF-IDF (term frequency - inverse document frequency)

Latent semantic analysis

LSI (latent semantic indexing)
LDA (latent Dirichlet allocation)

Classification, regression and feature selection

Support vector machines
Logistic regression
Boosting

Markov models

N-gram models
Hidden Markov models
Markov random fields

Maximum entropy models

Conditional random fields

Statistical relational learning

Markov logic networks

Natural language processing and information retrieval applications:

Part-of-speech tagging
Parsing
Word sense disambiguation
Text categorization/clustering
Topic modeling
Named entity recognition
Coreference resolution
Relation detection
Event detection
Question answering