CS886 Fall07 - Syllabus: Statistical Techniques for Natural Language
Processing and Information Retrieval
Objectives
Given the abundance of text on the web, there is an increasing need for
automated techniques to process, understand and retrieve information
from text. Unlike traditional databases, natural text is fairly
unstructured. In addition, the meanings of words, phrases and
sentences, and the relations among them, are inherently ambiguous.
Probabilities and statistics provide natural tools to cope with this
uncertainty. Furthermore, statistical techniques can take
advantage of the abundance of text to verify various hypotheses.
This course focuses on the algorithmic and theoretical foundations of
machine learning and computational statistics that are relevant to
natural language processing (NLP) and information retrieval (IR).
The techniques and algorithms covered will be of general interest and
therefore applicable to other domains beyond NLP and IR. Special
attention will be paid to providing a unifying view of the machinery
offered by machine learning and computational statistics, a view that
is not always obvious when focusing on application domains such as NLP
and IR. This course should interest researchers in natural language
processing, information retrieval, data mining, machine learning,
computational statistics, reasoning under uncertainty, as well as any
other field where statistical techniques may be used (e.g.,
computational vision, bio-informatics, health informatics,
self-management systems, computational finance). No prior
knowledge of machine learning, computational statistics, natural
language processing or information retrieval is required.
References
We will make use of three textbooks with additional readings
from selected research papers. The first and third textbooks are
available online.
Outline
The statistical techniques we will cover include:
- Basics of probability theory, machine learning and statistics
- Vector space model
  - Naive Bayes model
  - TF-IDF (term frequency - inverse document frequency)
- Latent semantic analysis
  - LSI (latent semantic indexing)
  - LDA (latent Dirichlet allocation)
- Classification, regression and feature selection
  - Support vector machines
  - Logistic regression
  - Boosting
- Markov models
  - N-gram models
  - Hidden Markov models
- Markov random fields
  - Maximum entropy models
  - Conditional random fields
- Statistical relational learning
Natural language processing and information retrieval applications:
- Part-of-speech tagging
- Parsing
- Word sense disambiguation
- Text categorization/clustering
- Topic modeling
- Named entity recognition
- Coreference resolution
- Relation detection
- Event detection
- Question answering