CS886 - Assignments
There will be three assignments, each worth 10% of the final
mark. Each assignment must be done individually (i.e., no
team) and will consist entirely of programming questions.
More precisely, you will be asked to program some algorithms for
natural language understanding and to test them on some
datasets.
Assignment 3: due Dec 2nd
In this assignment, you will gain experience with topic
modeling. More precisely, you will experiment with the LDA
model in the Stanford Topic Modeling Toolbox and the NIPS
dataset, which consists of the papers published at the
Neural Information Processing Systems (NIPS) conference from 1987
to 1999. Your task is to download the Stanford Topic
Modeling Toolbox and the NIPS dataset, experiment with the
Variational Bayes and Gibbs sampling algorithms, discover the main
topics of the papers published at NIPS, and plot the evolution of
those topics over the years. Follow these steps:
- Download and install the Stanford Topic Modeling Toolbox.
- Get familiar with the toolbox by going through each example
posted on the download page of the toolbox.
- Download the following three versions of the NIPS
dataset. The first file contains the complete text for
each paper published at NIPS from 1987 to 1999. The second
and third files contain respectively the first 10,000 and the
first 2,000 characters of each paper published at NIPS.
Choose a file based on your computer's speed and memory. The
topics may be slightly easier to discover when working with the
complete papers, but there should be no problem doing the
assignment with the smaller datasets (your grade will not depend
on which file you use). The files are in comma-separated values
(CSV) format, which can be read directly by the toolbox. Each
line has three fields: 1) id (paper number), 2) nipsXX (the year
in which the paper was published; e.g., nips00 corresponds to 1987
and nips12 to 1999), and 3) word sequence (the words of the paper
between double quotes).
- nipsComplete.csv
- nips10000.csv
- nips2000.csv
- Train an LDA model with Variational Bayes and with Gibbs sampling on
the NIPS dataset using a Scala script similar to example-2-lda-learn.scala.
You will need to change the name of the file and the column in
which the data is located. Compare the running time and accuracy
of Variational Bayes and Gibbs sampling. For the accuracy,
you cannot rely on the log likelihood printed by the script in
the results folder, since the value printed is often NaN (not
a number). Instead, modify the script to compute
perplexity as done in example-5-lda-select.scala,
or simply use example-5-lda-select.scala.
Perplexity is a common measure in natural language processing for
evaluating language models. It indicates how "surprised" the
model is by each word in a test set: the lower the perplexity,
the better. Perplexity is inversely related to the likelihood
of the data (see the Wikipedia article on perplexity); a sketch of
the computation appears after this list of steps.
- Using the algorithm selected in the previous step, optimize
the number of topics for the dataset. More specifically,
compute the perplexity of models trained with 5, 10, 15, 20,
25, 30, 35 and 40 topics and determine which one achieves the
lowest perplexity. Modify the script example-5-lda-select.scala
for this purpose. Ideally, this comparison should be done
by k-fold cross-validation with a reasonable k, e.g., 10. If
you are up to the task (this is optional), modify the script example-5-lda-select.scala
to run k-fold cross-validation. Otherwise, simply re-run the
script multiple times and compute the average perplexity across
several runs for each model (see the sketch after this list of steps).
- Using the algorithm and the number of topics determined in the
previous step, re-train LDA and examine the topics
produced. Modify the script example-2-lda-learn.scala
for this purpose. Examine the top words for each topic in
the file summary.txt produced by the script. Based on
those top words, label at least 5 of the topics with a research
subarea. Tip: NIPS is a conference at the intersection of
machine learning and neuroscience with applications in natural
language processing, computer vision and speech recognition. If
you find it difficult to label some of the topics, this is
normal. Not all topics are coherent and knowledge of the
subareas at NIPS really helps. Nevertheless, do your
best.
- Examine the evolution of the topics from 1987 to 1999.
Modify the script example-4-lda-slice.scala
to slice the data by year and then follow the instructions in
the section "Analyzing topic model outputs in Excel" to plot the
evolution of the topics you labeled in the previous step from
1987 to 1999.
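As a reference for the training and model-selection steps above, here is a minimal sketch (in Java, not tied to the toolbox) of the standard perplexity formula and of choosing the number of topics by lowest average perplexity. The class name, the number of runs and the example numbers are all hypothetical; the actual perplexity values come from your own runs of example-5-lda-select.scala.

import java.util.Arrays;

// Minimal sketch, not part of the Stanford Topic Modeling Toolbox.
public class PerplexitySketch {

    // perplexity = exp(-totalLogLikelihood / numberOfTokens), log-likelihood in nats.
    // Lower perplexity means the model is less "surprised" by the held-out words.
    static double perplexity(double totalLogLikelihood, long numTokens) {
        return Math.exp(-totalLogLikelihood / numTokens);
    }

    public static void main(String[] args) {
        // Hypothetical held-out set: 200,000 word tokens with total log-likelihood -1,400,000.
        System.out.printf("perplexity = %.1f%n", perplexity(-1_400_000, 200_000));

        // Choosing the number of topics: average the perplexity over several runs
        // (or folds) for each candidate and keep the one with the smallest average.
        int[] numTopics = {5, 10, 15, 20, 25, 30, 35, 40};
        double[][] runs = new double[numTopics.length][3]; // fill in from your experiments
        int best = numTopics[0];
        double bestMean = Double.POSITIVE_INFINITY;
        for (int i = 0; i < numTopics.length; i++) {
            double mean = Arrays.stream(runs[i]).average().orElse(Double.NaN);
            if (mean < bestMean) { bestMean = mean; best = numTopics[i]; }
        }
        System.out.println("Lowest average perplexity with " + best + " topics");
    }
}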
Here is what you need to hand in:
- Submit your assignment in PDF format to ppoupart@uwaterloo.ca.
- Training algorithm: report the running time and perplexity of
each algorithm when trained to convergence with the same
parameters. Discuss the results and recommend an
algorithm.
- Number of topics: report the perplexity for models of 5, 10,
15, 20, 25, 30, 35 and 40 topics trained with the algorithm that
you recommended in the previous step. Discuss the results
and recommend a number of topics.
- Labels: re-train LDA with the algorithm and the number of
topics recommended previously. Report the top words for
each topic and label at least 5 topics. Discuss how you
chose the labels.
- Evolution: produce a graph with a curve for each of the topics
that you labeled, indicating the proportion of papers associated
with each topic in each year (a sketch of one way to compute these
proportions follows). Discuss the results.
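For the evolution graph, you ultimately need, for each year, the fraction of papers associated with each labeled topic. The sketch below computes such fractions from a hypothetical input file with one paper per line containing the paper id, the nipsXX year tag and the paper's dominant topic; the file name and format are made up for illustration and do not match the toolbox output, so adapt the parsing to whatever example-4-lda-slice.scala produces.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: proportion of papers per (year, topic) pair, computed
// from a made-up CSV "topic-per-paper.csv" with lines like "123,nips05,topic07".
public class TopicEvolution {
    public static void main(String[] args) throws IOException {
        Map<String, Map<String, Integer>> countsByYear = new TreeMap<>(); // year -> topic -> count
        Map<String, Integer> papersByYear = new TreeMap<>();              // year -> total papers

        for (String line : Files.readAllLines(Paths.get("topic-per-paper.csv"))) {
            String[] fields = line.split(",");
            if (fields.length < 3) continue;              // skip malformed lines
            String year = fields[1].trim();               // e.g., "nips05"
            String topic = fields[2].trim();              // e.g., "topic07"
            countsByYear.computeIfAbsent(year, y -> new TreeMap<>()).merge(topic, 1, Integer::sum);
            papersByYear.merge(year, 1, Integer::sum);
        }

        // One output line per (year, topic) pair: year, topic, proportion of papers.
        for (Map.Entry<String, Map<String, Integer>> e : countsByYear.entrySet()) {
            int total = papersByYear.get(e.getKey());
            for (Map.Entry<String, Integer> t : e.getValue().entrySet()) {
                System.out.printf("%s,%s,%.3f%n", e.getKey(), t.getKey(), t.getValue() / (double) total);
            }
        }
    }
}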
Assignment 2: due Nov 18 (extended from Nov 15)
In this assignment, you will gain experience with named entity
recognition and syntactic chunking (phrase-level part-of-speech
tagging). More precisely, you will experiment with the HMM
and CRF models and various features on a subset of the CONLL-03
corpus with a natural language processing package called
MALLET. Your task is to download MALLET and the
corpus, create some features, and experiment with the sequence
models in MALLET (CRF and HMM) to label the words of the corpus
with entity types and syntactic chunks. Follow these steps:
- Download and install MALLET.
- Get familiar with the "sequence tagging" part of MALLET by
reading about the command line
interface for sequence tagging and the developer's guide
for sequence tagging. This documentation is very
short and incomplete, but that's all there is.
- Download the files ner.txt and schunk.txt. There is one
word per line followed by a space and its label. Blank
lines indicate the end of a sentence. In ner.txt, the labels are
entity types:
- PER: person
- LOC: location
- ORG: organization
- MISC: miscellaneous entity
- O: not an entity
In schunk.txt the labels are syntactic chunks, which are
syntactic labels for phrases:
- VP: verb phrase
- NP: noun phrase
- PP: prepositional phrase
- SBAR: subordinate clause
- ADJP: adjective phrase
- ADVP: adverb phrase
- CONJP: conjunction phrase
- PRT: particle
- INTJ: interjection
- LST: list marker
- O: other
- Run MALLET's SimpleTagger on ner.txt and schunk.txt with the
following commands:
- java -cp "/home/username/mallet-2.0.7/class:/home/username/mallet-2.0.7/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --test lab --threads 2 ner.txt
- java -cp "/home/username/mallet-2.0.7/class:/home/username/mallet-2.0.7/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --test lab --threads 2 schunk.txt
N.B.: The above commands are meant for Linux, so you may need to
adapt the syntax for other operating systems. Also, make
sure to replace "username" with your username. MALLET seems
to be buggy if you use a single thread, so make sure to set the
--threads option to at least 2. MALLET will optimize a CRF
model on half of the data and test it on the other half.
If MALLET takes too long, increase the number of threads based
on the number of cores that your computer has. Also, use
the --iterations option to reduce the number of iterations from
the default of 500 to something smaller like 50.
- Create some features and add them to the files ner.txt and
schunk.txt. The features should be binary (on or
off). Insert all the features that are on for a word
before its label. The format should be:
feature1 feature2 feature3 ... featureN label
For example, if you create the features CAPITAL and
LANDSUFFIX, you can insert them as follows:
James CAPITAL PER
England LANDSUFFIX CAPITAL LOC
land LANDSUFFIX O
food O
In the above example, "James" and "England" are
capitalized, which is why CAPITAL is on. Similarly, "England"
and "land" end with the suffix "land", which is why LANDSUFFIX is
on. Insert only the features that are on before the
label. The word itself is treated as a feature. The
order of the features does not matter. Feel free to add
the syntactic chunk as a feature for named entity recognition, since
the output of a part-of-speech tagger would usually be used as
input to named entity recognition (a sketch of a simple feature
generator appears after this list of steps). The reverse does not
make sense, since the entity types were obtained by manual labeling
and therefore would not be available in practice for a large corpus.
- Check the impact of each feature by running SimpleTagger on
the resulting files and noting how the accuracy improves.
Modify the source code of SimpleTagger to report the precision,
recall and F1 measure for each class. Hint: this can be
done simply by changing TokenAccuracyEvaluator to
PerClassAccuracyEvaluator.
- Experiment with the --orders option, which defines the order
of the Markov chain.
- Modify the source code of SimpleTagger to experiment with an
HMM instead of a CRF (a sketch covering this change and the
per-class evaluator appears at the end of this assignment).
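Rather than adding features by hand, you could generate them with a small program. The sketch below is not part of MALLET: it reads a file in the word-label format described above, turns on the two example features CAPITAL and LANDSUFFIX where they apply, and writes the result to a new file. The file names and the choice of features are placeholders; a real feature set should be richer.

import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: insert binary features between each word and its label,
// e.g. "England LOC" becomes "England LANDSUFFIX CAPITAL LOC".
// Blank lines (sentence boundaries) are preserved.
public class AddFeatures {
    public static void main(String[] args) throws IOException {
        List<String> out = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get("ner.txt"))) {
            if (line.trim().isEmpty()) { out.add(""); continue; }   // keep sentence breaks
            String[] parts = line.trim().split("\\s+");
            String word = parts[0];
            String label = parts[parts.length - 1];
            StringBuilder sb = new StringBuilder(word);
            if (Character.isUpperCase(word.charAt(0))) sb.append(" CAPITAL");
            if (word.toLowerCase().endsWith("land")) sb.append(" LANDSUFFIX");
            // ... add more binary features here (digits, hyphens, other suffixes, ...) ...
            sb.append(' ').append(label);
            out.add(sb.toString());
        }
        try (PrintWriter pw = new PrintWriter("ner-features.txt")) {
            out.forEach(pw::println);
        }
    }
}

The same program can be pointed at schunk.txt; only the feature tests need to change if you use different features for chunking.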
Here is what you need to hand in:
- Submit your assignment in PDF format to ppoupart@uwaterloo.ca.
- Features:
- Describe at least 6 features that you created and
experimented with. You may use the same features or different
features for syntactic chunking and named entity
recognition. Explain how each feature is relevant to
syntactic chunking and/or named entity recognition. Tip:
feel free to use the features we discussed in class or the
features used by the algorithms that were entered in the CONLL-03
shared task.
- Recommend a set of features for syntactic chunking and
report the precision, recall and F1 measure for each syntactic
chunk label (use SimpleTagger's default training scheme where
it trains on half of the dataset and tests on the other
half). Discuss the results. Do the same for named
entity recognition. Tip: to get a sense of how good your
tagger is, check the performance of the algorithms that were
entered in the CONLL-03
shared task. NB: you do not need to match or beat
the results achieved by those algorithms to get a good
mark.
- Model and order of the Markov chain:
- Using the set of features recommended in the previous
section, report the precision, recall and F1 measure for
syntactic chunking and named entity recognition when using an
HMM vs a CRF and when varying the order of the Markov
chain. More precisely, set the --orders option to 0
(independent classification), 1 (joint classification with
correlations between adjacent labels) and 1,2 (joint
classification with correlations between labels that are 1 or 2
words away from each other). When --orders is set to 0,
the CRF becomes a logistic regression model and similarly, the
HMM becomes a naive Bayes model. Stick to the default
trainers for HMMs (maximum likelihood: HMMTrainerByLikelihood)
and CRFs (maximum conditional likelihood:
CRFTrainerByLabelLikelihood). Discuss the impact of the
order of the Markov chain on the results. Discuss the
impact of the model (and the corresponding default training
method) on the results. Which order and which model
would you recommend for syntactic chunking and named entity
recognition?
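For the SimpleTagger modifications mentioned in the steps above (per-class evaluation and the HMM variant), the changes are small. The fragment below is a hedged sketch of the kind of code involved rather than a drop-in patch: the instance lists would come from SimpleTagger's own data loading, and the exact calls should be checked against your copy of MALLET 2.0.7.

import cc.mallet.fst.HMM;
import cc.mallet.fst.HMMTrainerByLikelihood;
import cc.mallet.fst.PerClassAccuracyEvaluator;
import cc.mallet.fst.TransducerEvaluator;
import cc.mallet.types.InstanceList;

// Sketch of the two modifications discussed above:
// (1) report per-class precision/recall/F1 with PerClassAccuracyEvaluator
//     instead of overall token accuracy with TokenAccuracyEvaluator, and
// (2) train an HMM by maximum likelihood instead of a CRF.
public class HmmSketch {
    public static void trainAndEvaluate(InstanceList training, InstanceList testing) {
        HMM hmm = new HMM(training.getPipe(), null);
        // Add states connected as observed in the training data
        // (or use addFullyConnectedStatesForLabels()).
        hmm.addStatesForLabelsConnectedAsIn(training);

        HMMTrainerByLikelihood trainer = new HMMTrainerByLikelihood(hmm);

        TransducerEvaluator trainEval = new PerClassAccuracyEvaluator(training, "training");
        TransducerEvaluator testEval = new PerClassAccuracyEvaluator(testing, "testing");

        trainer.train(training, 50);   // 50 iterations, as suggested above for slow machines
        trainEval.evaluate(trainer);
        testEval.evaluate(trainer);
    }
}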
Assignment 1: due Oct 23
In this assignment, you will gain some experience with text
categorization. More precisely, you will test several
classification algorithms (naive Bayes, SVM and decision tree) on
a corpus of financial news articles (Reuters-21578) with a data
mining package called WEKA. Your task is to download WEKA
and the corpus, then experiment with various pre-processing
techniques to transform the documents into a vector space
representation and finally experiment with the classification
algorithms. Follow these steps:
- Download
and install WEKA.
- Get familiar with WEKA by going through a tutorial
and reading the user manual (file weka-manual.pdf in the root
directory of the WEKA package).
- Download the
corpus Reuters21578-Apte-10Cat.arff. It consists of
9981 news articles published by Reuters in 1987 in 10 different
categories (crude, acq, interest, grain, earn, money-fx, trade,
wheat, corn, ship). The corpus is already in the ARFF
format, which can be loaded directly into WEKA. You can
also open the corpus with a text editor to see what the news
articles look like.
- Launch WEKA and start the "Explorer".
- Load the corpus into WEKA by clicking on the "preprocess" tab,
followed by the "open file" button.
- Convert the corpus into a vector space representation by
applying a filter. Click on the button "choose" under
filter. Select
"weka"-->"filters"-->"unsupervised"-->"attribute"-->"StringToWordVector".
Edit the properties of the filter StringToWordVector.
Experiment with different values for the following properties:
IDFTransform, TFTransform, lowerCaseTokens, minTermFreq,
stemmer, tokenizer and useStopList. All other properties
should remain at their default values. Click on the
"more" button for a brief explanation of the filter and its
properties. Once the properties are configured, click
on "apply" to run the filter.
- Click on the tab "classify" to run a classifier. Make
sure that the class attribute is "class" otherwise some
classifiers may not be available and the results would be wrong
anyway. Experiment with the following classifiers:
naive-Bayes model
("weka"-->"classifier"-->"bayes"-->"naiveBayesMultinomial"),
support vector machine
("weka"-->"classifier"-->"functions"-->"SMO") and
decision trees
("weka"-->"classifier"-->"tree"-->"J48"). See the
instructions below for the parameters of each classifier that
you shold experiment with. Compare the results when you
train and test with the same data and when you do a 10-fold
cross validation.
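If you prefer to script these experiments instead of clicking through the Explorer, the same steps can be done through WEKA's Java API, as in the sketch below. The StringToWordVector settings shown are examples only, not a recommended configuration; the stemmer, tokenizer and useStopList properties can be set the same way through their setters.

import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Sketch of the Explorer steps done programmatically with the WEKA API.
public class TextCategorizationSketch {
    public static void main(String[] args) throws Exception {
        // Load the corpus and mark the "class" attribute as the class.
        Instances raw = DataSource.read("Reuters21578-Apte-10Cat.arff");
        raw.setClassIndex(raw.attribute("class").index());

        // Convert the documents into a vector space representation.
        StringToWordVector filter = new StringToWordVector();
        filter.setIDFTransform(true);        // example value, not a recommendation
        filter.setTFTransform(true);
        filter.setLowerCaseTokens(true);
        filter.setMinTermFreq(1);
        filter.setInputFormat(raw);
        Instances data = Filter.useFilter(raw, filter);
        data.setClassIndex(data.attribute("class").index());

        // Compare accuracy on the training data with 10-fold cross-validation.
        Classifier[] classifiers = { new NaiveBayesMultinomial(), new SMO(), new J48() };
        for (Classifier c : classifiers) {
            Evaluation onTrain = new Evaluation(data);
            c.buildClassifier(data);
            onTrain.evaluateModel(c, data);                       // test on the training set

            Evaluation cv = new Evaluation(data);
            cv.crossValidateModel(c, data, 10, new Random(1));    // 10-fold cross-validation

            System.out.printf("%s: training accuracy %.1f%%, cross-validation accuracy %.1f%%%n",
                    c.getClass().getSimpleName(), onTrain.pctCorrect(), cv.pctCorrect());
        }
    }
}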
What to hand in:
- Submit your assignment electronically to ppoupart@uwaterloo.ca
- Preprocessing: Describe in your own words how the
properties IDFTransform, TFTransform, lowerCaseTokens,
minTermFreq, stemmer, tokenizer and useStopList affect the
filter StringToWordVector. For each property, indicate how
different values impact the resulting vector space
representation. Also indicate which joint configuration of
the properties is the most suitable for text categorization in
your opinion.
- Classification: After processing the corpus with the
StringToWordVector based on the joint configuration of
properties that you recommended, experiment with
naiveBayesMultinomial, SMO (Support vector machine) and J48
(decision tree). Report the following results:
- How much overfitting (difference between accuracy based on
10-fold cross validation and testing with the training set)
occurs for each classifier when using its default
configuration? Briefly explain.
- Describe in your own words the effect of the following
parameters for J48 and SMO, and recommend a joint
configuration of the parameters (a sketch of how to set them
programmatically appears at the end of this assignment):
- J48: unpruned, minNumObj
- SMO: polyKernel-->exponent, RBFKernel-->gamma
- Print the output of each classifier (naiveBayesMultinomial,
J48 and SMO) for the parametrization that you recommended in
the previous question. Which classifier is best when
comparing accuracy and time? Explain briefly.
NB: SMO and J48 may be slow depending on the machine you use and the
parametrization selected. If it is taking too long to run
them, reduce the size of the dataset by running the filter
"unsupervised"-->"instance"-->"resample". Use this
filter to sample (without replacement) a subset of the corpus that
allows you to run your experiments at a decent pace. Report
the size of your sample in your assignment. You can also
use LibSVM instead of SMO since LibSVM is faster than SMO; however,
you may have to download LibSVM separately if it did not come with
WEKA.
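For the J48 and SMO parameters listed in the classification item above, here is a hedged sketch of how they map onto the corresponding WEKA classes. The values are placeholders for your own experiments, not recommended settings; the configured classifiers can then be evaluated exactly as in the earlier sketch.

import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.classifiers.functions.supportVector.RBFKernel;
import weka.classifiers.trees.J48;

// Sketch: setting the J48 and SMO parameters discussed above through the API.
public class ParameterSketch {
    public static void main(String[] args) {
        // J48: pruning and the minimum number of instances per leaf.
        J48 j48 = new J48();
        j48.setUnpruned(true);     // "unpruned" property
        j48.setMinNumObj(5);       // "minNumObj" property (placeholder value)

        // SMO with a polynomial kernel: vary the exponent.
        SMO smoPoly = new SMO();
        PolyKernel poly = new PolyKernel();
        poly.setExponent(2.0);     // "polyKernel --> exponent" (placeholder value)
        smoPoly.setKernel(poly);

        // SMO with an RBF kernel: vary gamma.
        SMO smoRbf = new SMO();
        RBFKernel rbf = new RBFKernel();
        rbf.setGamma(0.01);        // "RBFKernel --> gamma" (placeholder value)
        smoRbf.setKernel(rbf);

        // Train and evaluate these configured classifiers as in the earlier sketch.
    }
}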