CS886 - Assignments


There will be three assignments, each worth 10% of the final mark.  Each assignment must be done individually (i.e., no teams) and will consist entirely of programming questions.  More precisely, you will be asked to implement some algorithms for natural language understanding and to test them on some datasets.

Assignment 3: due Dec 2nd

In this assignment, you will gain experience with topic modeling.  More precisely, you will experiment with the LDA model in the Stanford Topic Modeling Toolbox on the NIPS dataset, which consists of the papers published at the Neural Information Processing Systems (NIPS) conference from 1987 to 1999.  Your task is to download the Stanford Topic Modeling Toolbox and the NIPS dataset, experiment with the Variational Bayes and Gibbs sampling algorithms, discover the main topics of the papers published at NIPS, and plot the evolution of those topics over the years.  Follow these steps:

  1. Download and install the Stanford Topic Modeling Toolbox.
  2. Get familiar with the toolbox by going through each example posted on the download page of the toolbox.
  3. Download the following three versions of the NIPS dataset.  The first file contains the complete text of each paper published at NIPS from 1987 to 1999.  The second and third files contain, respectively, the first 10,000 and the first 2,000 characters of each paper.  Choose a file based on your computer's speed and memory.  The topics may be slightly easier to discover when working with complete papers, but it should be possible to do the assignment with the smaller datasets (your grade will not depend on which file you use).  The files are in comma-separated values (CSV) format, which can be read directly by the toolbox.  Each line has three fields: 1) id (paper #), 2) nipsXX (the year in which the paper was published; e.g., nips00 corresponds to 1987 and nips12 to 1999) and 3) word sequence (the words of the paper between double quotes).
    1. nipsComplete.csv
    2. nips10000.csv
    3. nips2000.csv
  4. Train an LDA model by Variational Bayes and by Gibbs sampling on the NIPS dataset with a Scala script similar to example-2-lda-learn.scala (a sketch adapted to the NIPS files appears after this list).  You will need to change the name of the file and the column in which the data is located.  Compare the running time and accuracy of Variational Bayes and Gibbs sampling.  For accuracy, you cannot rely on the log-likelihood printed by the script in the results folder since the value printed is often NaN (not a number).  Instead, modify the script to compute perplexity as done in example-5-lda-select.scala, or simply use example-5-lda-select.scala.  Perplexity is a common measure in natural language processing for evaluating language models: it indicates how "surprised" the model is by each word in a test set, and is computed as the exponential of the negative average log-likelihood per word.  The lower the perplexity, the better.  Perplexity is inversely related to the likelihood of the data (see the Wikipedia article on perplexity).
  5. Using the algorithm selected in the previous step, optimize the number of topics in the dataset.  More specifically, compute the perplexity of models trained with 5, 10, 15, 20, 25, 30, 35 and 40 topics and determine which one achieves the lowest perplexity.  Modify the script example-5-lda-select.scala for this purpose.  Ideally this comparison should be done by k-fold cross-validation with a reasonable k, e.g., 10.  If you are up to the task (this is optional), modify the script example-5-lda-select.scala to run k-fold cross-validation.  Otherwise, simply re-run the script multiple times and compute the average perplexity across several runs for each model.
  6. Using the algorithm and the number of topics determined in the previous steps, re-train LDA and examine the topics produced.  Modify the script example-2-lda-learn.scala for this purpose.  Examine the top words for each topic in the file summary.txt produced by the script.  Based on those top words, label at least 5 of the topics with a research subarea.  Tip: NIPS is a conference at the intersection of machine learning and neuroscience, with applications in natural language processing, computer vision and speech recognition.  If you find it difficult to label some of the topics, this is normal: not all topics are coherent, and knowledge of the subareas at NIPS really helps.  Nevertheless, do your best.
  7. Examine the evolution of the topics from 1987 to 1999.  Modify the script example-4-lda-slice.scala to slice the data by year and then follow the instructions in the section "Analyzing topic model outputs in Excel" to plot the evolution of the topics you labeled in the previous step from 1987 to 1999.
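
As a starting point for steps 4 and 5, here is a minimal sketch adapted from example-2-lda-learn.scala for the NIPS files (paper id in column 1, word sequence in column 3).  The pipeline stages, smoothing values and iteration counts follow the toolbox examples and are reasonable defaults rather than required settings; adjust the file name and parameters to your setup, and check the snippet against the examples shipped with your version of the toolbox.

    // sketch adapted from example-2-lda-learn.scala (Stanford Topic Modeling Toolbox)
    import scalanlp.io._;
    import scalanlp.stage._;
    import scalanlp.stage.text._;
    import scalanlp.text.tokenize._;
    import scalanlp.pipes.Pipes.global._;
    import edu.stanford.nlp.tmt.stage._;
    import edu.stanford.nlp.tmt.model.lda._;

    val source = CSVFile("nips2000.csv") ~> IDColumn(1);  // paper id is in column 1

    val tokenizer = {
      SimpleEnglishTokenizer() ~>         // tokenize on whitespace and punctuation
      CaseFolder() ~>                     // lowercase everything
      WordsAndNumbersOnlyFilter() ~>      // drop tokens that are neither words nor numbers
      MinimumLengthFilter(3)              // keep terms with at least 3 characters
    }

    val text = {
      source ~>
      Column(3) ~>                        // the word sequence is in column 3
      TokenizeWith(tokenizer) ~>
      TermCounter() ~>
      TermMinimumDocumentCountFilter(4) ~>  // drop terms that appear in fewer than 4 papers
      TermDynamicStopListFilter(30) ~>      // drop the 30 most frequent terms
      DocumentMinimumLengthFilter(5)        // keep papers with at least 5 remaining terms
    }

    val dataset = LDADataset(text);
    val params  = LDAModelParams(numTopics = 20, dataset = dataset,
                                 topicSmoothing = 0.01, termSmoothing = 0.01);
    val modelPath = file("lda-" + dataset.signature + "-" + params.signature);

    // Variational Bayes (CVB0 in the toolbox); for Gibbs sampling use TrainGibbsLDA instead.
    TrainCVB0LDA(params, dataset, output = modelPath, maxIterations = 1000);
    // TrainGibbsLDA(params, dataset, output = modelPath, maxIterations = 1500);

For step 5, vary numTopics and compute the perplexity of each trained model on held-out documents as in example-5-lda-select.scala, keeping the setting with the lowest perplexity.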

Here is what you need to hand in:

  1. Submit your assignment in PDF format to ppoupart@uwaterloo.ca.
  2. Training algorithm: report the running time and perplexity of each algorithm when trained to convergence with the same parameters.  Discuss the results and recommend an algorithm.
  3. Number of topics: report the perplexity for models of 5, 10, 15, 20, 25, 30, 35 and 40 topics trained with the algorithm that you recommended in the previous step.  Discuss the results and recommend a number of topics.
  4. Labels: re-train LDA with the algorithm and the number of topics recommended previously.  Report the top words for each topic and label at least 5 topics.  Discuss how you chose the labels.
  5. Evolution: produce a graph with a curve for each of the topics that you labeled indicating the proportion of papers associated with each topic in each year.  Discuss the results. 

Assignment 2: due Nov 18

In this assignment you will gain experience with named entity recognition and syntactic chunking (phrase-level part-of-speech tagging).  More precisely, you will experiment with the HMM and CRF models and various features on a subset of the CoNLL-03 corpus with a natural language processing package called MALLET.  Your task is to download MALLET and the corpus, create some features, and experiment with the sequence models in MALLET (CRF and HMM) to label the words of the corpus with entity types and syntactic chunks.  Follow these steps:

  1. Download and install MALLET.
  2. Get familiar with the "sequence tagging" part of MALLET by reading about the command-line interface for sequence tagging and the developer's guide for sequence tagging.  This documentation is very short and incomplete, but that's all there is.
  3. Download the files ner.txt and schunk.txt.  There is one word per line, followed by a space and its label.  Blank lines indicate the end of a sentence.  In ner.txt, the labels are entity types:
    1. PER: person
    2. LOC: location
    3. ORG: organization
    4. MISC: miscellaneous
    5. O: other
In schunk.txt, the labels are syntactic chunks, which are syntactic labels for phrases:
    1. VP: verb phrase
    2. NP: noun phrase
    3. PP: prepositional phrase
    4. SBAR: subordinate clause
    5. ADJP: adjective phrase
    6. ADVP: adverb phrase
    7. CONJP: conjunction phrase
    8. PRT: particle
    9. INTJ: interjection
    10. LST: list marker
    11. O: other
  4. Run MALLET's SimpleTagger on ner.txt and schunk.txt with commands along the following lines:
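    (The install path, MALLET version and model file names below are placeholders; adjust them to your own setup.  The option values are reasonable starting points rather than required settings.)

              java -cp "/home/username/mallet-2.0.7/class:/home/username/mallet-2.0.7/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --test lab --training-proportion 0.5 --threads 2 --model-file nercrf ner.txt

              java -cp "/home/username/mallet-2.0.7/class:/home/username/mallet-2.0.7/lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --test lab --training-proportion 0.5 --threads 2 --model-file chunkcrf schunk.txt
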
    N.B.: The above commands are meant for Linux, so you may need to adapt the syntax for other operating systems.  Also, make sure to replace "username" with your username.  MALLET seems to be buggy when run with a single thread, so make sure to set the --threads option to at least 2.  MALLET will optimize a CRF model on half of the data and test it on the other half.  If MALLET takes too long, increase the number of threads based on the number of cores your computer has.  Also, use the --iterations option to reduce the number of iterations from the default of 500 to something smaller, such as 50.
  5. Create some features and add them to the files ner.txt and schunk.txt.  The features should be binary (on or off).  Insert all the features that are on for a word before its label.  The format should be:

              word feature1 feature2 ... featureN label

    For example, if you create the features CAPITAL  and LANDSUFFIX, you can insert them as follows:

              James CAPITAL PER
              England LANDSUFFIX CAPITAL LOC
              land LANDSUFFIX O
              food O
In the above example, "James" and "England" are capitalized, which is why CAPITAL is on.  Similarly, "England" and "land" end with the suffix "land", which is why LANDSUFFIX is on.  Insert only the features that are on before the label.  The word itself is treated as a feature.  The order of the features does not matter.  Feel free to add the syntactic chunk as a feature for named entity recognition, since the output of a part-of-speech tagger would usually be used as input to named entity recognition.  The reverse does not make sense, since the entity types were obtained by manual labeling and therefore would not be available in practice for a large corpus.
  6. Check the impact of each feature by running SimpleTagger on the resulting files and noting how the accuracy improves.  Modify the source code of SimpleTagger to report the precision, recall and F1 measure for each class.  Hint: this can be done simply by changing TokenAccuracyEvaluator to PerClassAccuracyEvaluator (see the sketch after this list).
  7. Experiment with the --orders option, which defines the order of the Markov chain.
  8. Modify the source code of SimpleTagger to experiment with an HMM instead of a CRF.
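
For step 6, the evaluator change in SimpleTagger.java amounts to swapping one constructor call; a minimal sketch is below.  The variable names (trainingData, testData) and the surrounding code are assumptions based on typical MALLET 2.0.x sources, so adapt the snippet to wherever your copy of SimpleTagger builds its TokenAccuracyEvaluator.

    import cc.mallet.fst.PerClassAccuracyEvaluator;
    import cc.mallet.fst.TransducerEvaluator;
    import cc.mallet.types.InstanceList;

    // Where SimpleTagger constructs the token-level evaluator for --test lab,
    // build a per-class evaluator instead; it reports precision, recall and F1
    // for each label rather than overall token accuracy.
    TransducerEvaluator eval =
        new PerClassAccuracyEvaluator(new InstanceList[] {trainingData, testData},
                                      new String[] {"Training", "Testing"});

For step 8, MALLET also ships HMM classes (cc.mallet.fst.HMM with HMMTrainerByLikelihood) that can replace the CRF and its trainer in the same part of the code; again, verify the exact class names against your MALLET version.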

Here is what you need to hand in:

  1. Submit your assignment in PDF format to ppoupart@uwaterloo.ca.
  2. Features:
  3. Model and order of the Markov chain:


Assignment 1: due Oct 23

In this assignment, you will gain some experience with text categorization.  More precisely, you will test several classification algorithms (naive Bayes, SVM and decision tree) on a corpus of financial news articles (Reuters-21578) with a data mining package called WEKA.  Your task is to download WEKA and the corpus, then experiment with various pre-processing techniques to transform the documents into a vector space representation, and finally experiment with the classification algorithms.  Follow these steps:

  1. Download and install WEKA
  2. Get familiar with WEKA by going through a tutorial and reading the user manual (file weka-manual.pdf in the root directory of the WEKA package).
  3. Download the corpus Reuters21578-Apte-10Cat.arff.  It consists of 9981 news articles published by Reuters in 1987 in 10 different categories (crude, acq, interest, grain, earn, money-fx, trade, wheat, corn, ship).  The corpus is already in the ARFF format, which can be loaded directly into WEKA.  You can also open the corpus with a text editor to see what the news articles look like.
  4. Launch WEKA and start the "Explorer".
  5. Load the corpus into WEKA by clicking on the "preprocess" tab, followed by the "open file" button.
  6. Convert the corpus into a vector space representation by applying a filter.  Click on the button "choose" under filter.  Select "weka"-->"filters"-->"unsupervised"-->"attribute"-->"StringToWordVector".  Edit the properties of the filter StringToWordVector.  Experiment with different values for the following properties: IDFTransform, TFTransform, lowerCaseTokens, minTermFreq, stemmer, tokenizer and useStopList.  All other properties should keep their default values.  Click on the button "more" for a brief explanation of the filter and its properties.  Once the properties are configured, click on "apply" to run the filter.
  7. Click on the tab "classify" to run a classifier.  Make sure that the class attribute is "class", otherwise some classifiers may not be available and the results would be wrong anyway.  Experiment with the following classifiers: naive Bayes ("weka"-->"classifier"-->"bayes"-->"naiveBayesMultinomial"), support vector machine ("weka"-->"classifier"-->"functions"-->"SMO") and decision tree ("weka"-->"classifier"-->"tree"-->"J48").  See the instructions below for the parameters of each classifier that you should experiment with.  Compare the results when you train and test with the same data and when you do 10-fold cross-validation.  (A scripted equivalent of steps 6 and 7 is sketched after this list, if you prefer WEKA's Java API to the Explorer.)
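
If you prefer to script steps 6 and 7 rather than click through the Explorer, the following minimal sketch uses WEKA's Java API to apply StringToWordVector and cross-validate one classifier.  The property values are arbitrary starting points rather than a recommended configuration, and the sketch assumes the class attribute is named "class" (as noted in step 7); the Explorer GUI is entirely sufficient for this assignment.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayesMultinomial;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class ReutersExperiment {
      public static void main(String[] args) throws Exception {
        // Load the corpus and mark the nominal "class" attribute as the class.
        Instances data = DataSource.read("Reuters21578-Apte-10Cat.arff");
        data.setClassIndex(data.attribute("class").index());

        // Vector space representation: a few StringToWordVector properties.
        StringToWordVector filter = new StringToWordVector();
        filter.setIDFTransform(true);
        filter.setTFTransform(true);
        filter.setLowerCaseTokens(true);
        filter.setMinTermFreq(2);
        filter.setInputFormat(data);
        Instances vectors = Filter.useFilter(data, filter);
        vectors.setClassIndex(vectors.attribute("class").index());

        // 10-fold cross-validation with multinomial naive Bayes; swap in
        // weka.classifiers.functions.SMO or weka.classifiers.trees.J48
        // to compare the other classifiers.
        Classifier classifier = new NaiveBayesMultinomial();
        Evaluation eval = new Evaluation(vectors);
        eval.crossValidateModel(classifier, vectors, 10, new Random(1));
        System.out.println(eval.toSummaryString());
      }
    }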

What to hand in:

  1. Submit your assignment electronically to ppoupart@uwaterloo.ca
  2. Preprocessing:  Describe in your own words how the properties IDFTransform, TFTransform, lowerCaseTokens, minTermFreq, stemmer, tokenizer and useStopList affect the filter StringToWordVector.  For each property, indicate how different values impact the resulting vector space representation.  Also indicate which joint configuration of the properties is the most suitable for text categorization in your opinion.
  3. Classification:  After processing the corpus with the StringToWordVector based on the joint configuration of properties that you recommended, experiment with naiveBayesMultinomial, SMO (Support vector machine) and J48 (decision tree).  Report the following results:
    1. How much overfitting (difference between accuracy based on 10-fold cross validation and testing with the training set) occurs for each classifier when using its default configuration?  Briefly explain.
    2. Describe in your own words the effect of the following parameters for J48 and SMO, and recommend a joint configuration of the parameters
    3. Print the output of each classifier (naiveBayesMultinomial, J48 and SMO) for the parametrization that you recommended in the previous question.  Which classifier is best when comparing accuracy and time?  Explain briefly.
NB: SMO and J48 may be slow depending on the machine you use and the parametrization selected.  If they are taking too long to run, reduce the size of the dataset by running the filter "unsupervised"-->"instance"-->"Resample".  Use this filter to sample (without replacement) a subset of the corpus that allows you to run your experiments at a decent pace.  Report the size of your sample in your assignment.  You can also use LibSVM instead of SMO since LibSVM is faster than SMO; however, you may have to download LibSVM separately if it did not come with WEKA.