CS480/680 Spring 2019 - Introduction to Machine Learning

There will be five assignments, each worth 8% of the final mark (6% for CS680). Assignments are done individually (i.e., no team). Each assignment will have a theoretical part and a programming part. Some assignments may make use of TensorFlow or PyTorch. For GPU acceleration, feel free to use Google's Colaboratory environment. This is a free cloud service where you can run Python code (including TensorFlow, which is pre-installed) with GPU acceleration. A virtual machine with two CPUs and one Nvidia K80 GPU will run up to 12 hours after which it must be restarted. The following steps are recommended:

Create a Python notebook in Google Colab
Click on "edit", then "notebook settings" and select "None" (CPU) or "GPU" for hardware acceleration.

The approximate out and due dates are:

A1: out May 13, due May 24 (11:59 pm)
A2: out May 29, due June 10 (11:59 pm)
A3: out June 12, due June 30 (11:59 pm)
A4: out July 2, due July 12 (11:59 pm)
A5: out July 15, due July 30 (11:59 pm)

On the due date of an assignment, the work done to date should be submitted electronically on the LEARN website; further material may be submitted with a 2% penalty for every rounded up hour past the deadline. For example, an assignment submitted 5 hours and 15 min late will receive a penalty of ceiling(5.25) * 2% = 12%. Assignments submitted more than 50 hours late will not be marked.

Assignment 1: due May 24 (11:59 pm)

Assignment handout with latex source
Dataset for K-nearest neighbour: knn-dataset.zip
- Problem: this data is a modified version of the Optical Recognition of Handwritten Digits Dataset from the UCI repository. It contains pre-processed black and white images of the digits 5 and 6. Each attribute indicates how many pixels are black in a patch of 4 x 4 pixels.
- Format: there is one row per image and one column per attribute. The class labels are 5 and 6.
- The training set is already divided into 10 subsets for 10-fold cross validation.
Dataset for linear regression: regression-dataset.zip
- Problem: this data corresponds to samples from a 2D surface that you can plot to visualize how linear regression is working.
- Format: there is one row per data instance and one column per attribute. The targets are real values.
- The training set is already divided into 10 subsets for 10-fold cross validation.
Priyank Jaini (pjaini [at] uwaterloo [dot] ca) and Chengyao Fu (c36fu [at] Waterloo [dot] ca) are the TAs responsible for A1. They will hold special office hours to answer questions about A1 on Wednesday May 22, 2-4pm in the AI lab (DC2306C).

Assignment 2: due June 10 (11:59 pm)

Assignment handout with latex source
Dataset: use the dataset for K-nearest neighbours from Assignment 1.
Zeou Hu (z97hu [at] uwaterloo [dot] ca) and Amir Farrag (a2farrag [at] uwaterloo [dot] ca) are the TAs responsible for A2. They will hold special office hours to answer questions about A2 on Wednesday June 5, 2-4 pm in the AI lab (DC2306C). Pascal also holds regular office hours every Wednesday 10:30-11:20 am and Friday 10-10:50 am in DC2514.

Assignment 3: due June 30 (11:59 pm)

Assignment handout with latex source
Dataset for nonlinear regression: nonlinear-regression-dataset.zip
- Problem: this data corresponds to samples from a 2D surface that you can plot to visualize how regression is working.
- Format: there is one row per data instance and one column per attribute. The targets are real values.
- Visualization: 3D view of the data.
Chengyao Fu (c36fu [at] uwaterloo [dot] ca) and Ashutosh Devendrakumar Adhikari (adadhika [at] uwaterloo [dot] ca) are the TAs responsible for A3. They will hold special office hours to answer questions about A3 on June 26, 2-4 pm in the AI lab (DC2306C).

Assignment 4: due July 12 (11:59 pm)

Assignment handout with latex source
cifar10_cnn_cs480.py
Amir Farrag (a2farrag [at] uwaterloo [dot] ca) and Priyansh Narang (p2narang [at] Waterloo [dot] ca) are the TAs responsible for A4. They will hold special office hours to answer questions about A4 on July 10, 2-4 pm in the AI lab (DC2306C).

Assignment 5: due July 30 (11:59 pm)

Assignment handout with latex source
Code for Q1a: cs480_char_rnn_classification_tutorial.ipynb or cs480_char_rnn_classification_tutorial.py
Code for Q1b: cs480_char_rnn_generation_tutorial.ipynb or cs480_char_rnn_generation_tutorial.py
Code for Q1c: cs480_seq2seq_translation_tutorial.ipynb or cs480_seq2seq_translation_tutorial.py
Code for Q2: cs480_leadersprize.ipynb or cs480_leadersprize.py
Instructions for Question 2:
- 2a: After downloading the data from the DataCup competition website and reading the data documentation, modify the code in cs480_leadersprize.ipynb to produce a classification of each claim that depends on (i) the claim only, (ii) the claim and the claimant only and (iii) the claim, the claimant and the 5 sentences from the related articles that are most similar to the claim. For (i), you can simply run the code in cs480_leadersprize.ipynb without any modification. For (ii), concatenate the claim and the claimant in one text sequence that is fed to the RNN. For (iii), concatenate the claim, claimant and the 5 sentences produced by the function preprocessed_articles() in one text sequence. Feel free to run your code for as many iterations as you think is appropriate. Hand in
  - Your code
  - Two graphs where the y axis is the classification accuracy and the x axis is the number of iterations. The first graph will have three curves for the training accuracy of (i), (ii) and (iii). The second graph will have three curves for the testing accuracy of (i), (ii) and (iii).
  - Discuss the results, i.e., why does the inclusion of some information improve or reduce the classification accuracy?
- 2b: Read the following blog: How to code the transformer in PyTorch. By using the code provided in the blog, implement the encoder of the transformer network to classify each claim. You do not need the decoder since the task is to classify a claim (not to translate a claim). Similar to 2a, implement a transformer encoder that produces a classification based on (i) the claim only, (ii) the claim and the claimant only (concatenated in one sequence) and (iii) the claim, the claimant and the 5 sentences from the related articles that are most similar to the claim (concatenated in one sequence). There are many design questions that are left open. Feel free to try different pre-trained embedding techniques to tokenize the sequence and embed the tokens. Feel free to experiment with different number of layers in your architecture and to train for as many iterations as you think is appropriate. You will encounter additional design choices. There is no perfect design and grades will not be given based on the final accuracy, but based on your understanding of the design choices that you make. Simply describe and justify those design choices in your assignment submission. Hand in
  - Your code with an explanation of each design choice
  - Two graphs where the y axis is the classification accuracy and the x axis is the number of iterations. The first graph will have three curves for the training accuracy of (i), (ii) and (iii). The second graph will have three curves for the testing accuracy of (i), (ii) and (iii).
  - Discuss the results, i.e., why does the inclusion of some information improve or reduce the classification accuracy? Why does the transformer encoder perform better or worse than the RNN encoder?
Chengyao Fu (c36fu [at] uwaterloo [dot] ca) and Sagar Kulkarni (s22kulka [at] uwaterloo [dot] ca) are the TAs responsible for A5. They will hold special office hours to answer questions about A5 on July 24, 2-4 pm in the AI lab (DC2306C).