There will be five assignments, each worth 15% of the final mark. Assignments are done individually (i.e., no teams). The assignments will consist of a mixture of theoretical questions and programming questions. Some assignments may make use of TensorFlow or PyTorch. For GPU and TPU acceleration, feel free to use Google's Colaboratory environment. This is a free cloud service where you can run Python code (including TensorFlow and PyTorch, which are pre-installed) with GPU or TPU acceleration. A virtual machine with two CPUs and one GPU or TPU will run for up to 12 hours, after which it must be restarted. The following steps are recommended:
- Create a Python notebook in Google Colab
- Click on "edit", then "notebook settings" and select "None" (CPU), "GPU" or "TPU" for hardware acceleration.
The approximate out and due dates are:
- A1: out Jan 16, due Jan 27 (11:59 pm)
- A2: out Jan 30, due Feb 10 (11:59 pm)
- A3: out Feb 13, due Mar 3 (11:59 pm)
- A4: out Mar 6, due Mar 17 (11:59 pm)
- A5: out Mar 20, due Mar 31 (11:59 pm)
On the due date of an assignment, the work done to date should be submitted electronically on the LEARN website; further material may be submitted with a 2% penalty for every hour (rounded up) past the deadline. For example, an assignment submitted 5 hours and 15 minutes late will receive a penalty of ceiling(5.25) * 2% = 12%. Assignments submitted more than 50 hours late will not be marked.
Assignment 1: due Jan 27 (11:59 pm)
- Part 1 (5 points): K-nearest neighbours
- Download the dataset for K-nearest neighbours: knn-dataset.zip
- Origin: this data is a modified version of the Optical Recognition of Handwritten Digits Dataset from the UCI repository. It contains pre-processed black and white images of the digits 5 and 6. Each feature indicates how many pixels are black in a patch of 4 x 4 pixels.
- Format: there is one row per image and one column per feature. The class labels are 5 and 6. The label on line n in train_labels.csv is the label for the data point on line n in train_inputs.csv.
- Implement k-nearest neighbours by filling in the functions in the skeleton code: cs480_winter23_asst1_knn_skeleton.ipynb
- Do not import any additional library. Feel free to run the Jupyter notebook on any machine or Google Colab. Google Colab is a free cloud environment provided by Google that allows you to run Jupyter notebooks very easily. Python and all necessary libraries are already installed.
- Once you are done filling in all the functions, run the Jupyter notebook entirely and save the following results:
- A graph that shows the average accuracy based on 10-fold cross validation when varying the number of neighbours from 1 to 30.
- The best number of neighbours found by 10-fold cross validation and its cross-validation accuracy.
- The test accuracy based on the best number of neighbours
- Upload to LEARN your Jupyter notebook with the results saved. Do not submit a zip file or pdf file. The TAs will run some of the Jupyter notebooks to verify the results.
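To give a rough idea of what the filled-in notebook computes (the skeleton defines its own function signatures, so treat the names below as illustrative only), k-nearest neighbours with majority voting and 10-fold cross-validation can be sketched in plain NumPy as follows:

```python
import numpy as np

def knn_predict(train_inputs, train_labels, test_inputs, k):
    """Predict a label for each test point by majority vote among its k nearest training points."""
    predictions = []
    for x in test_inputs:
        distances = np.sqrt(np.sum((train_inputs - x) ** 2, axis=1))  # Euclidean distances
        nearest = np.argsort(distances)[:k]                           # indices of the k closest points
        labels, counts = np.unique(train_labels[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])                 # majority label
    return np.array(predictions)

def cross_validation_accuracy(inputs, labels, k, n_folds=10):
    """Average accuracy of k-NN over n_folds folds (simple contiguous split)."""
    folds_x = np.array_split(inputs, n_folds)
    folds_y = np.array_split(labels, n_folds)
    accuracies = []
    for i in range(n_folds):
        train_x = np.concatenate([f for j, f in enumerate(folds_x) if j != i])
        train_y = np.concatenate([f for j, f in enumerate(folds_y) if j != i])
        predictions = knn_predict(train_x, train_y, folds_x[i], k)
        accuracies.append(np.mean(predictions == folds_y[i]))
    return np.mean(accuracies)
```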
- Part 2 (5 points): Linear regression
- Download the dataset for linear regression: regression-dataset.zip
- Origin: this data consists of samples from a 2D surface that you can plot to visualize how linear regression is working.
- Format: there is one row per data point and one column per feature. The targets are real values. The target on line n in train_targets.csv is the target for the data point on line n in train_inputs.csv.
- Implement linear regression by filling in the functions in the skeleton code: cs480_winter23_asst1_linear_regression_skeleton.ipynb
- Do not import any additional library. Feel free to run the Jupyter notebook on any machine or Google Colab. Google Colab is a free cloud environment provided by Google that allows you to run Jupyter notebooks very easily. Python and all necessary libraries are already installed.
- Once you are done filling in all the functions, run the Jupyter notebook entirely and save the following results:
- A graph that shows the average mean squared error based on 10-fold cross validation when varying the lambda hyperparameter from 0 to 3 in increments of 0.1.
- The best lambda found by 10-fold cross validation and its cross validation mean squared error.
- The test mean squared error based on the best lambda.
- Upload to LEARN your Jupyter notebook with the results saved. Do not submit a zip file or pdf file. The TAs will run some of the Jupyter notebooks to verify the results.
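For reference (again, the skeleton defines its own function signatures), regularized linear regression has a closed-form solution, which can be sketched as:

```python
import numpy as np

def fit_linear_regression(X, y, lam):
    """Closed-form regularized least-squares weights: w = (X^T X + lam * I)^{-1} X^T y.
    X is assumed to already include a column of ones for the bias term."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)  # more stable than computing the inverse explicitly

def mean_squared_error(X, y, w):
    """Average squared error between the predictions X @ w and the targets y."""
    return np.mean((X @ w - y) ** 2)
```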
- Part 3 (5 points): Theory. In class, we discussed several loss functions for linear regression. However, all the loss functions that we discussed assume that the error contributed by each data point has the same importance. Consider a scenario where we would like to give more weight to some data points. Our goal is to fit the data points (x_n, y_n) in proportion to their weights r_n by minimizing the following objective: L(w) = Σ_n r_n (y_n − w^T x̄_n)^2, where w is the vector of model parameters and (x_n, y_n) is a training data pair. To simplify things, feel free to consider 1D data (i.e., x_n is a scalar).
- Derive a closed-form expression for the estimate of w that minimizes the objective. Show the steps along the way, not just the final estimates.
- Show that this objective is equivalent to the negative log-likelihood for linear regression where each data point may have a different Gaussian measurement noise. What is the variance of each measurement noise in this model?
- Upload to LEARN a pdf file with your answers for the two previous questions.
Assignment 2: due Feb 10 (11:59 pm)
In this assignment, you will implement logistic regression. Then you will test your implementations on a small dataset.
- Dataset: Use the same dataset as for K-nearest neighbours in Assignment 1.
- Algorithm implementation: Implement logistic regression based on gradient descent and Newton's algorithm by filling in the functions in the skeleton code. The skeleton code consists of a Python Jupyter notebook:
- Logistic regression skeleton: cs480_winter23_asst2_logistic_regression_skeleton.ipynb
- Do not import any additional library. Feel free to run the Jupyter notebooks on any machine or Google Colab. Google Colab is a free cloud environment provided by Google that allows you to run Jupyter notebooks very easily. Python and all necessary libraries are already installed.
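As a rough illustration of the two optimization schemes (not the skeleton's actual signatures), the update rules for L2-regularized logistic regression can be sketched as follows, with labels y in {0, 1} and X already containing a bias column:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(w, X, y, lam):
    """Gradient of the regularized negative log likelihood: X^T (sigmoid(Xw) - y) + lam * w."""
    return X.T @ (sigmoid(X @ w) - y) + lam * w

def gradient_descent_step(w, X, y, lam, learning_rate):
    """One gradient descent update: move against the gradient with a fixed step size."""
    return w - learning_rate * gradient(w, X, y, lam)

def newton_step(w, X, y, lam):
    """One Newton update: w - H^{-1} g, where H = X^T R X + lam * I and R = diag(p(1 - p))."""
    p = sigmoid(X @ w)
    H = X.T @ np.diag(p * (1 - p)) @ X + lam * np.eye(X.shape[1])
    return w - np.linalg.solve(H, gradient(w, X, y, lam))
```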
- Submission via LEARN: Jupyter notebook
- Results: Once you are done filling in all the functions, run the Jupyter notebook entirely and save the results. For the purpose of the assignment, use the default values included in the skeleton code for max_iters, gradient_norm_threshold and learning_rate. Make sure that the following results are saved:
- Gradient descent (4 points):
- Graph that shows the negative log probabilities based on 10-fold cross validation when varying the lambda hyperparameter from 0 to 25 in increments of 1.
- The best lambda found by 10-fold cross validation and its cross validation negative log probability.
- The test negative log probability and the test accuracy based on the best lambda
- The number of iterations of gradient descent for the best lambda
- Newton's algorithm (4 points):
- Graph that shows the negative log probabilities based on 10-fold cross validation when varying the lambda hyperparameter from 0 to 25 in increments of 1.
- The best lambda found by 10-fold cross validation and its cross validation negative log probability.
- The test negative log probability and the test accuracy based on the best lambda
- The number of iterations of Newton's algorithm for the best lambda
- Discussion: At the end of the Jupyter notebook, add a text cell and answer the following questions:
- Question 1 (2 points): Compare the results for gradient descent and Newton's algorithm. Discuss the scalability of each algorithm (i.e., time and space complexity). For what type of dataset would you use gradient descent versus Newton's algorithm?
- Question 2 (2 points): Logistic regression finds a linear separator whereas k-Nearest Neighbours (in Assignment 1) finds a non-linear separator. Compare the expressivity of the separators. Discuss under what circumstances each type of separator is expected to perform best. What could explain the results obtained with KNN in comparison to the results obtained with logistic regression?
- Question 3 (3 points): Is the training set used in this assignment linearly separable? To answer this question, add some code to the Jupyter notebook that uses a logistic regression classifier to determine whether the training set is linearly separable. Add some text that explains why this code can determine the linear separability of a dataset. Indicate whether the training set is linearly separable based on the results.
Assignment 3: due March 3 (11:59 pm)
In this assignment, you will experiment with fully connected neural networks and convolutional neural networks, using the PyTorch package. PyTorch facilitates the design of neural networks, automatic differentiation and accelerated computation with GPUs and multi-core CPUs. Preliminary steps:
- Familiarize yourself with PyTorch by going through the tutorial Get familiar with PyTorch: a 60 minute blitz
- Download and install PyTorch on a machine with a GPU or use Google's Colaboratory environment, which allows you to run PyTorch code on a GPU in the cloud. Colab already has PyTorch pre-installed. To enable GPU acceleration, click on "edit", then "notebook settings" and select "GPU" for hardware acceleration. It is also possible to select "TPU", but the PyTorch code provided with this assignment will need to be modified in a non-trivial way to take advantage of TPU acceleration. Note that you can complete this assignment without any GPU (or TPU). Any computer that has several cores will also provide some degree of acceleration.
- Download the base code for this assignment: cs480_winter23_asst3_cnn_cifar10.ipynb. (bug in the test function corrected on Feb 26: the test function now returns the average_test_loss instead of the average_train_loss)
Answer the following questions by modifying the base code in cs480_winter23_asst3_cnn_cifar10.ipynb. Submit the modified Jupyter notebook via LEARN.
- Part 1 (3 points) Architecture: Compare the accuracy of the convolutional neural network in the file cs480_winter23_asst3_cnn_cifar10.ipynb on the cifar10 dataset to the accuracy of simple dense neural networks with 0, 1, 2 and 3 hidden layers of 512 rectified linear units each. Run the code in the file cs480_winter23_asst3_cnn_cifar10.ipynb without changing the parameters to train a convolutional neural network. Then, modify the code in cs480_winter23_asst3_cnn_cifar10.ipynb to obtain simple dense neural networks with 0, 1, 2 and 3 hidden layers of 512 rectified linear units. Produce two graphs that each contain 5 curves (one for the convolutional neural net and one for each dense neural net of 0-3 hidden layers). The y-axis is the accuracy and the x-axis is the number of epochs (# of passes through the training set). Produce one graph where all the curves correspond to the training accuracy and a second graph where all the curves correspond to the test accuracy. Train the neural networks for 10 epochs. Save the following results in your Jupyter notebook:
- The two graphs for training and testing accuracy.
- Add some text to the Jupyter notebook to explain the results (i.e., why some models perform better or worse than other models).
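One way (among several) to build the dense baselines for Part 1 is with nn.Sequential; a minimal sketch, assuming CIFAR-10 images flattened to 3 * 32 * 32 = 3072 inputs and 10 output classes:

```python
import torch.nn as nn

def make_dense_net(num_hidden_layers, hidden_size=512, input_size=3 * 32 * 32, num_classes=10):
    """Fully connected network with the given number of ReLU hidden layers.
    With num_hidden_layers=0 this reduces to a single linear layer."""
    layers = [nn.Flatten()]
    in_features = input_size
    for _ in range(num_hidden_layers):
        layers += [nn.Linear(in_features, hidden_size), nn.ReLU()]
        in_features = hidden_size
    layers.append(nn.Linear(in_features, num_classes))
    return nn.Sequential(*layers)

# The four dense baselines requested in Part 1 (0, 1, 2 and 3 hidden layers).
dense_models = [make_dense_net(h) for h in range(4)]
```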
- Part 2 (2 points) Activation functions: Compare the accuracy achieved by rectified linear units and sigmoid units in the convolutional neural network in cs480_winter23_asst3_cnn_cifar10.ipynb. Modify the code in cs480_winter23_asst3_cnn_cifar10.ipynb to use sigmoid units. Produce two graphs (one for training accuracy and one for test accuracy) that each contain 2 curves (one for rectified linear units and another one for sigmoid units). The y-axis is the accuracy and the x-axis is the number of epochs. Train the neural networks for 10 epochs. Save the following results in your Jupyter notebook:
- The two graphs for training and test accuracy.
- Add some text to the Jupyter notebook to explain the results (i.e., why one model performs better or worse than the other model).
- Part 3 (2 points) Dropout: Compare the accuracy achieved with and without dropout in the convolutional neural network in cs480_winter23_asst3_cnn_cifar10.ipynb. Modify the code in cs480_winter23_asst3_cnn_cifar10.ipynb by inserting a dropout probability of 0.25 after each max pooling layer and a dropout probability of 0.5 after the hidden fully connected layer. Produce two graphs (one for training accuracy and the other one for testing accuracy) that each contain 2 curves (with and without dropout). The y-axis is the accuracy and the x-axis is the number of epochs. Produce curves for 20 epochs.
- The two graphs for training and testing accuracy.
- Add some text to the Jupyter notebook to explain the results (i.e., why did one model perform better than the other one).
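For reference, the dropout layers can be inserted with nn.Dropout; the pattern looks like the sketch below (the channel and layer sizes here are placeholders; use those from the base code):

```python
import torch.nn as nn

# Dropout with p=0.25 after each max pooling layer...
conv_block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Dropout(p=0.25),
)

# ...and dropout with p=0.5 after the hidden fully connected layer.
classifier = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(512, 10),
)
```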
- Part 4 (2 points) Optimizers: Compare the accuracy achieved when training the convolutional neural network in cs480_winter23_asst3_cnn_cifar10.ipynb with four different optimizers: SGD (learning rate = 0.001), RMSprop (learning rate = 0.0001), Adagrad (default parameters) and Adam (default parameters). Modify the code in cs480_winter23_asst3_cnn_cifar10.ipynb to use the SGD, Adagrad and RMSprop optimizers. Produce two graphs (one for training accuracy and the other one for testing accuracy) that each contain 4 curves (for SGD, RMSprop, Adagrad and Adam). The y-axis is the accuracy and the x-axis is the number of epochs. Produce curves up to 10 epochs.
- The two graphs for training and testing accuracy.
- Add some text to the Jupyter notebook to explain the results (i.e., why did some optimizers perform better or worse than other optimizers).
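The four optimizers can be constructed from torch.optim with the learning rates listed above (and default parameters otherwise); a minimal sketch:

```python
import torch.optim as optim

def make_optimizer(name, model):
    """Build one of the four optimizers for Part 4. Re-initialize the model before
    training with each optimizer so that every comparison starts from scratch."""
    if name == "SGD":
        return optim.SGD(model.parameters(), lr=0.001)
    if name == "RMSprop":
        return optim.RMSprop(model.parameters(), lr=0.0001)
    if name == "Adagrad":
        return optim.Adagrad(model.parameters())  # default parameters
    if name == "Adam":
        return optim.Adam(model.parameters())     # default parameters
    raise ValueError(f"unknown optimizer: {name}")
```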
- Part 5 (3 points) Filters: Compare the accuracy of the convolutional neural network in cs480_winter23_asst3_cnn_cifar10.ipynb with a modified version that replaces each stack of (Conv2d, Activation, Max pooling) layers with 5x5 filters by a deeper stack of (Conv2d, Activation, Conv2d, Activation, Max Pooling) layers with 3x3 filters. Produce two graphs (one for training accuracy and the other one for testing accuracy) that each contain 2 curves (for 3x3 filters and 5x5 filters). The y-axis is the accuracy and the x-axis is the number of epochs. Produce curves up to 10 epochs.
- The two graphs for training and testing accuracy.
- Add some text to the Jupyter notebook to explain the results (i.e., why did one architecture perform better or worse than the other architecture).
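For reference, the replacement described in Part 5 follows the pattern below (channel sizes are placeholders; keep the ones from the base code). Two stacked 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution while adding an extra non-linearity:

```python
import torch.nn as nn

# Original pattern: one 5x5 convolution per stack.
stack_5x5 = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

# Deeper replacement: two 3x3 convolutions per stack.
stack_3x3 = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)
```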
- Part 6 (3 points) Theory: Show that a neural network that uses the tanh activation function can represent the same space of functions as another neural network that uses sigmoid activation functions instead of tanh activation functions. More precisely, let f(x) = W^(1) tanh(W^(2) x + b^(2)) + b^(1) be a two-layer neural network that uses the tanh activation function for the hidden layer. Design a mathematically equivalent neural network g(x) that uses the sigmoid activation function instead of tanh. Show that f(x) = g(x).
Assignment 4: due March 17 (11:59 pm)
In this assignment, you will experiment with various types of recurrent neural networks (RNNs) and transformers in PyTorch.
- Download the base code for this assignment:
Answer the following questions by modifying the base code in each notebook. Submit the modified Jupyter notebooks via LEARN.
- Part 1 (5 points): Encoder implementation in cs480_winter23_asst4_char_rnn_classification.ipynb. Compare the accuracy of the encoder when varying the type of hidden units: linear units, gated recurrent units (GRUs) and long short term memory (LSTM) units. For linear hidden units, just run the script of the Jupyter notebook as it is. For GRUs and LSTMs, modify the base code. Save the following results in your Jupyter notebook:
- Two graphs that each contain 3 curves (linear hidden units, GRUs and LSTM units). The first graph displays the training loss and the second graph displays the validation loss. In both graphs, the y-axis is the negative log likelihood and the x-axis is the number of thousands of iterations.
- For each type of hidden unit, print the test loss and the test confusion matrix of the model that achieved the best validation loss among all iterations (i.e., one best test loss and test confusion matrix per type of hidden unit).
- Explanation of the results (i.e., why some hidden units perform better or worse than other units).
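One way to obtain the GRU variant (the LSTM variant is analogous, except that an LSTM cell also carries a cell state) is to replace the base code's hand-rolled recurrent layer with nn.GRUCell; a minimal sketch, with shapes following the tutorial's one-character-at-a-time convention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRUClassifier(nn.Module):
    """Character-level classifier with a GRU hidden unit (illustrative; adapt to the base code)."""
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.cell = nn.GRUCell(input_size, hidden_size)  # for LSTMs, use nn.LSTMCell and track (h, c)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, line_tensor):
        # line_tensor has shape (seq_len, 1, input_size): one one-hot character per step.
        hidden = torch.zeros(1, self.hidden_size)
        for i in range(line_tensor.size(0)):
            hidden = self.cell(line_tensor[i], hidden)
        return F.log_softmax(self.out(hidden), dim=1)
```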
- Part 2 (5 points): Decoder implementation in cs480_winter23_asst4_char_rnn_generation.ipynb. Compare the accuracy of the decoder when varying the information fed as input to the hidden units at each time step: i) previous hidden unit, previous character and category; ii) previous hidden unit and previous character; iii) previous hidden unit and category; iv) previous hidden unit.
For i), just run the Python notebook as it is. For ii) and iv), modify the code to feed the category as input to the hidden unit(s) of the first time step only. For iii) and iv), modify the code to avoid feeding the previous character as input to each hidden unit. Save the following results in your Jupyter notebook:
- Two graphs that each contain 4 curves (i, ii, iii, iv). The first graph displays the training loss and the second graph displays the validation loss. In both graphs, the y-axis is the negative log likelihood and the x-axis is the number of iterations (in units of 500 iterations).
- For each architecture, print the test loss of the model that achieved the best validation loss among all iterations (i.e., one best test loss per architecture).
- Explanation of the results (i.e., how does the type of information fed to the hidden units affect the results).
- Part 3 (5 points): Seq2seq implementation in cs480_winter23_asst4_seq2seq_translation.ipynb. Compare the accuracy of RNN seq2seq models with and without attention as well as a transformer. To keep the running time reasonable, use a hidden_size of 32 (in practice, a hidden size of at least 512 would be used, but for the purpose of this assignment, use 32). Test the translation models on sentences of MAX_LENGTH = 5 and 10. For the RNN seq2seq model with attention (hidden_size=32 and MAX_LENGTH=5), just run the base code as it is. For the RNN seq2seq model without attention, modify the base code to call the DecoderRNN class (without attention) already provided. For the transformer model, use the nn.TransformerEncoder, nn.TransformerDecoder or nn.Transformer classes in the PyTorch library and modify any part of the base code that you see fit. Set the parameters of the transformer to match those of the RNN seq2seq models whenever possible (i.e., d_model=32), otherwise feel free to choose suitable parameters. Save the following results in your Jupyter notebook:
- Four graphs that each contain 3 curves (RNN seq2seq with attention and RNN seq2seq without attention and transformer). The first and second graphs display the training loss and the validation loss respectively for sentences of MAX_LENGTH=5. The third and fourth graphs display the training loss and validation loss respectively for sentences of MAX_LENGTH=10. In all graphs, the y-axis is the negative log likelihood and the x-axis is the number of thousands of iterations.
- For each architecture tested with sentences of MAX_LENGTH=5, print the test loss of the model that achieved the best validation loss among all iterations (i.e., one best test loss per architecture). Similarly, for each architecture tested with sentences of MAX_LENGTH=10, print the test loss of the model that achieved the best validation loss among all iterations (i.e., one best test loss per architecture).
- Explanation of the results (i.e., how do the attention mechanism and architecture of each model affect the results?).
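One possible starting point for the transformer (an illustration of the parameter matching, not part of the base code) is the nn.Transformer module with d_model=32; token embeddings, positional encodings, masking and the projection to the output vocabulary still need to be added around it:

```python
import torch.nn as nn

transformer = nn.Transformer(
    d_model=32,            # matches hidden_size=32 of the RNN seq2seq models
    nhead=4,               # number of attention heads (must divide d_model)
    num_encoder_layers=2,  # small values keep the running time comparable to the RNNs
    num_decoder_layers=2,
    dim_feedforward=64,
)
```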
Assignment 5: due March 31 (11:59 pm)
In this assignment, you will implement a variational auto-encoder (VAE) and a generative adversarial network (GAN) in PyTorch to generate images similar to those in the MNIST dataset. As a starting point, the code for a deterministic auto-encoder (DAE) is provided. While DAEs achieve good reconstruction of the original images, they struggle to generate new images that are similar to those in MNIST. Implement a VAE and GAN to generate better images. Download the skeleton code for this assignment:
Fill in the functions in each skeleton notebook and answer the following questions in each notebook. Submit the Jupyter notebooks via LEARN.
- Part 1 (7 points): VAE implementation in cs480_winter23_asst5_vae_skeleton.ipynb. Fill in the functions and save the following results in your Jupyter notebook:
- Two graphs that each contain 2 curves (DAE and VAE). The first graph displays the training reconstruction loss and the second graph displays the testing reconstruction loss. In both graphs, the y-axis is binary cross entropy and the x-axis is the number of epochs.
- Print a sample of generated images after each epoch of training for both DAEs and VAEs.
- Explanation of the results (i.e., compare and explain the binary cross entropy and the quality of the sampled images generated by DAEs and VAEs).
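The two pieces that distinguish a VAE from the provided DAE are the reparameterization trick and the KL term added to the reconstruction loss; a minimal sketch, assuming the encoder outputs a mean and a log-variance and the decoder outputs pixel probabilities:

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping the sampling step differentiable."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std

def vae_loss(reconstruction, target, mu, logvar):
    """Reconstruction binary cross entropy plus the KL divergence between q(z|x) and N(0, I)."""
    bce = F.binary_cross_entropy(reconstruction, target, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kl
```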
- Part 2 (8 points): GAN implementation in cs480_winter23_asst5_gan_skeleton.ipynb. Check out the following tutorial for GANs. Fill in the functions and save the following results in your Jupyter notebook:
- Two graphs that each contain 2 curves (Generator and Discriminator losses). The first graph displays the training loss and the second graph displays the testing loss. In both graphs, the y-axis is binary cross entropy and the x-axis is the number of epochs.
- Print a sample of generated images after each epoch of training for your GAN.
- Explanation of the results (i.e., compare and explain the quality of the sampled images generated by VAEs and GANs).
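A minimal sketch of one GAN training step with binary cross entropy losses (generator, discriminator and latent_dim are placeholders for whatever the skeleton defines):

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_optimizer, d_optimizer, real_images, latent_dim):
    """One discriminator update followed by one generator update."""
    batch_size = real_images.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # Discriminator: push real images towards label 1 and generated images towards label 0.
    fake_images = generator(torch.randn(batch_size, latent_dim))
    d_loss = (F.binary_cross_entropy(discriminator(real_images), real_labels)
              + F.binary_cross_entropy(discriminator(fake_images.detach()), fake_labels))
    d_optimizer.zero_grad()
    d_loss.backward()
    d_optimizer.step()

    # Generator: try to make the discriminator output 1 for generated images.
    g_loss = F.binary_cross_entropy(discriminator(fake_images), real_labels)
    g_optimizer.zero_grad()
    g_loss.backward()
    g_optimizer.step()
    return d_loss.item(), g_loss.item()
```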