
CS885 Winter 2022 - Reinforcement Learning

There will be three assignments, each worth 10% of the final grade. Assignments are done individually (i.e., no teams). Each assignment will have a programming part to be done in Python. Some assignments will make use of PyTorch, TensorFlow and OpenAI Gym. For GPU or TPU acceleration, feel free to use Google's Colaboratory environment. This is a free cloud service where you can run Python code (TensorFlow and PyTorch are pre-installed) with GPU or TPU acceleration. A virtual machine with two CPUs and one GPU or TPU will run for up to 12 hours, after which it must be restarted. The following steps are recommended:

The approximate out and due dates are listed with each assignment below.

Your assignment should be submitted electronically via LEARN. You can make as many submissions as you wish; only your last submission will be marked. A late submission will incur a 2% penalty for every hour (rounded up) past the deadline. For example, an assignment submitted 5 hours and 15 minutes late will receive a penalty of ceiling(5.25) * 2% = 12%. Assignments submitted more than 50 hours late will not be marked.

Assignment 1: out Jan 10, due Jan 21 (11:59 pm)

This assignment has three parts. [10 points total]

Part I [4 points]

In the first part, you will program value iteration, policy iteration and modified policy iteration for Markov decision processes in Python. More specifically, fill in the functions in the skeleton code of the file MDP.py. The file TestMDP.py contains the simple MDP example from Lecture 2a Slides 13-14. You can verify that your code compiles properly with TestMDP.py by running "python TestMDP.py". Add print statements to this file to verify that the output of each function makes sense.
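As a point of reference while filling in MDP.py, here is a minimal sketch of value iteration under one assumed convention: a transition tensor T[a, s, s'] and a reward matrix R[a, s]. The skeleton's actual function signatures may differ, so treat this only as an illustration of the Bellman backup, not as the required interface.

    # Hypothetical value iteration sketch; T[a, s, s'] and R[a, s] are assumed conventions.
    import numpy as np

    def value_iteration(T, R, discount, tol=0.01, max_iter=1000):
        """Return an approximately optimal value function and a greedy policy."""
        n_actions, n_states, _ = T.shape
        V = np.zeros(n_states)
        for _ in range(max_iter):
            # Q[a, s] = R[a, s] + discount * sum_s' T[a, s, s'] * V[s']
            Q = R + discount * (T @ V)
            V_new = Q.max(axis=0)
            if np.max(np.abs(V_new - V)) <= tol:
                V = V_new
                break
            V = V_new
        policy = Q.argmax(axis=0)
        return V, policy

Policy iteration and modified policy iteration reuse the same backup: policy evaluation (full evaluation, or a fixed number of evaluation sweeps for the modified variant) followed by greedy policy improvement.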

Submit the following material via LEARN:

Part II [3 points]

In the second part, you will program the Q-learning algorithm in Python. More specifically, fill in the functions in the skeleton code of the file RL.py. This file requires the file MDP.py that you programmed for Part I, so make sure to include it in the same directory. The file TestRL.py contains a simple RL problem to test your functions (i.e., the output of each function will be printed to the screen). You can verify that your code compiles properly by running "python TestRL.py".
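For orientation, a bare-bones tabular Q-learning loop might look like the sketch below. The environment interface (reset/step) and the Q-table layout are assumptions made for illustration; RL.py's skeleton defines its own interface, which you should follow.

    # Illustrative tabular Q-learning; the env interface used here is hypothetical.
    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=1000,
                   alpha=0.1, gamma=0.95, epsilon=0.1):
        Q = np.zeros((n_actions, n_states))
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # epsilon-greedy action selection
                if np.random.rand() < epsilon:
                    a = np.random.randint(n_actions)
                else:
                    a = int(np.argmax(Q[:, s]))
                s_next, r, done = env.step(a)  # assumed step() signature
                # one-step temporal-difference update toward r + gamma * max_a' Q(s', a')
                target = r + gamma * np.max(Q[:, s_next]) * (not done)
                Q[a, s] += alpha * (target - Q[a, s])
                s = s_next
        return Q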

Submit the following material via LEARN:

Part III [3 points]

In the third part, you will train a deep Q-network to solve the CartPole problem from OpenAI Gym. This problem has continuous states that prevent the use of a tabular representation. Instead, you will use a neural network to represent the Q-function. Follow these steps to get started:
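As a rough reference while working through those steps, the sketch below shows one possible DQN-style training loop for CartPole with PyTorch and an older Gym API (reset() returning the state, step() returning a 4-tuple). The network size, hyperparameters, and the absence of a target network are simplifications for illustration, not the intended solution.

    # Simplified DQN-style loop for CartPole; hyperparameters and Gym API version are assumptions.
    import random
    from collections import deque

    import gym
    import numpy as np
    import torch
    import torch.nn as nn

    env = gym.make("CartPole-v0")
    q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    replay = deque(maxlen=10000)
    gamma, epsilon, batch_size = 0.99, 0.1, 32

    for episode in range(200):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action from the current Q-network
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                with torch.no_grad():
                    action = int(q_net(torch.tensor(state, dtype=torch.float32)).argmax())
            next_state, reward, done, _ = env.step(action)
            replay.append((state, action, reward, next_state, float(done)))
            state = next_state
            if len(replay) >= batch_size:
                # sample a minibatch of transitions and regress Q(s, a) toward the TD target
                s, a, r, s2, d = map(
                    lambda x: torch.tensor(np.array(x), dtype=torch.float32),
                    zip(*random.sample(replay, batch_size)))
                q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
                with torch.no_grad():
                    target = r + gamma * q_net(s2).max(1).values * (1 - d)
                loss = nn.functional.mse_loss(q, target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()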

Submit the following material via LEARN:

Assignment 2: out January 24, due February 4 (11:59 pm)

This assignment has two parts. [10 points total + 1 bonus point]

Part I [5 points]

In the first part, you will program three bandit algorithms (epsilon-greedy, Thompson sampling and UCB) and one RL algorithm (model-based RL) in Python. More specifically, fill in the functions in the skeleton code of the file RL2.py. This file requires the file MDP.py that you programmed in Assignment 1, so make sure to include it in the same directory. The file TestRL2.py contains a simple bandit problem and a simple RL problem to test your functions (i.e., the output of each function will be printed to the screen). You can verify that your code compiles properly by running "python TestRL2.py".
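To make the Bayesian bandit concrete, here is a small self-contained Thompson sampling example for a Bernoulli bandit with a Beta posterior per arm. The function and argument names are illustrative only; follow the interface defined in the RL2.py skeleton.

    # Illustrative Thompson sampling for a Bernoulli bandit; names are hypothetical.
    import numpy as np

    def thompson_sampling(arm_probs, horizon=1000, seed=0):
        """arm_probs: true success probability of each arm (used only to simulate pulls)."""
        rng = np.random.default_rng(seed)
        n_arms = len(arm_probs)
        alpha = np.ones(n_arms)   # Beta posterior: 1 + observed successes
        beta = np.ones(n_arms)    # Beta posterior: 1 + observed failures
        rewards = []
        for _ in range(horizon):
            # sample a plausible mean reward for each arm and act greedily on the samples
            arm = int(np.argmax(rng.beta(alpha, beta)))
            reward = float(rng.random() < arm_probs[arm])  # simulated Bernoulli pull
            alpha[arm] += reward
            beta[arm] += 1.0 - reward
            rewards.append(reward)
        return np.array(rewards)

Epsilon-greedy and UCB differ only in how the arm is chosen (a random arm with probability epsilon, or the arm maximizing an upper confidence bound on its empirical mean); the bookkeeping of observed rewards is analogous.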

Submit the following material via LEARN:

Part II [5 points + 1 bonus point]

In the second part of the assignment, you will program the REINFORCE algorithm with a baseline and the Proximal Policy Optimization (PPO) algorithm. Use the following starter code. The complete REINFORCE algorithm is provided. Fill in the train function for REINFORCE with baseline and for PPO. You will test your code on the CartPole problem (same as Assignment 1), the mountain-car problem, and a modified version of the mountain-car problem.
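The core difference between the two train functions is the policy-gradient loss. The sketch below shows one common form of each loss in PyTorch; the variable names and the way returns and advantages are computed are assumptions, and the starter code's structure takes precedence.

    # Illustrative policy-gradient losses; tensor shapes are [batch] unless noted.
    import torch

    def reinforce_baseline_loss(log_probs, returns, values):
        """REINFORCE with a learned baseline: advantage = return - V(s)."""
        advantages = (returns - values).detach()
        return -(log_probs * advantages).mean()

    def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
        """PPO clipped surrogate objective, returned as a loss to minimize."""
        ratio = torch.exp(log_probs_new - log_probs_old.detach())
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()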

Submit the following material via LEARN:

Assignment 3: out February 7, due February 18 (11:59 pm)

This assignment has two parts. [10 points total]

Part I - Partially Observable RL [5 points]

Starter code: a3_part1_starter_code.zip

In this part, you will implement the Deep Recurrent Q-learning (DRQN) algorithm. You will test your implementation on a modified version of the Cartpole domain: the Cartpole environment will be a POMDP where the agent receives only a partial state (observation) of the environment instead of the full state. Start by downloading and running the provided PyTorch implementation of DQN in the Cartpole domain (run the DQN.py file in the starter code). Then implement the DRQN algorithm by filling in the functions in the template code provided in the file DRQN.py. You will need to modify the model specification in DRQN.py. Note that for this assignment you are expected to use the partially observable environment API provided, which returns the partial state (observation) at each time step. You should not change this API to obtain the full state.
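One possible way to modify the model specification is to insert an LSTM between the observation encoder and the Q-value head, so that the recurrent state summarizes the observation history. The class below is only a sketch under that assumption; the template in DRQN.py dictates the actual layer names and shapes.

    # Hypothetical recurrent Q-network for DRQN; layer names and sizes are illustrative.
    import torch
    import torch.nn as nn

    class RecurrentQNet(nn.Module):
        def __init__(self, obs_dim, n_actions, hidden_dim=64):
            super().__init__()
            self.encoder = nn.Linear(obs_dim, hidden_dim)
            # the LSTM integrates past observations to compensate for partial observability
            self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, n_actions)

        def forward(self, obs_seq, hidden=None):
            # obs_seq: [batch, seq_len, obs_dim]; hidden: optional (h, c) carried across steps
            x = torch.relu(self.encoder(obs_seq))
            x, hidden = self.lstm(x, hidden)
            return self.head(x), hidden  # Q-values for every time step, plus recurrent state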

• (1 point) Based on the results, explain the impact of the LSTM layer: compare the performance of DRQN and DQN and explain the reasons for any observed difference.

Submit the following material via LEARN:

Part II - Distributional RL [5 points]

Starter code: a3_part2_starter_code.zip

In this part, you will program the categorical (C51) distributional RL algorithm. The environment will be a modified version of the Cartpole domain. In this domain, noise sampled uniformly at random from the range (-0.05, 0.05) is added to the magnitude of the force applied to the pole every time an action is taken. In addition, the cart and the surface have a constant friction of 5 × 10⁻⁴ units, and the pole has an air drag (friction) of 2 × 10⁻⁶ units. These changes turn the deterministic Cartpole environment into a stochastic one. Start by downloading and running the provided implementation of DQN in Cartpole (run the DQN.py file). Now implement the C51 algorithm by modifying the functions in the C51.py file in the C51 folder.
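The heart of C51 is projecting the Bellman-updated return distribution back onto the fixed support of atoms. The sketch below shows that projection under assumed argument names and default support bounds; the C51.py file defines the actual atoms, networks, and loss.

    # Sketch of the C51 categorical projection; argument names and bounds are assumptions.
    import torch

    def project_distribution(next_probs, rewards, dones, gamma,
                             v_min=-10.0, v_max=10.0, n_atoms=51):
        """next_probs: [batch, n_atoms] distribution of the greedy next-state action."""
        batch = rewards.shape[0]
        delta_z = (v_max - v_min) / (n_atoms - 1)
        support = torch.linspace(v_min, v_max, n_atoms)
        # Bellman update of every atom, clipped to the support
        tz = (rewards.unsqueeze(1) + gamma * (1 - dones).unsqueeze(1) * support).clamp(v_min, v_max)
        b = (tz - v_min) / delta_z                  # fractional atom index
        lower, upper = b.floor().long(), b.ceil().long()
        # when b falls exactly on an atom, keep its mass instead of dropping it
        lower[(upper > 0) & (lower == upper)] -= 1
        upper[(lower < n_atoms - 1) & (lower == upper)] += 1
        proj = torch.zeros(batch, n_atoms)
        # split each atom's probability mass between its two neighbouring target atoms
        proj.scatter_add_(1, lower, next_probs * (upper.float() - b))
        proj.scatter_add_(1, upper, next_probs * (b - lower.float()))
        return proj

The cross-entropy between this projected target and the predicted distribution of the chosen action then plays the role that the squared TD error plays in DQN.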

Submit the following material via LEARN: