
CS885 Fall 2022 - Reinforcement Learning

There will be three assignments, each worth 10% of the final grade. Assignments are done individually (i.e., no teams). Each assignment will have a programming part to be done in Python. Some assignments will make use of PyTorch, TensorFlow and OpenAI Gym. For GPU or TPU acceleration, feel free to use Google's Colaboratory environment. This is a free cloud service where you can run Python code (including TensorFlow and PyTorch, which are pre-installed) with GPU or TPU acceleration. A virtual machine with two CPUs and one GPU or TPU will run for up to 12 hours, after which it must be restarted. The following steps are recommended:

The approximate out and due dates are given in the assignment headings below.

Your assignment should be submitted electronically via LEARN. You can make as many submissions as you wish. Your last submission will be marked. A late submission will incur a 2% penalty for every rounded-up hour past the deadline. For example, an assignment submitted 5 hours and 15 minutes late will receive a penalty of ceiling(5.25) * 2% = 12%. Assignments submitted more than 50 hours late will not be marked.
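For clarity, the penalty rule corresponds to the following small Python computation (the function name late_penalty is purely illustrative; it is not part of any submission system):

import math

def late_penalty(hours_late):
    """Percentage penalty for a submission that is hours_late hours past the deadline."""
    if hours_late <= 0:
        return 0                        # on time: no penalty
    if hours_late > 50:
        return None                     # more than 50 hours late: not marked
    return math.ceil(hours_late) * 2    # 2% per rounded-up hour

print(late_penalty(5.25))               # prints 12, i.e., ceiling(5.25) * 2%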

Assignment 1: out Sept 14, due Sept 28 (11:59 pm)

This assignment has three parts. [10 points total]

Part I [4 points]

In the first part, you will program value iteration, policy iteration and modified policy iteration for Markov decision processes in Python. More specifically, fill in the functions in the skeleton code of the file MDP.py. The file TestMDP.py contains the simple MDP example from Lecture 1b, Slides 17-18. You can verify that your code runs properly by executing "python TestMDP.py". Add print statements to this file to check that the output of each function makes sense.
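For orientation, here is a minimal sketch of value iteration on a tabular MDP, assuming the usual representation where T is an |A| x |S| x |S| transition model and R is an |A| x |S| reward matrix (the function and variable names are illustrative; your implementation must follow the interface in MDP.py):

import numpy as np

def value_iteration(T, R, gamma, tol=1e-6):
    """T: |A| x |S| x |S| transitions, R: |A| x |S| rewards, gamma: discount factor."""
    nActions, nStates, _ = T.shape
    V = np.zeros(nStates)
    while True:
        # Q[a, s] = R[a, s] + gamma * sum_s' T[a, s, s'] * V[s']
        Q = R + gamma * (T @ V)
        V_new = Q.max(axis=0)          # greedy backup over actions
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=0)          # greedy policy w.r.t. the (near-)converged values
    return V_new, policy

Policy iteration and modified policy iteration reuse the same backup, but alternate policy evaluation (exact or partial) with greedy policy improvement.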

Submit the following material via LEARN:

Part II [3 points]

In the second part, you will program the Q-learning algorithm in Python. More specifically, fill in the functions in the skeleton code of the file RL.py. This file requires the file MDP.py that you programmed for Part I, so make sure to include it in the same directory. The file TestRL.py contains a simple RL problem to test your functions (i.e., the output of each function will be printed to the screen). You can verify that your code runs properly by executing "python TestRL.py".
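For reference, the core of tabular Q-learning is the temporal-difference update below (a minimal sketch under assumed names; RL.py specifies the actual function signatures, including how episodes and exploration are handled):

import numpy as np

def q_learning_episode(env_step, Q, s0, nSteps, alpha=0.1, gamma=0.95, epsilon=0.1):
    """One episode of tabular Q-learning. env_step(s, a) is an assumed helper that
    returns (reward, next_state); Q has shape |A| x |S|."""
    s = s0
    for _ in range(nSteps):
        # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            a = np.random.randint(Q.shape[0])
        else:
            a = int(np.argmax(Q[:, s]))
        r, s_next = env_step(s, a)
        # move Q(a, s) toward the bootstrapped target r + gamma * max_a' Q(a', s')
        Q[a, s] += alpha * (r + gamma * np.max(Q[:, s_next]) - Q[a, s])
        s = s_next
    return Q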

Submit the following material via LEARN:

Part III [3 points]

In the third part, you will train a deep Q-network to solve the CartPole problem from OpenAI Gym. This problem has continuous states that prevent the use of a tabular representation. Instead, you will use a neural network to represent the Q-function. Follow these steps to get started:
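Purely for orientation (not a substitute for those steps), here is a minimal PyTorch sketch of a Q-network for CartPole and the DQN loss on a replay batch; the layer sizes, names and hyperparameters are illustrative assumptions, not a required architecture:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small MLP mapping a CartPole state (4 values) to Q-values for the 2 actions."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean-squared TD error on a replay batch of tensors (s, a, r, s_next, done)."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a) for taken actions
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(1).values
    return nn.functional.mse_loss(q_sa, target)

A separate target network (updated periodically from q_net) and an experience replay buffer are the other two standard ingredients of DQN.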

Submit the following material via LEARN:

Assignment 2: out September 28, due October 19 (11:59 pm)

This assignment has three parts. [10 points total]

Part I [4 points]

In the first part, you will program three bandit algorithms (epsilon-greedy, Thompson sampling and UCB) in Python. More specifically, fill in the functions in the skeleton code of the file RL2.py. This file requires the file MDP.py that you programmed in Assignment 1, so make sure to include it in the same directory. The file TestBandit.py contains a simple bandit problem to test your functions (i.e., the output of each function will be printed to the screen). You can verify that your code runs properly by executing "python TestBandit.py".
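For reference, the three action-selection rules can be summarized as follows (a minimal sketch with assumed variable names; RL2.py fixes the actual function signatures and how counts and empirical means are maintained):

import numpy as np

def epsilon_greedy(values, epsilon):
    """values[a]: empirical mean reward of arm a."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(values))     # explore: random arm
    return int(np.argmax(values))                 # exploit: best empirical arm

def ucb(values, counts, t, c=2.0):
    """Upper confidence bound: empirical mean plus a bonus that shrinks with counts[a]."""
    bonus = np.sqrt(c * np.log(t + 1) / (counts + 1e-8))
    return int(np.argmax(values + bonus))

def thompson_sampling(alpha, beta):
    """Bernoulli bandit: sample a success probability per arm from its Beta(alpha, beta)
    posterior and pull the arm with the largest sample."""
    return int(np.argmax(np.random.beta(alpha, beta)))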

Submit the following material via LEARN:

Part II [3 points]

In the second part of the assignment, you will program the REINFORCE algorithm with a baseline and the Proximal Policy Optimization (PPO) algorithm. Use the following starter code. The complete REINFORCE algorithm is provided. Fill in the train function for REINFORCE with baseline and for PPO. You will test your code on the CartPole problem (same as Assignment 1).
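As a reminder of what the two train functions need to compute, here is a minimal sketch of the REINFORCE-with-baseline loss and the PPO clipped surrogate (the tensor names and the combined loss weighting are assumptions; the starter code defines the actual training loop):

import torch

def reinforce_baseline_loss(log_probs, returns, values):
    """log_probs: log pi(a_t | s_t); returns: Monte Carlo returns G_t; values: baseline V(s_t).
    The policy term uses the advantage G_t - V(s_t); the baseline is fit by regression."""
    advantages = returns - values.detach()
    policy_loss = -(log_probs * advantages).mean()
    value_loss = torch.nn.functional.mse_loss(values, returns)
    return policy_loss + value_loss

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate: take the more pessimistic of the unclipped and clipped terms."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()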

Submit the following material via LEARN:

Part III [3 points]

In the third part of the assignment, you will program the conservative Q-learning (CQL) algorithm for offline RL. Download the starter code below. Follow the instructions in readme.md to create a conda environment with the specified packages. Download the data that will be used for offline RL. Then run the skeleton code with the offline data for CartPole. The skeleton code includes a fully specified version of offline DQN and a partially specified version of CQL. Fill in the learn function of CQL in cql.py.
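For orientation, CQL adds a conservative regularizer to the usual DQN loss: it pushes Q-values down on all actions (through a log-sum-exp) while pushing them up on the actions actually present in the offline data. A minimal sketch for discrete actions is below (the alpha weight and tensor names are assumptions; cql.py defines the exact interface):

import torch

def cql_loss(q_net, target_net, batch, gamma=0.99, alpha=1.0):
    """DQN TD loss plus the CQL(H) conservative penalty on an offline batch."""
    s, a, r, s_next, done = batch
    q_all = q_net(s)                                      # Q(s, .) for every action
    q_sa = q_all.gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a) for dataset actions
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(1).values
    td_loss = torch.nn.functional.mse_loss(q_sa, target)
    # Conservative term: log-sum-exp over all actions minus the dataset-action value
    cql_term = (torch.logsumexp(q_all, dim=1) - q_sa).mean()
    return td_loss + alpha * cql_term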

Submit the following material via LEARN:

Assignment 3: out October 19, due November 2 (11:59 pm)

This assignment has two parts. [10 points total]

Part I - Constrained RL [5 points]

Starter code: a3_part1_starter_code.zip

In this part, you will program the PPO Penalty algorithm for constrained RL. The environment will be a modification of the CartPole Gym environment, where a cost (or constraint value) of 1 is incurred for every timestep that the agent spends in the region x >= 1. For reference, the agent is normally free to move in the interval [-2.4, 2.4] in the CartPole environment. Start by downloading and running the provided implementation of PPO (ppo.py) with the given CartPole environment (the new environment is called ConstrainedCartPole). Then program the PPO Penalty algorithm in a new file called ppopenalty.py by modifying the PPO algorithm in the file ppo.py. Run PPO and PPO Penalty for 300 epochs and average the curves over 5 seeds, optionally showing +/- one standard deviation in the graph. Additionally, you may average the results over the past few timesteps to obtain smoother curves.
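One common way to realize a penalty-based constrained PPO is to maintain a Lagrange multiplier on the episodic cost and fold it into the advantage before the usual clipped update. The sketch below illustrates that idea only; the class and method names, the softplus parameterization and the update rule are assumptions, and ppopenalty.py should follow whatever formulation was presented in lecture:

import torch

class CostPenalty:
    """Lagrange-style penalty on an episodic cost constraint (illustrative sketch)."""
    def __init__(self, cost_limit, lam_init=0.0, lam_lr=0.01):
        self.cost_limit = cost_limit
        self.lam_param = torch.tensor(float(lam_init))   # unconstrained parameter
        self.lam_lr = lam_lr

    @property
    def lam(self):
        return torch.nn.functional.softplus(self.lam_param)   # keep the multiplier >= 0

    def update(self, mean_episode_cost):
        # Dual ascent: raise the multiplier when the observed cost exceeds the limit
        self.lam_param = self.lam_param + self.lam_lr * (mean_episode_cost - self.cost_limit)

    def penalized_advantage(self, reward_adv, cost_adv):
        # Combine reward and cost advantages; the PPO clipped update then uses this quantity
        lam = self.lam.detach()
        return (reward_adv - lam * cost_adv) / (1.0 + lam)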

Submit the following material via LEARN:

Part II - Distributional RL [5 points]

Starter code: a3_part2_starter_code.zip

In this part, you will program the categorical (C51) distributional RL algorithm. The environment will be a modified version of the CartPole domain. In this domain, noise sampled uniformly at random from the range (-0.05, 0.05) is added to the magnitude of the force applied to the pole every time an action is taken. In addition, the cart and the surface have a constant friction of 5 × 10⁻⁴ units and the pole has an air drag (friction) of 2 × 10⁻⁶ units. These changes turn the deterministic CartPole environment into a stochastic environment. Start by downloading and running the provided implementation of DQN in CartPole (run the DQN.py file). Then implement the C51 algorithm by modifying the functions in the C51.py file in the C51 folder.
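As a refresher, C51 represents the return distribution with probabilities over a fixed support of atoms and projects the distributional Bellman target back onto that support. The sketch below shows the projection step only (parameter names and the support range are illustrative assumptions; C51.py defines the actual network and training code):

import torch

def categorical_projection(next_probs, rewards, dones, gamma=0.99,
                           v_min=-10.0, v_max=10.0, n_atoms=51):
    """Project the Bellman-updated atoms onto the fixed support z_1..z_N.
    next_probs: (batch, n_atoms) probabilities for the greedy next action."""
    batch = rewards.shape[0]
    delta_z = (v_max - v_min) / (n_atoms - 1)
    z = torch.linspace(v_min, v_max, n_atoms)                     # support atoms
    # Bellman-updated atom locations, clipped to the support range
    tz = (rewards.unsqueeze(1) + gamma * (1 - dones).unsqueeze(1) * z).clamp(v_min, v_max)
    b = (tz - v_min) / delta_z                                    # fractional index of each atom
    lower, upper = b.floor().long(), b.ceil().long()
    target = torch.zeros(batch, n_atoms)
    for i in range(batch):
        for j in range(n_atoms):
            l, u = lower[i, j].item(), upper[i, j].item()
            if l == u:                                            # landed exactly on an atom
                target[i, l] += next_probs[i, j]
            else:                                                 # split mass between neighbours
                target[i, l] += next_probs[i, j] * (u - b[i, j])
                target[i, u] += next_probs[i, j] * (b[i, j] - l)
    return target

The network is then trained with the cross-entropy between this projected target and the predicted distribution for the action taken.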

Submit the following material via LEARN: