
CS885 Fall 2021 - Reinforcement Learning

There will be three assignments, each worth 10% of the final grade. Assignments are done individually (i.e., no teams). Each assignment will have a programming part to be done in Python. Some assignments will make use of PyTorch, TensorFlow and OpenAI Gym. For GPU or TPU acceleration, feel free to use Google's Colaboratory environment. This is a free cloud service where you can run Python code (including TensorFlow and PyTorch, which are pre-installed) with GPU or TPU acceleration. A virtual machine with two CPUs and one GPU or TPU will run for up to 12 hours, after which it must be restarted. The following steps are recommended:
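Whichever setup you follow, a quick sanity check (a minimal sketch using PyTorch, which Colab pre-installs) is to confirm that a GPU is actually visible to your runtime:

    import torch

    # Report whether PyTorch can see a CUDA-capable GPU in this runtime.
    if torch.cuda.is_available():
        print("GPU available:", torch.cuda.get_device_name(0))
    else:
        print("No GPU detected; computations will fall back to the CPU.")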

The approximate out and due dates are listed with each assignment below.

Your assignment should be submitted electronically via LEARN. You can make as many submissions as you wish; your last submission will be marked. A late submission will incur a 2% penalty for every rounded-up hour past the deadline. For example, an assignment submitted 5 hours and 15 minutes late will receive a penalty of ceiling(5.25) * 2% = 12%. Assignments submitted more than 50 hours late will not be marked.

Assignment 1: out Sept 13, due Sept 24 (11:59 pm)

This assignment has three parts. [10 points total]

Part I [4 points]

In the first part, you will program value iteration, policy iteration and modified policy iteration for Markov decision processes in Python. More specifically, fill in the functions in the skeleton code of the file MDP.py. The file TestMDP.py contains the simple MDP example from Lecture 2a Slides 13-14. You can verify that your code runs without errors by executing "python TestMDP.py". Add print statements to this file to verify that the output of each function makes sense.
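For orientation, the value iteration portion boils down to repeatedly applying the Bellman optimality backup until the value function stops changing. The sketch below assumes a transition tensor T of shape |A| x |S| x |S| and a reward matrix R of shape |A| x |S|; it is an illustration only, and the conventions in the MDP.py skeleton may differ:

    import numpy as np

    def value_iteration(T, R, discount, tol=1e-6):
        # T: |A| x |S| x |S| transition probabilities T[a, s, s'].
        # R: |A| x |S| expected immediate rewards R[a, s].
        V = np.zeros(T.shape[1])
        while True:
            # Q[a, s] = R[a, s] + discount * sum_s' T[a, s, s'] * V[s']
            Q = R + discount * (T @ V)
            V_new = Q.max(axis=0)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q.argmax(axis=0)   # optimal values and a greedy policy
            V = V_new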

Submit the following material via LEARN:

Part II [3 points]

In the second part, you will program the Q-learning algorithm in Python. More specifically, fill in the functions in the skeleton code of the file RL.py. This file requires the file MDP.py that you programmed for Part I, so make sure to include it in the same directory. The file TestRL.py contains a simple RL problem to test your functions (i.e., the output of each function will be printed to the screen). You can verify that your code runs without errors by executing "python TestRL.py".
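For reference, the core of tabular Q-learning is a single update per observed transition: move Q(s, a) toward the reward plus the discounted maximum Q-value at the next state. The sketch below uses hypothetical helpers (env_reset and env_step) that stand in for whatever simulator interface RL.py provides:

    import numpy as np

    def q_learning_episode(env_reset, env_step, Q, n_actions,
                           discount=0.95, alpha=0.1, epsilon=0.1, max_steps=200):
        # Q is an |A| x |S| table; env_reset/env_step are hypothetical stand-ins.
        s = env_reset()
        for _ in range(max_steps):
            # Epsilon-greedy exploration.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[:, s]))
            r, s_next = env_step(s, a)
            # Q-learning target: reward plus discounted max over next-state actions.
            td_target = r + discount * np.max(Q[:, s_next])
            Q[a, s] += alpha * (td_target - Q[a, s])
            s = s_next
        return Q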

Submit the following material via LEARN:

Part III [3 points]

In the third part, you will train a deep Q-network to solve the CartPole problem from OpenAI Gym. This problem has continuous states that prevent the use of a tabular representation. Instead, you will use a neural network to represent the Q-function. Follow these steps to get started:
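As background for those steps, a minimal PyTorch Q-network for CartPole's 4-dimensional state and 2 actions might look like the sketch below (an illustration only, not the required architecture or the starter code):

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        # Maps a CartPole state (4 features) to one Q-value per action (2 actions).
        def __init__(self, state_dim=4, n_actions=2, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, state):
            return self.net(state)

    # Greedy action for a single (placeholder) observation; epsilon-greedy
    # exploration and the replay-buffer training loop would wrap this.
    q_net = QNetwork()
    action = q_net(torch.zeros(1, 4)).argmax(dim=1).item()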

Submit the following material via LEARN:

Assignment 2: out September 27, due October 8 (11:59 pm)

This assignment has two parts. [10 points total + 1 bonus point]

Part I [6 points]

In the first part, you will program three bandit algorithms (epsilon-greedy, Thompson sampling and UCB) and two RL algorithms (REINFORCE and model-based RL) in Python. More specifically, fill in the functions in the skeleton code of the file RL2.py. This file requires the file MDP.py that you programmed in Assignment 1, so make sure to include it in the same directory. The file TestRL2.py contains a simple bandit problem and a simple RL problem to test your functions (i.e., the output of each function will be printed to the screen). You can verify that your code runs without errors by executing "python TestRL2.py".
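As an illustration of the bandit portion, a UCB1-style selection rule adds an exploration bonus that shrinks as an arm is pulled more often. The sketch below uses its own variable names rather than the RL2.py function signatures:

    import numpy as np

    def ucb_select(counts, value_estimates, t, c=2.0):
        # counts[a]: pulls of arm a so far; value_estimates[a]: empirical mean reward.
        # t: current time step (>= 1); c: exploration constant.
        counts = np.asarray(counts, dtype=float)
        if np.any(counts == 0):
            return int(np.argmin(counts))       # pull each arm once first
        bonus = np.sqrt(c * np.log(t) / counts)
        return int(np.argmax(np.asarray(value_estimates) + bonus))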

Submit the following material via LEARN:

Part II [4 points + 1 bonus point]

In the second part of the assignment, you will program the Soft Q-Learning and Soft Actor Critic algorithms (see the module on Maximum Entropy Reinforcement Learning). You will test your implementation on the CartPole problem and, for the bonus, on the Pendulum problem. Start by downloading and running the following implementation of DQN in PyTorch. Then modify this DQN implementation to obtain Soft Q-Learning and Soft Actor Critic. Finally, for bonus marks, you may extend your Soft Actor Critic implementation to work with continuous actions and test it on the Pendulum environment.
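The key change from DQN to Soft Q-Learning is in the target: the hard max over next-state Q-values is replaced by a temperature-scaled log-sum-exp, which maximum entropy RL calls the soft value. A sketch of that target computation in PyTorch, with illustrative tensor names rather than those of the DQN code you will download:

    import torch

    def soft_q_targets(rewards, next_q_values, dones, gamma=0.99, alpha=0.2):
        # rewards, dones: shape (batch,); next_q_values: (batch, n_actions)
        # from the target network; alpha is the entropy temperature.
        # Soft value of the next state: alpha * log sum_a exp(Q(s', a) / alpha).
        soft_value = alpha * torch.logsumexp(next_q_values / alpha, dim=1)
        return rewards + gamma * (1.0 - dones) * soft_value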

Submit the following material via LEARN:

Assignment 3: out October 12, due October 27 (11:59 pm)

This assignment has three parts. [10 points total]

Part I - Partially Observable RL [3 points]

Starter code: cs885_fall21_a3_part1.zip

In this part, you will implement the Deep Recurrent Q-Network (DRQN) algorithm. You will test your implementation on a modified version of the CartPole domain. The CartPole environment will be a POMDP where the agent receives only a partial state (observation) of the environment instead of the full state. Start by downloading and running the implementation of DQN in PyTorch in the CartPole domain (make sure that all the packages in the given requirements.txt file are installed). You should run the train.py file in the DQN folder. Since the domain is partially observable, this DQN uses a stack of the last 4 observations as an approximation of the state; the stack helps the agent learn the temporal dependencies between states. Now implement the DRQN algorithm by filling in the functions in the template code of the DRQN folder. You need to modify the model.py file in the DRQN folder. Note that for this assignment, you are expected to use the provided partially observable environment API, which returns the partial state (observation) at each time step. You should not change this API to obtain the full state.
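The main architectural change from DQN to DRQN is replacing the stack of recent observations with a recurrent layer whose hidden state carries information across time steps. The sketch below is illustrative only; the template's model.py defines its own interface:

    import torch.nn as nn

    class DRQN(nn.Module):
        # Encodes each observation, then an LSTM summarizes the observation history.
        def __init__(self, obs_dim, n_actions, hidden=64):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
            self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_actions)

        def forward(self, obs_seq, hidden_state=None):
            # obs_seq: (batch, seq_len, obs_dim); hidden_state carries memory
            # between calls during rollouts.
            x = self.encoder(obs_seq)
            x, hidden_state = self.lstm(x, hidden_state)
            return self.head(x), hidden_state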

Submit the following material via LEARN:

Part II - Imitation Learning [3 points]

Starter code: cs885_fall21_a3_part2.zip

Task Description. In this part of the assignment, you will program a Generative Adversarial Imitation Learning (GAIL) algorithm. Inside GAIL, update the policy parameters with a deterministic policy gradient update instead of Trust Region Policy Optimization (TRPO), since we are using a deterministic policy for this part. The value of lambda in the GAIL pseudocode should be set to 0 for this assignment. More specifically, fill in the function train() in the class GAIL(). You will find an example implementation of Behavior Cloning (BC) as discussed in the lectures; it is suggested that you start from this example as you work on your implementation.
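For orientation, the discriminator half of GAIL is a binary classifier trained to separate expert state-action pairs from pairs generated by the current policy, and its output is then turned into a surrogate reward for the policy update. A sketch of one discriminator step in PyTorch, with illustrative names that do not match the starter code:

    import torch
    import torch.nn.functional as F

    def discriminator_step(disc, disc_opt, expert_sa, policy_sa):
        # disc maps a concatenated (state, action) batch to one logit per row.
        expert_logits = disc(expert_sa)
        policy_logits = disc(policy_sa)
        # Expert pairs labelled 1, policy pairs labelled 0.
        loss = (F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits))
                + F.binary_cross_entropy_with_logits(policy_logits, torch.zeros_like(policy_logits)))
        disc_opt.zero_grad()
        loss.backward()
        disc_opt.step()
        # One common surrogate reward for the policy update: -log(1 - D(s, a)).
        with torch.no_grad():
            reward = -F.logsigmoid(-disc(policy_sa))
        return loss.item(), reward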

Running Requirements. This code uses the BipedalWalker-v2 environment from OpenAI Gym. The expert trajectories can be found in expert_traj/BipedalWalker-v2, and the required Python packages are listed in the file requirement.txt. Make sure to install these packages before running the code; note that some of the package versions needed for this part differ from those of the previous parts. All the configurations are defined in the function init_config(). Please use the default model parameters during training, but if you are not using a GPU, set --device to cpu (for example: python irl_BipedalWalker.py --device cpu).

Submit the following material via LEARN:

Part III - Distributional RL [4 points]

Starter code: cs885_fall21_a3_part3.zip

In this part, you will program the categorical (C51) distributional RL algorithm. The environment will be a modified version of the CartPole domain. In this domain, noise sampled uniformly at random from the range (-0.05, 0.05) is added to the magnitude of the force applied to the pole every time an action is taken. In addition, the cart and the surface have a constant friction of 5 x 10^-4 units and the pole has an air drag (friction) of 2 x 10^-6 units. These changes turn the deterministic CartPole environment into a stochastic one. Start by downloading and running the provided implementation of DQN in CartPole (run the dqn.py file in the DQN folder). Now implement the C51 algorithm by modifying the functions in the c51.py file in the C51 folder. Specifically, you will need to implement the functions train_networks and compute_targets.
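The heart of C51 is projecting the Bellman-updated target distribution back onto the fixed support of atoms. The sketch below shows that projection in PyTorch with illustrative names; the actual work for this part belongs in compute_targets and train_networks in c51.py:

    import torch

    def project_distribution(next_probs, rewards, dones, support,
                             gamma=0.99, v_min=-10.0, v_max=10.0):
        # next_probs: (batch, n_atoms) probabilities of the greedy next action.
        # rewards, dones: (batch,); support: (n_atoms,) evenly spaced atom values.
        batch, n_atoms = next_probs.shape
        delta_z = (v_max - v_min) / (n_atoms - 1)
        # Bellman update of each atom, clipped to the support range.
        tz = (rewards.unsqueeze(1)
              + gamma * (1.0 - dones).unsqueeze(1) * support.unsqueeze(0)).clamp(v_min, v_max)
        b = (tz - v_min) / delta_z                    # fractional atom index
        lower, upper = b.floor().long(), b.ceil().long()
        target = torch.zeros_like(next_probs)
        # Split each atom's probability mass between its two neighbouring atoms.
        target.scatter_add_(1, lower, next_probs * (upper.float() - b))
        target.scatter_add_(1, upper, next_probs * (b - lower.float()))
        # If b lands exactly on an atom, both weights above are zero; keep the mass there.
        target.scatter_add_(1, lower, next_probs * (upper == lower).float())
        return target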

Submit the following material via LEARN: