There will be three assignments, each worth 10% of the final grade. Assignments are done individually (i.e., no teams). Each assignment will have a programming part to be done in Python. Some assignments will make use of PyTorch, TensorFlow and OpenAI Gym. For GPU or TPU acceleration, feel free to use Google's Colaboratory environment. This is a free cloud service where you can run Python code (including TensorFlow and PyTorch, which are pre-installed) with GPU or TPU acceleration. A virtual machine with two CPUs and one GPU or TPU will run for up to 12 hours, after which it must be restarted. The following steps are recommended:
- Create a Python notebook in Google Colab
- Click on "edit", then "notebook settings" and select "None" (CPU), "GPU" or "TPU" for hardware acceleration.
The approximate out and due dates are:
- A1: out Sept 14, due Sept 28 (11:59 pm)
- A2: out Sept 28, due Oct 19 (11:59 pm)
- A3: out Oct 19, due Nov 2 (11:59 pm)
Your assignment should be submitted electronically via LEARN. You can make as many submissions as you wish; only your last submission will be marked. A late submission will incur a 2% penalty for every rounded-up hour past the deadline. For example, an assignment submitted 5 hours and 15 minutes late will receive a penalty of ceiling(5.25) * 2% = 12%. Assignments submitted more than 50 hours late will not be marked.
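Stated as code (an illustrative restatement of the rule above, not an official calculator):

```python
import math

def late_penalty_percent(hours_late):
    """2% per rounded-up hour late; submissions more than 50 hours late are not marked."""
    if hours_late > 50:
        return None  # not marked
    return math.ceil(hours_late) * 2

print(late_penalty_percent(5.25))  # 12, matching the example above
```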
Assignment 1: out Sept 14, due Sept 28 (11:59 pm)
This assignment has three parts. [10 points total]
Part I [4 points]
In the first part, you will program value iteration, policy iteration and modified policy iteration for Markov decision processes in Python. More specifically, fill in the functions in the skeleton code of the file MDP.py. The file TestMDP.py contains the simple MDP example from Lecture 1b Slides 17-18. You can verify that your code runs properly with TestMDP.py by running "python TestMDP.py". Add print statements to this file to verify that the output of each function makes sense. (A minimal sketch of the value iteration update is shown after the list below for orientation.)
- Skeleton code: MDP.py
- Simple MDP from Lecture 1b Slides 17-18 to test your code: TestMDP.py
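For orientation only (a minimal sketch, not the required interface of MDP.py, whose skeleton defines the exact function signatures to fill in), value iteration on an MDP given as a transition tensor T[a, s, s'] and a reward matrix R[a, s] can be written as:

```python
import numpy as np

def value_iteration(T, R, discount, tol=0.01, V0=None):
    """Sketch of value iteration; assumed shapes: T[a, s, s'], R[a, s]."""
    n_actions, n_states, _ = T.shape
    V = np.zeros(n_states) if V0 is None else np.asarray(V0, dtype=float)
    n_iter = 0
    while True:
        # Q[a, s] = R[a, s] + discount * sum_s' T[a, s, s'] * V[s']
        Q = R + discount * (T @ V)
        V_new = Q.max(axis=0)          # greedy backup over actions
        n_iter += 1
        if np.abs(V_new - V).max() <= tol:
            break
        V = V_new
    return Q.argmax(axis=0), V_new, n_iter   # greedy policy, values, iteration count
```

Policy iteration and modified policy iteration reuse the same backup but alternate it with full or partial policy evaluation.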
Submit the following material via LEARN:
- Your Python code.
- Test your code with the maze problem described in TestMDPmaze.py (typo corrected in the transition probabilities on Sept 19).
- Report the policy, value function and number of iterations needed by value iteration when using a tolerance of 0.01 and starting from a value function set to 0 for all states. [1 point]
- Report the policy, value function and number of iterations needed by policy iteration to find an optimal policy when starting from the policy that chooses action 0 in all states. [1 point]
- Report the number of iterations needed by modified policy iteration to converge when varying the number of iterations in partial policy evaluation from 1 to 10. Use a tolerance of 0.01, start with the policy that chooses action 0 in all states and start with the value function that assigns 0 to all states. [1 point] Discuss the impact of the number of iterations in partial policy evaluation on the results and relate the results to value iteration and policy iteration. [1 point]
Part II [3 points]
In the second part, you will program the Q-learning algorithm in Python. More specifically, fill in the functions in the skeleton code of the file RL.py. This file requires the file MDP.py that you programmed for Part I, so make sure to include it in the same directory. The file TestRL.py contains a simple RL problem to test your functions (i.e., the output of each function will be printed to the screen). You can verify that your code runs properly by running "python TestRL.py". (The core tabular update is sketched after the list below for reference.)
- Skeleton code: RL.py (requires MDP.py from part I)
- Simple RL problem to test your code: TestRL.py
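For reference, the core tabular Q-learning update is shown below (a minimal sketch; the skeleton in RL.py specifies the exact interface, exploration scheme and learning-rate schedule to use):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, discount):
    """One tabular Q-learning step on a Q table of shape [n_actions, n_states]."""
    td_target = r + discount * Q[:, s_next].max()
    Q[a, s] += alpha * (td_target - Q[a, s])
    return Q
```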
Submit the following material via LEARN:
- Your Python code. Test your code with the maze problem described in TestRLmaze.py (same maze problem as in Part I; typo corrected in the transition probabilities on Sept 19).
- Graph 1 [1 point]: Produce a first graph where the x-axis indicates the episode # (from 0 to 200) and the y-axis indicates the average (based on 100 trials) of the cumulative discounted rewards per episode (100 steps). The graph should contain 3 curves corresponding to the exploration probability epsilon=0.1, 0.3 and 0.5 (set temperature=0). The initial state is 0 and the initial Q-function is 0 for all state-action pairs.
- Graph 2 [1 point]: Produce a second graph where the x-axis indicates the episode # (from 0 to 200) and the y-axis indicates the average (based on 100 trials) of the cumulative discounted rewards per episode (100 steps). The graph should contain 3 curves corresponding to the Boltzmann exploration temperature=0, 10 and 20 (set epsilon=0). The initial state is 0 and the initial Q-function is 0 for all state-action pairs.
- Discussion [1 point]: Explain the impact of the exploration probability epsilon and the Boltzmann temperature on the cumulative discounted rewards per episode earned during training, as well as on the resulting Q-values and policy. (Both exploration strategies are sketched after this list for reference.)
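For reference, the two exploration strategies can be sketched as follows (a minimal sketch, assuming a Q table of shape [n_actions, n_states]; not the required RL.py interface):

```python
import numpy as np

def select_action(Q, s, epsilon=0.0, temperature=0.0, rng=None):
    """Epsilon-greedy when epsilon > 0, Boltzmann when temperature > 0, greedy otherwise."""
    if rng is None:
        rng = np.random.default_rng()
    q = Q[:, s]
    if epsilon > 0 and rng.random() < epsilon:
        return int(rng.integers(len(q)))              # explore uniformly at random
    if temperature > 0:
        probs = np.exp((q - q.max()) / temperature)   # Boltzmann (softmax) exploration
        probs /= probs.sum()
        return int(rng.choice(len(q), p=probs))
    return int(q.argmax())                            # greedy action
```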
Part III [3 points]
In the third part, you will train a deep Q-network to solve the CartPole problem from OpenAI Gym. This problem has continuous states that prevent the use of a tabular representation. Instead, you will use a neural network to represent the Q-function. Start from the DQN code provided and modify it as described below.
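For orientation (a minimal PyTorch sketch, not the provided starter code), a DQN for CartPole pairs a small Q-network with a periodically refreshed target network and a mean-squared TD loss; the state dimension of 4 and the 2 actions match the standard CartPole environment:

```python
import copy
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small MLP mapping a CartPole state (4 dims) to Q-values for the 2 actions."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

q_net = QNetwork()
target_net = copy.deepcopy(q_net)   # frozen copy, refreshed every few episodes

def td_loss(batch, discount=0.99):
    """Mean-squared TD error on a mini-batch (states, long actions, rewards, next_states, float dones)."""
    states, actions, rewards, next_states, dones = batch
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + discount * (1 - dones) * q_next
    return nn.functional.mse_loss(q, target)
```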
Submit the following material via LEARN:
- Target network update frequency [1 point]: Modify the DQN code provided to produce a graph where the y-axis is the average cumulative reward of the last 25 episodes and the x-axis is the # of episodes up to 300 episodes. The graph should contain 4 curves corresponding to updating the target network every 1, 10 (default), 50, 100 episode(s). To reduce stochasticity in the results, report curves that are the average of 5 runs corresponding to 5 random seeds. Based on the results, explain the impact of the target network and relate the target network to value iteration.
- Mini-batch size [1 point]: Modify the DQN code provided to produce a graph where the y-axis is the average cumulative reward of the last 25 episodes and the x-axis is the # of episodes up to 300 episodes. The graph should contain 4 curves corresponding to sampling mini-batches of 1, 10 (default), 50 and 100 experience(s) from the replay buffer. To reduce stochasticity in the results, report curves that are the average of 5 runs corresponding to 5 random seeds. Based on the results, explain the impact of the mini-batch size and relate the replay buffer mini-batch size to exact gradient descent.
- Replay buffer training epochs [1 point]: Modify the DQN code provided to produce a graph where the y-axis is the average cumulative reward of the last 25 episodes and the x-axis is the # of episodes up to 300 episodes. The graph should contain 4 curves corresponding to training on the replay buffer for 1, 5, 10 and 50 epochs between each episode. To reduce stochasticity in the results, report curves that are the average of 5 runs corresponding to 5 random seeds. Based on the results, explain the impact of the number of training epochs per episode and explain similarities/differences between the number of training epochs and the mini-batch size. (A schematic training loop showing where these three hyperparameters enter is sketched after this list.)
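Continuing the sketch above, the outer loop below shows where the three hyperparameters varied in this part enter; episode collection is elided, and the constant names and values are illustrative rather than those used in the starter code:

```python
import random
import torch

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = []                # (state, action, reward, next_state, done) transitions

TARGET_UPDATE_EVERY = 10   # episodes between target-network refreshes
BATCH_SIZE = 10            # experiences sampled per mini-batch
EPOCHS_PER_EPISODE = 5     # gradient passes over the replay buffer after each episode

for episode in range(300):
    # ... run one episode with the current policy, appending transitions to buffer ...
    for _ in range(EPOCHS_PER_EPISODE):
        if len(buffer) < BATCH_SIZE:
            break
        batch = [torch.stack(x) for x in zip(*random.sample(buffer, BATCH_SIZE))]
        loss = td_loss(batch)                    # TD loss from the sketch above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if episode % TARGET_UPDATE_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())
```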
Assignment 2: out September 28, due October 19 (11:59 pm)
This assignment has three parts. [10 points total]
Part I [4 points]
In the first part, you will program three bandit algorithms (epsilon-greedy, Thompson sampling and UCB) in Python. More specifically, fill in the functions in the skeleton code of the file RL2.py. This file requires the file MDP.py that you programmed in Assignment 1, so make sure to include it in the same directory. The file TestBandit.py contains a simple bandit problem to test your functions (i.e., the output of each function will be printed to the screen). You can verify that your code runs properly by running "python TestBandit.py".
- Skeleton code: RL2.py (requires MDP.py from assignment 1)
- Simple Bandit problem to test your code: TestBandit.py
Submit the following material via LEARN:
- Your Python code.
- Graph [3 points]: Test your code for UCB, epsilon-greedy bandit and Thompson sampling with the simple multi-armed bandit problem described in TestBandit.py. Produce a graph where the x-axis indicates the iteration # (from 0 to 200) and the y-axis indicates the average (based on 1000 trials) of the rewards earned at each iteration. The graph should contain 3 curves corresponding to the following strategies (minimal sketches follow this list):
- UCB
- Epsilon-greedy bandit (epsilon = 1 / # iterations)
- Thompson sampling (k=1 and a prior consisting of Beta distributions with all hyperparameters set to 1)
- Explain the results [1 point]: Are the results surprising? Do they match what is expected based on the theory?
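For reference, minimal sketches of the three action-selection rules are shown below (assuming Bernoulli arms with Beta posteriors, consistent with the Beta(1,1) priors mentioned above, and an illustrative UCB constant c; RL2.py's skeleton defines the interface you must actually implement):

```python
import numpy as np

rng = np.random.default_rng()

def ucb_action(counts, means, t, c=2.0):
    """UCB: try every arm once, then pick the arm with the largest upper confidence bound."""
    if (counts == 0).any():
        return int(np.argmin(counts))
    return int(np.argmax(means + np.sqrt(c * np.log(t) / counts)))

def epsilon_greedy_action(means, epsilon):
    """Epsilon-greedy: a random arm with probability epsilon, otherwise the empirical best arm."""
    if rng.random() < epsilon:
        return int(rng.integers(len(means)))
    return int(np.argmax(means))

def thompson_action(alphas, betas):
    """Thompson sampling: draw one sample per arm from its Beta posterior, then act greedily."""
    return int(np.argmax(rng.beta(alphas, betas)))
```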
Part II [3 points]
In the second part of the assignment, you will program the REINFORCE algorithm with a baseline and the Proximal Policy Optimization (PPO) algorithm. Use the starter code provided. A complete REINFORCE implementation is included; fill in the train function for REINFORCE with baseline and for PPO. You will test your code on the CartPole problem (the same as in Assignment 1).
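For orientation, the two losses you will be filling in can be sketched as follows (a minimal sketch of the standard objectives; the starter code's train function defines the exact structure, advantage estimates and hyperparameters to use):

```python
import torch
import torch.nn.functional as F

def reinforce_baseline_loss(log_probs, returns, values):
    """REINFORCE with a learned value baseline: advantage-weighted log-probabilities
    plus a value-regression term (all arguments are 1-D tensors over one batch)."""
    advantages = returns - values.detach()
    policy_loss = -(log_probs * advantages).mean()
    value_loss = F.mse_loss(values, returns)
    return policy_loss + value_loss

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective for one mini-batch."""
    ratio = torch.exp(log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```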
Submit the following material via LEARN:
- Your Python code.
- Graphs [2 points]: Produce 3 graphs that show the performance of REINFORCE, REINFORCE with baseline and PPO on the CartPole problem. Running "python REINFORCE.py --mode=cartpole" produces the graph for REINFORCE, which is saved in the "images" directory. For each graph, the y-axis corresponds to the cumulative rewards (averaged over the last 25 episodes) and the x-axis is the number of episodes (up to 800 for REINFORCE and REINFORCE with baseline, and up to 150 for PPO).
- Explanation [1 point]: Explain the results based on the properties of each algorithm.
Part III [3 points]
In the third part of the assignment, you will program the conservative Q-learning (CQL) algorithm for offline RL. Download the starter code below. Follow the instructions in readme.md to create a conda environment with the specified packages. Download the data that will be used for offline RL. Then run the skeleton code with the offline data for CartPole. The skeleton code includes a fully specified version of offline DQN and a partially specified version of CQL. Fill in the learn function of CQL in cql.py.
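For orientation (a minimal sketch of the discrete-action CQL penalty; the learn function in cql.py defines the actual interface, and the penalty weight named cql_alpha here is an illustrative name):

```python
import torch

def cql_loss(q_all, q_data, td_loss, cql_alpha=1.0):
    """Conservative Q-learning loss for discrete actions (a sketch).
    q_all:   Q-values for all actions, shape [batch, n_actions]
    q_data:  Q-values of the actions actually taken in the offline data, shape [batch]
    td_loss: the ordinary (offline) DQN TD loss for the same batch"""
    conservative_penalty = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return td_loss + cql_alpha * conservative_penalty
```

The penalty pushes Q-values down on all actions (via the log-sum-exp) while pushing them up on the actions present in the dataset, which is what keeps the learned Q-function conservative on out-of-distribution actions.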
Submit the following material via LEARN:
- Your Python code.
- Graph [2 points]: Produce a graph that shows the performance of offline DQN and CQL with the offline data for CartPole. The y-axis corresponds to the cumulative rewards (averaged over the last 10 episodes) and the x-axis is the number of episodes (up to 200). Feel free to take the average of more than one seed to reduce the noise in your graph.
- Explanation [1 point]: Explain the results based on the properties of each algorithm.
Assignment 3: out October 19, due November 2 (11:59 pm)
This assignment has two parts. [10 points total]
Part I - Constrained RL [5 points]
Starter code: a3_part1_starter_code.zip
In this part, you will program the PPO Penalty algorithm for constrained RL. The environment will be a modification of the CartPole Gym environment, where a cost (or constraint value) of 1 is incurred for every timestep that the agent spends in the region x >= 1. For reference, the agent is normally free to move in the interval [-2.4, 2.4] in the CartPole environment. Start by downloading and running the provided implementation of PPO (ppo.py) with the given CartPole environment (the new environment is called ConstrainedCartPole). Then program the PPO Penalty algorithm in a new file called ppopenalty.py by modifying the PPO algorithm in the file ppo.py. Run PPO and PPO Penalty for 300 epochs and average the curves over 5 seeds, optionally showing +/- one standard deviation in the graph. Additionally, you may average the results across the past few timesteps to obtain smoother curves.
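For orientation (a minimal sketch of one fixed-coefficient form of the penalized actor update, under the assumption that the reward advantage and the cost advantage are combined linearly; follow the pseudocode referred to in the items below for the exact update):

```python
import torch

def ppo_penalty_actor_loss(log_probs, old_log_probs, reward_adv, cost_adv,
                           beta=5.0, clip_eps=0.2):
    """PPO clipped surrogate applied to the penalized advantage A_r - beta * A_c (a sketch)."""
    penalized_adv = reward_adv - beta * cost_adv
    ratio = torch.exp(log_probs - old_log_probs.detach())
    unclipped = ratio * penalized_adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * penalized_adv
    return -torch.min(unclipped, clipped).mean()
```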
- [2 points] Implement the PPO Penalty algorithm by copying and modifying ppo.py to create ppopenalty.py. Specifically, incorporate the actor update for constrained RL and other necessary changes from the pseudocode.
- [2 points] Produce an episodic reward graph and an episodic constraint value graph to compare PPO against PPO Penalty. The y-axis of the first graph is the average cumulative reward per episode and the y-axis of the second graph is the average cumulative constraint value per episode. The x-axis of both graphs is the number of episodes, up to 300 episodes. Each graph will contain 4 curves corresponding to PPO and to PPO Penalty with beta = 1, 5 and 10.
- [1 point] Explain the results. Discuss the differences between PPO and PPO Penalty and discuss the effect of beta in PPO Penalty on the cumulative rewards and the cumulative constraint value.
Submit the following material via LEARN:
- Your implementation of PPO Penalty (ppopenalty.py), which will be a modified version of the ppo.py file.
- A report containing the plots pertaining to the PPO and PPO Penalty algorithms with your explanation of the results.
Part II - Distributional RL [5 points]
Starter code: a3_part2_starter_code.zip
In this part, you will program the categorical (C51) distributional RL algorithm. The environment will be a modified version of the CartPole domain. In this domain, noise sampled uniformly at random from the range (-0.05, 0.05) is added to the magnitude of the force applied to the pole every time an action is taken. In addition, the cart and the surface have a constant friction of 5 x 10^-4 units and the pole has an air drag (friction) of 2 x 10^-6 units. These changes turn the deterministic CartPole environment into a stochastic one. Start by downloading and running the provided implementation of DQN on CartPole (run the DQN.py file). Then implement the C51 algorithm by modifying the functions in the C51.py file in the C51 folder.
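For orientation (a minimal sketch of the categorical projection step at the heart of C51; the support bounds v_min/v_max and the 51 atoms are illustrative defaults, and C51.py defines the actual interface of update_networks):

```python
import torch

def project_distribution(next_probs, rewards, dones, discount,
                         v_min=-10.0, v_max=10.0, n_atoms=51):
    """Project the Bellman-updated return distribution back onto the fixed support.
    next_probs: [batch, n_atoms] probabilities from the target Z network for the greedy action
    rewards, dones: [batch] float tensors"""
    delta_z = (v_max - v_min) / (n_atoms - 1)
    support = torch.linspace(v_min, v_max, n_atoms)
    # Shift and shrink the support atoms through the distributional Bellman backup.
    tz = (rewards.unsqueeze(1) + discount * (1 - dones).unsqueeze(1) * support).clamp(v_min, v_max)
    b = (tz - v_min) / delta_z                       # fractional atom positions
    lower, upper = b.floor().long(), b.ceil().long()
    # When b lands exactly on an atom, lower == upper; nudge so no probability mass is dropped.
    lower[(upper > 0) & (lower == upper)] -= 1
    upper[(lower < n_atoms - 1) & (lower == upper)] += 1
    projected = torch.zeros_like(next_probs)
    # Split each atom's probability mass between its two neighbouring support atoms.
    projected.scatter_add_(1, lower, next_probs * (upper.float() - b))
    projected.scatter_add_(1, upper, next_probs * (b - lower.float()))
    return projected
```

The update then minimizes the cross-entropy between this projected target and the online Z network's predicted distribution for the taken actions, and the greedy policy picks the action whose distribution has the largest expected value under the support.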
- Produce a graph that shows the performance of the given implementation of DQN on the CartPole domain with (annealed) epsilon-greedy exploration. This will serve as a baseline for comparing the performance of the C51 algorithm. Run the given code without any modification and plot a graph where the x-axis is the number of episodes (at least 200 episodes) and the y-axis is the reward obtained per episode. You can use a running average to smooth the plot if you so desire. Report curves that are the average of multiple runs of DQN (with different random seeds).
- [2 points] Implement the C51 algorithm by modifying the C51.py file. You will need to implement update_networks.
- [1 point] Adjust the policy function to use the Z network.
- [1 point] Produce a graph that shows the performance of the C51 algorithm on the CartPole domain. As in the DQN case, the x-axis is the number of episodes (500 episodes) and the y-axis is the reward obtained per episode. You can use a running average to smooth the plot if you so desire. Report curves that are the average of multiple runs of C51 (with different random seeds).
- [1 point] Based on the results, comment on the differences in the performance of C51 and DQN. What do the results suggest about the respective algorithms?
Submit the following material via LEARN:
- Your implementation of C51, which will be a modified version of the C51.py file.
- A report containing the plots pertaining to C51 and DQN with your explanation of the results.