There will be three assignments, each worth 10% of the final grade. Assignments are done individually (i.e., no teams). Each assignment will have a programming part to be done in Python. Some assignments will make use of PyTorch, TensorFlow and OpenAI Gym. For GPU or TPU acceleration, feel free to use Google's Colaboratory environment, a free cloud service where you can run Python code (TensorFlow and PyTorch come pre-installed) with GPU or TPU acceleration. A virtual machine with two CPUs and one GPU or TPU will run for up to 12 hours, after which it must be restarted. The following steps are recommended:
- Create a Python notebook in Google Colab
- Click on "edit", then "notebook settings" and select "None" (CPU), "GPU" or "TPU" for hardware acceleration.
The approximate out and due dates are:
- A1: out Jan 10, due Jan 21 (11:59 pm)
- A2: out Jan 24, due Feb 4 (11:59 pm)
- A3: out Feb 7, due Feb 18 (11:59 pm)
Your assignment should be submitted electronically via LEARN. You can make as many submissions as you wish; only your last submission will be marked. A late submission will incur a 2% penalty for every hour past the deadline, rounded up. For example, an assignment submitted 5 hours and 15 minutes late will receive a penalty of ceiling(5.25) * 2% = 12%. Assignments submitted more than 50 hours late will not be marked.
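For concreteness, the late-penalty rule amounts to the following calculation (a minimal sketch; the function name is illustrative and not part of any course code):

    import math

    def late_penalty(hours_late: float) -> float:
        """Late penalty as a fraction of the assignment grade (illustrative only)."""
        if hours_late <= 0:
            return 0.0                       # on time: no penalty
        if hours_late > 50:
            return 1.0                       # more than 50 hours late: not marked
        return 0.02 * math.ceil(hours_late)  # 2% per hour, rounded up

    print(late_penalty(5.25))  # 0.12, i.e., a 12% penalty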
Assignment 1: out Jan 10, due Jan 21 (11:59 pm)
This assignment has three parts. [10 points total]
Part I [4 points]
In the first part, you will program value iteration, policy iteration and modified policy iteration for Markov decision processes in Python. More specifically, fill in the functions in the skeleton code of the file MDP.py. The file TestMDP.py contains the simple MDP example from Lecture 2a Slides 13-14. You can verify that your code runs properly with TestMDP.py by running "python TestMDP.py". Add print statements to this file to verify that the output of each function makes sense.
- Skeleton code: MDP.py
- Simple MDP from Lecture 2a Slides 13-14 to test your code: TestMDP.py
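For orientation only, a rough sketch of the value iteration loop is shown below, assuming a transition array T of shape |A| x |S| x |S| and a reward array R of shape |A| x |S|; the array layout, names and stopping rule are assumptions and may differ from the MDP.py skeleton, which defines the actual interface you must fill in.

    import numpy as np

    def value_iteration_sketch(T, R, discount, tol=0.01, V0=None):
        """Illustrative value iteration (not the MDP.py interface).

        T: |A| x |S| x |S| transition probabilities (assumed layout)
        R: |A| x |S| rewards (assumed layout)
        Returns the value function, a greedy policy and the iteration count.
        """
        n_actions, n_states, _ = T.shape
        V = np.zeros(n_states) if V0 is None else V0.copy()
        n_iter = 0
        while True:
            # Q(a, s) = R(a, s) + discount * sum_s' T(a, s, s') V(s')
            Q = R + discount * (T @ V)
            V_new = Q.max(axis=0)
            n_iter += 1
            if np.max(np.abs(V_new - V)) <= tol:
                V = V_new
                break
            V = V_new
        policy = Q.argmax(axis=0)
        return V, policy, n_iter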
Submit the following material via LEARN:
- Your Python code.
- Test your code with the maze problem described in TestMDPmaze.py.
- Report the policy, value function and number of iterations needed by value iteration when using a tolerance of 0.01 and starting from a value function set to 0 for all states. [1 point]
- Report the policy, value function and number of iterations needed by policy iteration to find an optimal policy when starting from the policy that chooses action 0 in all states. [1 point]
- Report the number of iterations needed by modified policy iteration to converge when varying the number of iterations in partial policy evaluation from 1 to 10. Use a tolerance of 0.01, start with the policy that chooses action 0 in all states and start with the value function that assigns 0 to all states. [1 point] Discuss the impact of the number of iterations in partial policy evaluation on the results and relate the results to value iteration and policy iteration. [1 point]
Part II [3 points]
In the second part, you will program the Q-learning algorithm in Python. More specifically, fill in the functions in the skeleton code of the file RL.py. This file requires the file MDP.py that you programmed for Part I, so make sure to include it in the same directory. The file TestRL.py contains a simple RL problem to test your functions (i.e., the output of each function will be printed to the screen). You can verify that your code runs properly by running "python TestRL.py".
- Skeleton code: RL.py (requires MDP.py from part I)
- Simple RL problem to test your code: TestRL.py
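As a reminder of the update rule involved, a rough tabular Q-learning sketch follows; the Q-array layout (|A| x |S|) and the epsilon-greedy helper are illustrative assumptions, not the RL.py interface.

    import numpy as np

    def q_learning_step(Q, s, a, r, s_next, alpha, discount):
        """One tabular Q-learning update (illustrative): Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
        td_target = r + discount * np.max(Q[:, s_next])
        Q[a, s] += alpha * (td_target - Q[a, s])
        return Q

    def epsilon_greedy(Q, s, epsilon, rng=np.random):
        """Pick a random action with probability epsilon, otherwise a greedy one."""
        n_actions = Q.shape[0]
        if rng.rand() < epsilon:
            return rng.randint(n_actions)
        return int(np.argmax(Q[:, s]))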
Submit the following material via LEARN:
- Your Python code.
- Test your code with the maze problem described in TestRLmaze.py (same maze problem as in Part I) [1 point]. Produce a graph where the x-axis indicates the episode # (from 0 to 200) and the y-axis indicates the average (based on 100 trials) of the cumulative discounted rewards per episode (100 steps). The graph should contain 4 curves corresponding to the exploration probability epsilon=0.05, 0.1, 0.3 and 0.5 [1 point]. The initial state is 0 and the initial Q-function is 0 for all state-action pairs. Explain the impact of the exploration probability epsilon on the cumulative discounted rewards per episode earned during training as well as the resulting Q-values and policy. [1 point]
Part III [3 points]
In the third part, you will train a deep Q-network to solve the CartPole problem from OpenAI Gym. This problem has continuous states that prevent the use of a tabular representation. Instead, you will use a neural network to represent the Q-function. Start from the DQN code provided.
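The provided DQN code defines its own network; purely to illustrate what representing the Q-function with a neural network means for CartPole's 4-dimensional continuous state, a sketch is shown below (the architecture, layer sizes and PyTorch usage are assumptions, not the provided code).

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Maps a continuous CartPole state (4 numbers) to one Q-value per action (2 actions)."""
        def __init__(self, state_dim=4, n_actions=2, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, state):
            return self.net(state)

    # Greedy action for a single state (exploration, e.g. epsilon-greedy, is handled separately):
    q_net = QNetwork()
    state = torch.zeros(1, 4)                  # placeholder CartPole observation
    action = q_net(state).argmax(dim=1).item()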
Submit the following material via LEARN:
- Modify the DQN code provided to produce a graph where the y-axis is the average cumulative reward of the last 25 episodes and the x-axis is the episode # (up to 300 episodes). The graph should contain 4 curves corresponding to updating the target network every 1, 10 (default), 50 and 100 episode(s). To reduce stochasticity in the results, report curves that are the average of 5 runs corresponding to 5 random seeds [1 point]. Based on the results, explain the impact of the target network and relate the target network to value iteration. [1 point]
- Modify the DQN code provided to produce a graph where the y-axis is the average cumulative reward of the last 25 episodes and the x-axis is the episode # (up to 300 episodes). The graph should contain 4 curves corresponding to sampling mini-batches of 1, 10 (default), 50 and 100 experience(s) from the replay buffer. To reduce stochasticity in the results, report curves that are the average of 5 runs corresponding to 5 random seeds. Based on the results, explain the impact of the replay buffer and relate the replay buffer to exact gradient descent. [1 point] (A rough sketch of where the target-network period and the mini-batch size enter the training loop is given after this list.)
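As referenced above, the sketch below shows roughly where the target-network update period and the mini-batch size appear in a DQN training loop; the buffer layout, names and loss are assumptions, not the provided code.

    import random
    import torch
    import torch.nn.functional as F

    def dqn_update_sketch(q_net, target_net, buffer, optimizer, batch_size=10, discount=0.99):
        """One illustrative gradient step on a mini-batch sampled from the replay buffer."""
        # buffer is assumed to be a list of (s, a, r, s', done) tensors, with done stored as 0/1 floats.
        batch = random.sample(buffer, batch_size)
        s, a, r, s2, done = (torch.stack(x) for x in zip(*batch))
        with torch.no_grad():  # targets come from the frozen target network
            target = r + discount * (1 - done) * target_net(s2).max(dim=1).values
        q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(q_sa, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Every `target_update_period` episodes, copy the online weights into the target network:
    # target_net.load_state_dict(q_net.state_dict())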
Assignment 2: out January 24, due February 4 (11:59 pm)
This assignment has two parts. [10 points total + 1 bonus point]
Part I [5 points]
In the first part, you will program three bandit algorithms (epsilon-greedy, Thompson sampling and UCB) and one RL algorithm (model-based RL) in Python. More specifically, fill in the functions in the skeleton code of the file RL2.py. This file requires the file MDP.py that you programmed in Assignment 1, so make sure to include it in the same directory. The file TestRL2.py contains a simple bandit and a simple RL problem to test your functions (i.e., the output of each function will be printed to the screen). You can verify that your code runs properly by running "python TestRL2.py".
- Skeleton code: RL2.py (requires MDP.py from assignment 1)
- Simple bandit and RL problems to test your code: TestRL2.py
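As a quick reminder of the action-selection rules involved, the sketch below shows illustrative versions of UCB, Thompson sampling (with Beta posteriors over Bernoulli arms) and epsilon-greedy; the names and signatures are assumptions, not the RL2.py interface.

    import numpy as np

    def ucb_action(counts, value_estimates, t):
        """UCB1 (illustrative): try each arm once, then maximize mean estimate + sqrt(2 ln t / n_a)."""
        if np.any(counts == 0):
            return int(np.argmin(counts))
        return int(np.argmax(value_estimates + np.sqrt(2 * np.log(t) / counts)))

    def thompson_action(alpha, beta, rng=np.random):
        """Thompson sampling (illustrative): sample each arm's mean from its Beta posterior, pick the best."""
        return int(np.argmax(rng.beta(alpha, beta)))

    def epsilon_greedy_action(value_estimates, epsilon, rng=np.random):
        """Explore uniformly with probability epsilon, otherwise exploit the best empirical arm."""
        if rng.rand() < epsilon:
            return int(rng.randint(len(value_estimates)))
        return int(np.argmax(value_estimates))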
Submit the following material via LEARN:
- Your Python code.
- Test your code for model-based RL with the maze problem described in TestRL2Maze.py (same maze problem as in Assignment 1). Produce a graph where the x-axis indicates the episode # (from 0 to 200) and the y-axis indicates the average (based on 100 trials) of the cumulative discounted rewards per episode (100 steps) when each episode starts in state 0. The graph should contain 2 curves corresponding to
- Model-based RL (epsilon=0.05, the default transition function is uniform when a state-action pair has not been visited and the initial expected reward is 0 for all state-action pairs) [1 point]
- Q-learning (epsilon=0.05 and the initial Q-function is 0 for all state-action pairs)
Explain the results. Discuss how different properties of each algorithm influence the cumulative discounted rewards per episode earned during training as well as the resulting Q-values and policy. [1 point]
- Test your code for UCB, epsilon-greedy bandit and Thompson sampling with the simple multi-armed bandit problem described in TestBandit.py. Produce a graph where the x-axis indicates the iteration # (from 0 to 200) and the y-axis indicates the average (based on 1000 trials) of the reward earned at each iteration. The graph should contain 3 curves corresponding to
- UCB [0.5 point]
- Epsilon-greedy bandit (epsilon = 1 / # iterations) [0.5 point]
- Thompson sampling (k=1 and a prior consisting of Beta distributions with all hyperparameters set to 1) [1 point]
Explain the results. Are the results surprising? Do they match what is expected based on the theory? [1 point]
Part II [5 points + 1 bonus point]
In the second part of the assignment, you will program the REINFORCE algorithm with a baseline and the Proximal Policy Optimization (PPO) algorithm. Use the following starter code. The complete REINFORCE algorithm is provided. Fill in the train function for REINFORCE with baseline and for PPO. You will test your code on the cartpole problem (same as Assignment 1), the mountain-car problem and a modified version of the mountain-car problem.
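As a reminder of the objectives involved, the sketch below shows an illustrative REINFORCE-with-baseline loss and PPO's clipped surrogate loss; the function names and the clipping constant are assumptions, and the starter code defines the actual train functions you must fill in.

    import torch

    def reinforce_baseline_loss(log_probs, returns, values):
        """REINFORCE with baseline (illustrative): weight log-probs by the advantage G_t - V(s_t)."""
        advantages = returns - values.detach()   # the baseline reduces variance without adding bias
        return -(log_probs * advantages).mean()

    def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
        """PPO clipped surrogate objective (illustrative)."""
        ratio = torch.exp(new_log_probs - old_log_probs)   # pi_new(a|s) / pi_old(a|s)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()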
Submit the following material via LEARN:
- Your Python code.
- Produce 3 graphs that show the performance of REINFORCE, REINFORCE with baseline and PPO on the cartpole problem. Running "python REINFORCE.py --mode=cartpole" produces the graph for REINFORCE, which is saved in the "images" directory. For each graph, the y-axis corresponds to the cumulative rewards (averaged over the last 25 episodes) and the x-axis is the number of episodes (up to 800 for REINFORCE and REINFORCE with baseline, and up to 150 for PPO). [2 points] Explain the results based on the properties of each algorithm [1 point]. (Optional) What happens if you change POLICY_TRAIN_ITERS from 1 to 10 in REINFORCE with baseline? Explain your observations. [1 bonus point]
- Produce 3 graphs that show the performance of REINFORCE, REINFORCE with baseline and PPO on the mountain-car problem. Read a brief description of the Mountain-Car problem from OpenAI Gym. The goal is to move a car up a mountain. The mountain-car environment gives a reward of -1 at each time step until the goal is reached (a certain height on the mountain, where the reward is 0) or the episode terminates (after 200 steps). Running "python REINFORCE.py --mode=mountain_car" produces the graph for REINFORCE, which is saved in the "images" directory. For each graph, the y-axis corresponds to the cumulative rewards (averaged over the last 25 episodes) and the x-axis is the number of episodes. What do you notice about the performance of the algorithms on mountain-car versus cartpole? What is a possible explanation for the performance differences in terms of the details of the environments? [1 point]
- Produce 3 graphs that show the performance of REINFORCE, REINFORCE with baseline and PPO on a modified version of the mountain-car problem. This environment is the same as plain mountain-car, except that the reward function is changed so that the reward at each time step is the height of the car (illustrated by the sketch after this list). The goal is the same as plain mountain-car (i.e., to reach the top of the mountain). Running "python REINFORCE.py --mode=mountain_car_mod" produces the graph for REINFORCE, which is saved in the "images" directory. For each graph, the y-axis corresponds to the cumulative rewards (averaged over the last 25 episodes) and the x-axis is the number of episodes. Compare the performance of the algorithms on mountain-car versus the modified mountain-car. Specifically, compare the reward curves and the final height achieved by the agents. Explain your observations. [1 point]
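The starter code already implements the modified reward; purely to illustrate the kind of change described in the last bullet, a hypothetical Gym reward wrapper might look like the following (the height formula and the 4-tuple step API are assumptions and may differ from the starter code).

    import gym
    import numpy as np

    class HeightRewardWrapper(gym.Wrapper):
        """Replace MountainCar's -1 per-step reward with the car's current height (illustrative)."""
        def step(self, action):
            obs, _, done, info = self.env.step(action)    # assumes the older 4-tuple Gym step API
            position = obs[0]
            height = np.sin(3 * position) * 0.45 + 0.55   # classic Gym hill shape (assumption)
            return obs, height, done, info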
Assignment 3: out February 7, due February 18 (11:59 pm)
This assignment has two parts. [10 points total]
Part I - Partially Observable RL [5 points]
Starter code: a3_part1_starter_code.zip
In this part, you will implement the Deep Recurrent Q-Network (DRQN) algorithm. You will test your implementation on a modified version of the Cartpole domain: the Cartpole environment will be a POMDP in which the agent receives only a partial state (observation) of the environment instead of the full state. Start by downloading and running the implementation of DQN in PyTorch in the Cartpole domain; you should run the DQN.py file in the starter code. Implement the DRQN algorithm by filling in the functions in the template code provided in the file DRQN.py. You will need to modify the model specification in DRQN.py. Note that for this assignment you are supposed to use the partially observable environment API provided, which returns the partial state (observation) at each time step. You should not change this API to obtain the full state.
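For orientation only, an LSTM-based recurrent Q-network might be structured as sketched below; the observation dimension, layer sizes and names are assumptions, and DRQN.py defines the actual template you must fill in.

    import torch
    import torch.nn as nn

    class RecurrentQNetwork(nn.Module):
        """Illustrative DRQN body: an LSTM over partial observations, then a linear head to Q-values."""
        def __init__(self, obs_dim=2, n_actions=2, hidden=64):
            super().__init__()
            self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_actions)

        def forward(self, obs_seq, hidden_state=None):
            # obs_seq: (batch, time, obs_dim); hidden_state carries memory across time steps
            out, hidden_state = self.lstm(obs_seq, hidden_state)
            return self.head(out), hidden_state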
- Produce a graph that shows the performance of DQN on the Cartpole domain with epsilon-greedy exploration (annealed). This will serve as a baseline to compare DRQN against. Run the given code without any modification and plot a graph where the x-axis is the number of episodes (2000 episodes) and the y-axis is the reward obtained. You can use a running average to smooth the plot if you so desire. Report curves that are the average of multiple runs of DQN (with different random seeds).
- (1 point) Implement the DRQN model using an LSTM.
- (1 point) Adjust the replay buffer specification, policy function and (optionally) utility functions to handle hidden states.
- (1 point) Implement update_networks function in the DRQN.py file.
- (1 point) Produce a graph that shows the performance of the LSTM based DRQN on the Cartpole domain. Similar to the DQN case, the x-axis is the number of episodes (2000 episodes) and the y-axis is the reward obtained. You can use a running average to smooth the plot if you so desire. Report curves that are the average of multiple runs of DRQN (with different random seeds).
- (1 point) Based on the results, explain the impact of the LSTM layer. Compare the performance of DRQN and DQN, and explain the reasons for the observed difference in performance.
Submit the following material via LEARN:
- Your implementation of DRQN, which will be the modified version of the DRQN.py file.
- A report containing the plots pertaining to the DQN and DRQN with your explanation of the results.
Part II - Distributional RL [5 points]
Starter code: a3_part2_starter_code.zip
In this part, you will program the categorical (C51) distributional RL algorithm. The environment will be a modified version of the Cartpole domain. In this domain, noise sampled uniformly at random in the range (-0.05, 0.05) is added to the magnitude of the force applied to the pole every time an action is taken. In addition, the cart and the surface have a constant friction of 5 x 10^-4 units and the pole has an air drag (friction) of 2 x 10^-6 units. These changes turn the deterministic Cartpole environment into a stochastic environment. Start by downloading and running the provided implementation of DQN in Cartpole (run the DQN.py file). Then implement the C51 algorithm by modifying the functions in the C51.py file in the C51 folder.
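As a reminder of the core idea, a C51 network outputs, for each action, a probability distribution over a fixed set of support points (atoms), and the greedy policy maximizes the resulting expected value. The sketch below is illustrative only; the atom range, layer sizes and names are assumptions rather than the C51.py template.

    import torch
    import torch.nn as nn

    class CategoricalQNetwork(nn.Module):
        """Illustrative C51 head: per action, a softmax over n_atoms fixed support points."""
        def __init__(self, state_dim=4, n_actions=2, n_atoms=51, v_min=0.0, v_max=200.0):
            super().__init__()
            self.n_actions, self.n_atoms = n_actions, n_atoms
            self.register_buffer("support", torch.linspace(v_min, v_max, n_atoms))
            self.net = nn.Sequential(
                nn.Linear(state_dim, 128), nn.ReLU(),
                nn.Linear(128, n_actions * n_atoms),
            )

        def forward(self, state):
            logits = self.net(state).view(-1, self.n_actions, self.n_atoms)
            return torch.softmax(logits, dim=-1)       # p(z_i | s, a)

        def q_values(self, state):
            # Q(s, a) = sum_i z_i * p(z_i | s, a); the greedy policy picks argmax_a Q(s, a)
            return (self.forward(state) * self.support).sum(dim=-1)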
- Produce a graph that shows the performance of the given implementation of DQN on the Cartpole domain with epsilon-greedy exploration (annealed). This will serve as a baseline to compare the performance of the C51 algorithm against. Run the given code without any modification and plot a graph where the x-axis is the number of episodes (at least 200 episodes) and the y-axis is the reward obtained per episode. You can use a running average to smooth the plot if you so desire. Report curves that are the average of multiple runs of DQN (with different random seeds).
- (2 points) Implement the C51 algorithm by modifying the C51.py file. You will need to implement update_networks.
- (1 point) Adjust the policy function to use the Z network.
- (1 point) Produce a graph that shows the performance of the C51 algorithm on the Cartpole domain. Similar to the DQN case, the x-axis is the number of episodes (500 episodes) and the y-axis is the reward obtained per episode. You can use a running average to smooth the plot if you so desire. Report curves that are the average of multiple runs of C51 (with different random seeds).
- (1 point) Based on the results, comment on the differences in the performance of C51 and DQN. What do the results suggest about the respective algorithms?
Submit the following material via LEARN:
- Your implementation of C51, which will be a modified version of the C51.py file.
- A report containing the plots pertaining to C51 and DQN with your explanation of the results.