There will be three assignments, each worth 10% of the final grade. Assignments are done individually (i.e., no teams). Each assignment will have a programming part to be done in Python. Some assignments will make use of PyTorch, TensorFlow and OpenAI Gym. For GPU or TPU acceleration, feel free to use Google's Colaboratory environment. This is a free cloud service where you can run Python code (including TensorFlow and PyTorch, which are pre-installed) with GPU or TPU acceleration. A virtual machine with two CPUs and one GPU or TPU will run for up to 12 hours, after which it must be restarted. The following steps are recommended:
- Create a Python notebook in Google Colab
- Click on "edit", then "notebook settings" and select "None" (CPU), "GPU" or "TPU" for hardware acceleration.
The approximate out and due dates are:
- A1: out Sept 13, due Sept 24 (11:59 pm)
- A2: out Sept 27, due Oct 8 (11:59 pm)
- A3: out Oct 11, due Oct 22 (11:59 pm)
Your assignment should be submitted electronically via LEARN. You can make as many submissions as you wish; only your last submission will be marked. A late submission will incur a 2% penalty for every hour (rounded up) past the deadline. For example, an assignment submitted 5 hours and 15 minutes late will receive a penalty of ceiling(5.25) * 2% = 12%. Assignments submitted more than 50 hours late will not be marked.
Assignment 1: out Sept 13, due Sept 24 (11:59 pm)
This assignment has three parts. [10 points total]
Part I [4 points]
In the first part, you will program value iteration, policy iteration and modified policy iteration for Markov decision processes in Python. More specifically, fill in the functions in the skeleton code of the file MDP.py. The file TestMDP.py contains the simple MDP example from Lecture 2a Slides 13-14. You can verify that your code runs properly by executing "python TestMDP.py". Add print statements to this file to verify that the output of each function makes sense.
- Skeleton code: MDP.py
- Simple MDP from Lecture 2a Slides 13-14 to test your code: TestMDP.py
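To make the task concrete, here is a minimal, self-contained sketch of value iteration on a tiny MDP. It is illustrative only: the function names, argument conventions and transition/reward layouts in the MDP.py skeleton may differ, so follow the skeleton's docstrings.

    import numpy as np

    # Value iteration sketch (illustrative; not the MDP.py skeleton's exact API).
    # T[a, s, s'] = transition probability, R[a, s] = reward, discount in [0, 1).
    def value_iteration(T, R, discount, tol=0.01):
        n_actions, n_states, _ = T.shape
        V = np.zeros(n_states)                    # start from V = 0 for all states
        n_iter = 0
        while True:
            # Q[a, s] = R[a, s] + discount * sum_{s'} T[a, s, s'] * V[s']
            Q = R + discount * (T @ V)            # shape: (n_actions, n_states)
            V_new = Q.max(axis=0)                 # Bellman optimality backup
            n_iter += 1
            if np.max(np.abs(V_new - V)) <= tol:  # stop when the update is within tolerance
                return V_new, Q.argmax(axis=0), n_iter
            V = V_new

    # Tiny 2-state, 2-action example (made-up numbers, just to exercise the code).
    T = np.array([[[1.0, 0.0], [0.5, 0.5]],
                  [[0.0, 1.0], [0.0, 1.0]]])
    R = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
    print(value_iteration(T, R, discount=0.9))

Policy iteration and modified policy iteration follow the same pattern, with the Bellman optimality backup replaced by alternating policy evaluation and policy improvement steps.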
Submit the following material via LEARN:
- Your Python code.
- Test your code with the maze problem described in TestMDPmaze.py.
- Report the policy, value function and number of iterations needed by value iteration when using a tolerance of 0.01 and starting from a value function set to 0 for all states. [1 point]
- Report the policy, value function and number of iterations needed by policy iteration to find an optimal policy when starting from the policy that chooses action 0 in all states. [1 point]
- Report the number of iterations needed by modified policy iteration to converge when varying the number of iterations in partial policy evaluation from 1 to 10. Use a tolerance of 0.01, start with the policy that chooses action 0 in all states and start with the value function that assigns 0 to all states. [1 point] Discuss the impact of the number of iterations in partial policy evaluation on the results and relate the results to value iteration and policy iteration. [1 point]
Part II [3 points]
In the second part, you will program the Q-learning algorithm in Python. More specifically, fill in the functions in the skeleton code of the file RL.py. This file requires the file MDP.py that you programmed for Part I, so make sure to include it in the same directory. The file TestRL.py contains a simple RL problem to test your functions (i.e., the output of each function will be printed to the screen). You can verify that your code runs properly by executing "python TestRL.py".
- Skeleton code: RL.py (requires MDP.py from part I)
- Simple RL problem to test your code: TestRL.py
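As a reference point, the core of tabular Q-learning with epsilon-greedy exploration looks roughly like the sketch below. This is illustrative only; the RL.py skeleton's function signatures, Q-table layout and learning-rate schedule may differ, and sample_next is a hypothetical helper standing in for sampling the environment.

    import numpy as np

    # One Q-learning episode with epsilon-greedy exploration (sketch, not RL.py's exact API).
    # Q has shape (n_states, n_actions); sample_next(s, a) returns (reward, next_state).
    def q_learning_episode(Q, sample_next, epsilon, discount, lr, s0, n_steps=100):
        n_states, n_actions = Q.shape
        s, total, discount_t = s0, 0.0, 1.0
        for _ in range(n_steps):
            if np.random.rand() < epsilon:             # explore
                a = np.random.randint(n_actions)
            else:                                      # exploit
                a = int(np.argmax(Q[s]))
            r, s_next = sample_next(s, a)
            target = r + discount * np.max(Q[s_next])  # TD target uses the greedy next value
            Q[s, a] += lr * (target - Q[s, a])
            total += discount_t * r                    # cumulative discounted reward
            discount_t *= discount
            s = s_next
        return total

Averaging the cumulative discounted reward over 100 independent trials (reinitializing Q to zero for each trial) produces the curves requested below.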
Submit the following material via LEARN:
- Your Python code.
- Test your code with the maze problem described in TestRLmaze.py (same maze problem as in Part I) [1 point]. Produce a graph where the x-axis indicates the episode # (from 0 to 200) and the y-axis indicates the average (based on 100 trials) of the cumulative discounted rewards per episode (100 steps). The graph should contain 4 curves corresponding to the exploration probability epsilon=0.05, 0.1, 0.3 and 0.5 [1 point]. The initial state is 0 and the initial Q-function is 0 for all state-action pairs. Explain the impact of the exploration probability epsilon on the cumulative discounted rewards per episode earned during training as well as the resulting Q-values and policy. [1 point]
Part III [3 points]
In the third part, you will train a deep Q-network to solve the CartPole problem from OpenAI Gym. This problem has continuous states that prevent the use of a tabular representation. Instead, you will use a neural network to represent the Q-function. Follow these steps to get started:
- Get familiar with the CartPole problem. Read a brief description of the CartPole problem from OpenAI Gym.
- For this part of the assignment, you will use the TF-Agents library from TensorFlow. Watch the following video for an introduction.
- Run the code in the following tutorial to solve the CartPole problem with a Deep Q-Network:
To avoid trouble installing the necessary software, it is recommended to run the tutorial's Colab notebook in Google Colab (where all the necessary libraries are available). If you are not familiar with Colab, just create an account at Google Colab and follow the tutorials. Google Colab is convenient for running Python code without the hassle of installing libraries and for leveraging a GPU or TPU. Note, however, that the code in the above tutorial is simple and therefore will not be accelerated by a GPU or TPU.
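The two quantities varied in the questions below, the target-network update period and the replay mini-batch size, appear in a generic DQN training loop roughly as in the sketch that follows. This is plain PyTorch for conceptual orientation, not the TF-Agents tutorial code; in the tutorial, these knobs correspond roughly to the DqnAgent's target_update_period argument and the batch size used when sampling from the replay buffer.

    import random
    from collections import deque

    import torch
    import torch.nn as nn

    # Conceptual DQN sketch (not the TF-Agents tutorial code). Environment interaction
    # is omitted; transitions are assumed to be stored in the replay buffer as
    # (state, action, reward, next_state, done) with states as lists of floats.
    def make_q_net(n_obs=4, n_actions=2):
        return nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))

    q_net = make_q_net()
    target_net = make_q_net()
    target_net.load_state_dict(q_net.state_dict())   # target starts as a copy of the online net
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    replay = deque(maxlen=100_000)                   # replay buffer

    target_update_period = 10   # varied in the first question below (1, 10, 30, 100)
    batch_size = 30             # varied in the second question below (1, 30, 50, 200)
    gamma = 0.99

    def train_step(iteration):
        if len(replay) < batch_size:
            return
        batch = random.sample(replay, batch_size)    # sample a mini-batch of experiences
        s, a, r, s2, done = zip(*batch)
        s = torch.tensor(s, dtype=torch.float32)
        a = torch.tensor(a, dtype=torch.int64)
        r = torch.tensor(r, dtype=torch.float32)
        s2 = torch.tensor(s2, dtype=torch.float32)
        done = torch.tensor(done, dtype=torch.float32)
        q = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)
        with torch.no_grad():                        # bootstrap from the frozen target network
            target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
        loss = nn.functional.smooth_l1_loss(q, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if iteration % target_update_period == 0:    # periodic hard update of the target
            target_net.load_state_dict(q_net.state_dict())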
Submit the following material via LEARN:
- Modify and run the code in the CartPole DQN tutorial to produce a graph where the y-axis is the average cumulative reward of 10 episodes every 1000 iterations and the x-axis is the # of iterations up to a maximum of 20000 iterations. The graph should contain 4 curves corresponding to updating the target network every 1, 10 (default), 30, 100 iteration(s). To reduce stochasticity in the results, feel free to report curves that are the average of multiple runs of the tutorial code. Based on the results, explain the impact of the target network and relate the target network to value iteration. [1 point]
- Modify and run the code in the CartPole DQN tutorial to produce a graph where the y-axis is the average cumulative reward of 10 episodes every 1000 iterations and the x-axis is the # of iterations up to a maximum of 20000 iterations. The graph should contain 4 curves corresponding to sampling mini-batches of 1, 30 (default), 50 and 200 experience(s) from the replay buffer. To reduce stochasticity in the results, feel free to report curves that are the average of multiple runs of the tutorial code [1 point]. Based on the results, explain the impact of the replay buffer and relate the replay buffer to exact gradient descent. [1 point]
Assignment 2: out September 27, due October 8 (11:59 pm)
This assignment has two parts. [10 points total + 1 bonus point]
Part I [6 points]
In the first part, you will program three bandit algorithms (epsilon-greedy, Thompson sampling and UCB) and two RL algorithms (REINFORCE and model-based RL) in Python. More specifically, fill in the functions in the skeleton code of the file RL2.py. This file requires the file MDP.py that you programmed in Assignment 1, so make sure to include it in the same directory. The file TestRL2.py contains a simple bandit problem and a simple RL problem to test your functions (i.e., the output of each function will be printed to the screen). You can verify that your code runs properly by executing "python TestRL2.py".
- Skeleton code: RL2.py (requires MDP.py from assignment 1)
- Simple bandit and RL problems to test your code: TestRL2.py
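For reference, the REINFORCE update for a tabular softmax policy (one parameter per state-action pair, which matches the initialization described below) looks roughly like this sketch. It is illustrative only; the RL2.py skeleton's conventions may differ.

    import numpy as np

    # REINFORCE update for a tabular softmax policy (sketch; not RL2.py's exact API).
    # theta has shape (n_states, n_actions); episode is a list of (state, action, reward).
    def softmax_policy(theta, s):
        z = theta[s] - np.max(theta[s])               # numerically stable softmax over actions
        p = np.exp(z)
        return p / p.sum()

    def reinforce_update(theta, episode, discount, alpha=0.01):
        T = len(episode)
        rewards = np.array([r for (_, _, r) in episode])
        discounts = discount ** np.arange(T)
        for t, (s, a, _) in enumerate(episode):
            G_t = np.sum(discounts[:T - t] * rewards[t:])   # discounted return from step t
            pi = softmax_policy(theta, s)
            grad_log = -pi                            # d log pi(a|s) / d theta[s, :]
            grad_log[a] += 1.0
            theta[s] += alpha * (discount ** t) * G_t * grad_log
        return theta

    # Example: one update on a made-up 2-step episode in a 3-state, 2-action problem.
    theta = np.random.rand(3, 2)
    theta = reinforce_update(theta, [(0, 1, 1.0), (2, 0, 0.0)], discount=0.95)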
Submit the following material via LEARN:
- Your Python code.
- Test your code for REINFORCE and model-based RL with the maze problem described in TestRL2Maze.py (same maze problem as in Assignment 1). Produce a graph where the x-axis indicates the episode # (from 0 to 200) and the y-axis indicates the average (based on 100 trials) of the cumulative discounted rewards per episode (100 steps) when each episode starts in state 0. The graph should contain 3 curves corresponding to
- REINFORCE (alpha=0.01, the initial policy parameters are chosen uniformly at random for each state-action pair) [1 point]
- Model-based RL (epsilon=0.05, the default transition function is uniform when a state-action pair has not been visited and the initial expected reward is 0 for all state-action pairs) [1 point]
- Q-learning (epsilon=0.05 and the initial Q-function is 0 for all state-action pairs)
Explain the results. Discuss how different properties of each algorithm influence the cumulative discounted rewards per episode earned during training as well as the resulting Q-values and policy. [1 point]
- Test your code for UCB, epsilon-greedy bandit and Thompson sampling with the simple multi-armed bandit problem described in TestBandit.py (a minimal sketch of the Thompson sampling update appears after this list). Produce a graph where the x-axis indicates the iteration # (from 0 to 200) and the y-axis indicates the average (based on 1000 trials) of the reward earned at each iteration. The graph should contain 3 curves corresponding to
- UCB [0.5 point]
- Epsilon-greedy bandit (epsilon = 1 / # iterations) [0.5 point]
- Thompson sampling (k=1 and prior consists of Beta distributions with all hyperparameters set to 1) [1 point]
Explain the results. Are the results surprising? Do they match what is expected based on the theory? [1 point]
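As referenced above, here is a minimal sketch of Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors, drawing one posterior sample per arm (presumably what k=1 refers to). It is illustrative only; the RL2.py skeleton and TestBandit.py may organize the reward sampling and hyperparameters differently.

    import numpy as np

    # Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors
    # (sketch; not RL2.py's exact API). sample_reward(a) is a hypothetical helper
    # returning a 0/1 reward for pulling arm a.
    def thompson_sampling(n_arms, sample_reward, n_iterations=200):
        alpha = np.ones(n_arms)                      # Beta parameters: 1 + #successes
        beta = np.ones(n_arms)                       # Beta parameters: 1 + #failures
        rewards = np.zeros(n_iterations)
        for t in range(n_iterations):
            samples = np.random.beta(alpha, beta)    # one posterior sample per arm
            a = int(np.argmax(samples))              # pull the arm with the highest sample
            r = sample_reward(a)
            alpha[a] += r                            # Bernoulli posterior update
            beta[a] += 1 - r
            rewards[t] = r
        return rewards

    # Example with made-up arm probabilities; averaging over many runs gives the requested curve.
    true_p = np.array([0.3, 0.5, 0.7])
    print(thompson_sampling(3, lambda a: np.random.binomial(1, true_p[a])).mean())

UCB and epsilon-greedy follow the same loop structure, differing only in how the arm is selected at each iteration.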
Part II [4 points + 1 bonus point]
In the second part of the assignment, you will program the Soft Q-Learning and Soft Actor Critic algorithms (see the module on Maximum Entropy Reinforcement Learning). You will test your implementation on the cartpole problem and the pendulum problem (bonus). Start by downloading and running the following implementation of DQN in PyTorch. Then modify this DQN implementation to obtain Soft Q-Learning and Soft Actor Critic. Finally, for bonus marks, you may extend your Soft Actor Critic implementation to work with continuous actions and test it on the Pendulum environment.
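The key change from DQN to Soft Q-Learning is the bootstrapped target: instead of max_a Q_target(s', a), it uses the soft value V(s') = lambda * log sum_a exp(Q_target(s', a) / lambda), where lambda is the temperature, and exploration samples actions from the corresponding softmax policy. A minimal PyTorch sketch of these two pieces is given below; it assumes the provided DQN code exposes the target network's next-state Q-values as a tensor of shape (batch, n_actions), so adapt the names to the actual implementation.

    import torch

    # Soft Q-Learning target (sketch): replaces max_a Q_target(s', a) in the DQN target
    # with the soft value  V(s') = lambda * logsumexp(Q_target(s', .) / lambda).
    def soft_q_target(rewards, q_next, done, gamma=0.99, temperature=10.0):
        # rewards, done: (batch,)   q_next: (batch, n_actions) from the target network
        soft_value = temperature * torch.logsumexp(q_next / temperature, dim=1)
        return rewards + gamma * (1.0 - done) * soft_value

    # Softmax exploration policy: sample actions with probability proportional to exp(Q / lambda).
    def softmax_action(q_values, temperature=10.0):
        probs = torch.softmax(q_values / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1).squeeze(-1)

Soft Actor Critic adds a separate policy network trained against the soft Q-function; sampling actions from that network provides the stochastic exploration described in the items below.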
Submit the following material via LEARN:
- Your Python code.
- Produce a graph that shows the performance of DQN on the cartpole problem with epsilon-greedy exploration. This will serve as a baseline to compare Soft Q-Learning and Soft Actor Critic. Run the code above without any modification; it should automatically produce a graph where the x-axis is the number of episodes (up to 300 episodes) and the y-axis is the number of steps before the pole falls.
- Produce a graph with three curves that respectively show the performance of DQN, Soft Q-learning, and Soft Actor Critic on the cartpole problem, across 5 trials. Use a constant temperature of 10. In your implementation of Soft Q-Learning, do exploration by sampling from the softmax policy. In the implementation of Soft Actor Critic, do exploration by sampling actions from the policy network, which encodes a stochastic actor. The three curves should show the number of steps before the pole falls in each episode. [1 point]
- Produce a graph with four curves that show the performance of Soft Q-Learning on the cartpole problem for 4 different values of the temperature lambda: 1, 10 (default), 100, 1000, across 5 trials. In your implementation of Soft Q-Learning, do exploration by sampling from the softmax policy. The four curves should show the number of steps before the pole falls in each episode. [1 point]
- Produce a graph with four curves that show the performance of Soft Actor Critic on the cartpole problem for 4 different values of the temperature lambda: 1, 10 (default), 100, 1000, across 5 trials. In your implementation of Soft Actor Critic, do exploration by sampling actions from the policy network, which encodes a stochastic actor. The four curves should show the number of steps before the pole falls in each episode. [1 point]
- Explain the results. Discuss how different properties of each algorithm influence the number of steps before the pole falls. Discuss also the impact of the temperature lambda. [1 point]
- (Bonus) Produce a graph that shows the performance of Soft Actor Critic on the Pendulum environment. Note that Pendulum is an environment with a continuous action space, so you may refer to Soft Actor-Critic Algorithms and Applications for key equations and the respective algorithm. The curve should show the average episodic reward (y-axis, averaged across 5 trials) against the number of episodes (x-axis). [1 bonus point]
Assignment 3: out October 12, due October 27 (11:59 pm)
This assignment has three parts. [10 points total]
Part I - Partially Observable RL [3 points]
Starter code: cs885_fall21_a3_part1.zip
In this part, you will implement the Deep Recurrent Q-Learning (DRQN) algorithm. You will test your implementation on a modified version of the Cartpole domain. The Cartpole environment will be a POMDP where the agent receives only a partial state (observation) of the environment instead of the full state. Start by downloading and running the implementation of DQN in PyTorch in the Cartpole domain (make sure that all the packages in the given requirements.txt file are installed). You should run the train.py file in the DQN folder. Since this is a partially observable domain, this DQN uses a stack of the last 4 observations as an approximation to its state; the stack of 4 observations helps the agent learn the temporal dependencies between the states. Now implement the DRQN algorithm by filling in the functions in the template code of the DRQN folder. You need to modify the model.py file in the DRQN folder. Note that for this assignment, you are supposed to use the partially observable environment API provided, which returns the partial state (observation) at each time step. You should not change this API to obtain the full state.
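For orientation, an LSTM-based Q-network typically feeds each partial observation through torch.nn.LSTM and maps the LSTM output to Q-values with a linear layer, carrying the hidden state across time steps within an episode. The sketch below uses made-up dimensions and is illustrative only; the interface expected by the template's model.py (input shapes, how the hidden state is passed around, how sequences are batched from the replay buffer) may differ, so follow the template.

    import torch
    import torch.nn as nn

    # Sketch of an LSTM-based Q-network for DRQN (illustrative; not the template's exact interface).
    class DRQNNet(nn.Module):
        def __init__(self, obs_dim, n_actions, hidden_size=64):
            super().__init__()
            self.lstm = nn.LSTM(input_size=obs_dim, hidden_size=hidden_size, batch_first=True)
            self.head = nn.Linear(hidden_size, n_actions)

        def forward(self, obs_seq, hidden=None):
            # obs_seq: (batch, seq_len, obs_dim); hidden: optional (h0, c0) carried across steps
            out, hidden = self.lstm(obs_seq, hidden)  # out: (batch, seq_len, hidden_size)
            return self.head(out), hidden             # Q-values: (batch, seq_len, n_actions)

    # Example with made-up dimensions: Q-values for one observation at a time.
    net = DRQNNet(obs_dim=2, n_actions=2)
    q, h = net(torch.zeros(1, 1, 2))                  # start of an episode (hidden defaults to zeros)
    q, h = net(torch.zeros(1, 1, 2), h)               # next step reuses the hidden state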
- Produce a graph that shows the performance of DQN on the Cartpole domain with epsilon greedy exploration. This will serve as a baseline to compare DRQN. Run the given code without any modification and plot a graph where the x-axis is the number of episodes (at least 5000 episodes) and the y-axis is the reward obtained. You could choose to use a running average to smooth the plot if you so desire. To reduce stochasticity in the results, feel free to report curves that are the average of multiple runs of DQN (with different random seeds).
- (1 point) Now implement the DRQN algorithm by modifying the model.py file in the given DRQN folder. Implement your DRQN using LSTM (use the torch.nn.LSTM function to implement the LSTM layer(s)).
- (1 point) Produce a graph that shows the performance of the LSTM based DRQN on the Cartpole domain. Similar to the DQN case, the x-axis is the number of episodes (at least 5000 episodes) and the y-axis is the reward obtained. You could choose to use a running average to smooth the plot if you so desire. To reduce stochasticity in the results, feel free to report curves that are the average of multiple runs of DRQN (with different random seeds).
- (1 point) Based on the results, explain the impact of the LSTM layer. Compare the performance of DRQN with that of DQN using a stack of 4 consecutive observations, and explain the reasons for the observed difference in performance.
- (Optional) You could choose to change the hyperparameters defined in the config.py file of the DRQN folder for better performance. You could also add new parameters to this file if you need. If you choose to do this, make sure to submit your modified config.py file as well. Also indicate the values for the hyperparameters in your report.
Submit the following material via LEARN:
- Your implementation of DRQN, which will be the modified version of the model.py script in the DRQN folder.
- A report containing the plots pertaining to the DQN and DRQN with your explanation of the results.
- (Optional) If you choose to change any hyperparameter values, you should submit the config.py file associated with your implementation of DRQN as well.
Part II - Imitation Learning [3 points]
Starter code: cs885_fall21_a3_part2.zip
Task Description. In this part of the assignment, you will program a Generative Adversarial Imitation Learning (GAIL) algorithm. Inside GAIL, update the policy parameters with a deterministic policy gradient update technique instead of Trust Region Policy Optimization (TRPO), since we are using a deterministic policy for this part. The value of lambda in the pseudocode of GAIL should be set to 0 for this assignment. To be more specific, fill in the function train() in the class GAIL(). You will find an example implementation of Behavior Cloning (BC) as discussed in the lectures. It is suggested to start from this example as you work on your implementation.
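For reference, the discriminator in GAIL is a binary classifier over (state, action) pairs, trained to distinguish expert pairs from pairs generated by the current policy, while the policy is rewarded for fooling it. The sketch below shows only this discriminator update in PyTorch under assumed shapes; the deterministic policy gradient update of the actor and the interface of the GAIL() class in the starter code are left to your implementation.

    import torch
    import torch.nn as nn

    # GAIL discriminator update (sketch; assumed shapes, not the starter code's interface).
    # D(s, a) -> probability that (s, a) comes from the expert.
    class Discriminator(nn.Module):
        def __init__(self, state_dim, action_dim, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, 1))                 # outputs logits; sigmoid applied by the loss

        def forward(self, states, actions):
            return self.net(torch.cat([states, actions], dim=1)).squeeze(1)

    def discriminator_step(disc, opt, expert_s, expert_a, policy_s, policy_a):
        # policy_s / policy_a are assumed to be detached from the policy's computation graph.
        bce = nn.BCEWithLogitsLoss()
        expert_logits = disc(expert_s, expert_a)
        policy_logits = disc(policy_s, policy_a)
        loss = bce(expert_logits, torch.ones_like(expert_logits)) + \
               bce(policy_logits, torch.zeros_like(policy_logits))   # expert = 1, policy = 0
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

The policy's surrogate reward is then typically -log(1 - D(s, a)) (or log D(s, a)), which the actor maximizes; in this part, that maximization uses a deterministic policy gradient update rather than TRPO.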
Running Requirement.
This code uses the BipedalWalker-v2 environment from OpenAI Gym. The expert trajectories can be found in expert_traj/BipedalWalker-v2, and the required Python packages can be found in the file requirement.txt. Make sure to install these packages before running the code. Note that some versions of packages needed for this part are different from the previous parts. All the configurations are defined in the function init_config(). Please use the default model parameters during training, but if you are not using a GPU, set --device to cpu (for example: python irl_BipedalWalker.py --device cpu).
- Produce two plots that record the average rewards (running average of 10) and training loss against the number of training steps (at least 50000) for the BC algorithm. To reduce stochasticity in the results, feel free to report curves that are the average of multiple runs of BC (with different random seeds).
- (1 point) Fill in the function train() in the class GAIL() with a deterministic policy gradient update technique. Use the value of lambda=0 in the pseudocode of GAIL.
- (1 point) Produce three plots that record the average testing rewards (running average of 10), discriminator loss, and actor loss against the number of training steps (at least 50000) for the GAIL algorithm.
- (1 point) Explain the results by comparing the performance of BC and GAIL. Describe how their losses and average rewards behave during training and explain their behaviour. Focus on comparing the difference between adversarial learning and supervised learning.
Submit the following material via LEARN:
- Your implementation of GAIL, which will be the modified version of the irl_BipedalWalker.py script.
- A report containing the plots pertaining to BC and GAIL with your explanation of the results.
Part III - Distributional RL [4 points]
Starter code: cs885_fall21_a3_part3.zip
In this part, you will program the categorical (C51) distributional RL algorithm. The environment will be a modified version of the Cartpole domain. In this domain, noise sampled uniformly at random from the range (-0.05, 0.05) is added to the magnitude of the force applied to the pole every time an action is taken. In addition, the cart and the surface have a constant friction of 5 x 10^-4 units and the pole has an air drag (friction) of 2 x 10^-6 units. These changes turn the deterministic Cartpole environment into a stochastic environment. Start by downloading and running the provided implementation of DQN in Cartpole (run the dqn.py file in the DQN folder). Now implement the C51 algorithm by modifying the functions in the c51.py file in the C51 folder. Specifically, you will need to implement the functions train_networks and compute_targets.
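The heart of compute_targets in C51 is the projection of the shifted distribution r + gamma * z onto the fixed support of 51 atoms. A minimal NumPy sketch of that projection is given below; it is illustrative only, since the provided c51.py defines its own support bounds, shapes and variable names.

    import numpy as np

    # Categorical (C51) projection of the Bellman-updated distribution onto a fixed support
    # (sketch; the provided c51.py's shapes, support bounds and variable names may differ).
    def project_distribution(rewards, dones, next_probs, v_min=-10.0, v_max=10.0,
                             n_atoms=51, gamma=0.99):
        # rewards, dones: (batch,)   next_probs: (batch, n_atoms) for the chosen next actions
        batch_size = rewards.shape[0]
        delta_z = (v_max - v_min) / (n_atoms - 1)
        z = v_min + delta_z * np.arange(n_atoms)          # fixed support z_0 .. z_{n_atoms-1}
        # Bellman update of each atom, clipped to the support's range.
        tz = np.clip(rewards[:, None] + gamma * (1.0 - dones[:, None]) * z[None, :], v_min, v_max)
        b = (tz - v_min) / delta_z                        # fractional atom index
        lower = np.floor(b).astype(int)
        upper = np.ceil(b).astype(int)
        target = np.zeros((batch_size, n_atoms))
        for i in range(batch_size):
            for j in range(n_atoms):
                if lower[i, j] == upper[i, j]:            # landed exactly on an atom
                    target[i, lower[i, j]] += next_probs[i, j]
                else:                                     # split the mass between neighbours
                    target[i, lower[i, j]] += next_probs[i, j] * (upper[i, j] - b[i, j])
                    target[i, upper[i, j]] += next_probs[i, j] * (b[i, j] - lower[i, j])
        return target    # used as the cross-entropy target for the predicted distribution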
- Produce a graph that shows the performance of the given implementation of DQN on the Cartpole domain with epsilon greedy exploration. This will serve as a baseline to compare performance of the C51 algorithm. Run the given code without any modification and plot a graph where the x-axis is the number of episodes (at least 200 episodes) and the y-axis is the reward obtained per episode. Before running the code, make sure to install all the Python libraries required as reflected in the requirements.txt file. Also install the new (modified) Cartpole environment by running "pip install -e .". You can choose to use a running average to smooth the plot if you so desire. To reduce stochasticity in the results, feel free to report curves that are the average of multiple runs of the DQN (with different random seeds).
- (2 points) Now implement the C51 algorithm by modifying the c51.py file. You will need to implement two functions train_networks and compute_targets.
- (1 point) Produce a graph that shows the performance of the C51 algorithm on the Cartpole domain. Similar to the DQN case, the x-axis is the number of episodes (at least 200 episodes) and the y-axis is the reward obtained per episode. You could choose to use a running average to smooth the plot if you so desire. To reduce stochasticity in the results, feel free to report curves that are the average of multiple runs of C51 (with different random seeds).
- (1 point) Based on the results, comment on the differences in the performance of C51 and DQN. Do the results say anything about the respective algorithms?
Submit the following material via LEARN:
- Your implementation of C51, which will be a modified version of the c51.py file.
- A report containing the plots pertaining to C51 and DQN with your explanation of the results.