There will be three assignments, each worth 10% of the final grade. Assignments are done individually (i.e., no teams). Each assignment will have a programming part to be done in Python. Some assignments will make use of PyTorch, TensorFlow and OpenAI Gym. For GPU or TPU acceleration, feel free to use Google's Colaboratory environment, a free cloud service where you can run Python code (including TensorFlow and PyTorch, which are pre-installed) with GPU or TPU acceleration. A virtual machine with two CPUs and one GPU or TPU will run for up to 12 hours, after which it must be restarted. The following steps are recommended:

- Create a Python notebook in Google Colab
- Click on "Edit", then "Notebook settings", and select "None" (CPU), "GPU" or "TPU" for hardware acceleration.

The approximate out and due dates are:

- A1: out May 18, due May 29 (11:59 pm)
- A2: out June 8, due June 24 (11:59 pm)
- A3: out June 29, due July 10 (11:59 pm)

Your assignment should be submitted electronically via Crowdmark by the due date. You can make as many submissions as you wish up to the deadline. If you made a submission before the deadline, you won't be able to make further submissions after the deadline. If you did not make any submission before the deadline, you can make one late submission after the deadline. A late submission incurs a 2% penalty for every rounded-up hour past the deadline. For example, an assignment submitted 5 hours and 15 minutes late will receive a penalty of ceiling(5.25) * 2% = 12%. Assignments submitted more than 50 hours late will not be marked.

Assignment 1 has three parts.

In the first part, you will program value iteration, policy iteration and modified policy iteration for Markov decision processes in Python. More specifically, fill in the functions in the skeleton code of the file MDP.py. The file TestMDP.py contains the simple MDP example from Lecture 2a Slides 13-14. You can verify that your code runs properly with TestMDP.py by running "python TestMDP.py". Add print statements to this file to verify that the output of each function makes sense.

- Skeleton code: MDP.py
- Simple MDP from Lecture 2a Slides 13-14 to test your code: TestMDP.py
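The exact interfaces you must implement are fixed by the skeleton in MDP.py. As a rough sketch of the algorithm only (assuming, hypothetically, that the MDP is given as a transition array T of shape |A| x |S| x |S|, a reward array R of shape |A| x |S|, and a discount factor), value iteration repeatedly applies the Bellman optimality backup until successive value functions differ by at most the tolerance:

```python
import numpy as np

def value_iteration(T, R, discount, tol=0.01, V=None):
    """Illustrative value iteration (not the required MDP.py interface).

    T: |A| x |S| x |S| transition probabilities, R: |A| x |S| rewards.
    Returns the value function, a greedy policy, and the iteration count.
    """
    n_actions, n_states, _ = T.shape
    V = np.zeros(n_states) if V is None else V
    n_iter = 0
    while True:
        # Bellman backup: Q[a, s] = R[a, s] + discount * sum_s' T[a, s, s'] V[s']
        Q = R + discount * (T @ V)
        V_new = Q.max(axis=0)
        n_iter += 1
        if np.max(np.abs(V_new - V)) <= tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=0)
    return V, policy, n_iter
```

On a two-state MDP where action 1 deterministically yields reward 1 forever, the values converge near 1 / (1 - discount), which is a quick sanity check before moving to the maze problem.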

**Submit the following material via Crowdmark:**

- Your Python code.
- Test your code with the maze problem described in TestMDPmaze.py.
- Report the policy, value function and number of iterations needed by value iteration when using a tolerance of 0.01 and starting from a value function set to 0 for all states.
- Report the policy, value function and number of iterations needed by policy iteration to find an optimal policy when starting from the policy that chooses action 0 in all states.
- Report the number of iterations needed by modified policy iteration to converge when varying the number of iterations in partial policy evaluation from 1 to 10. Use a tolerance of 0.01, start with the policy that chooses action 0 in all states and start with the value function that assigns 0 to all states. Discuss the impact of the number of iterations in partial policy evaluation on the results and relate the results to value iteration and policy iteration.
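The "partial policy evaluation" referred to above amounts to a fixed number of Bellman backups under the current policy. A hedged sketch, again assuming the hypothetical T and R arrays rather than the required MDP.py signatures:

```python
import numpy as np

def partial_policy_evaluation(policy, T, R, discount, V, n_eval):
    """Apply n_eval Bellman backups under a fixed policy (illustrative only).

    With n_eval=1, modified policy iteration behaves like value iteration;
    as n_eval grows, it approaches full policy iteration.
    """
    n_states = V.shape[0]
    # Transition matrix and reward vector induced by the policy
    T_pi = T[policy, np.arange(n_states), :]   # |S| x |S|
    R_pi = R[policy, np.arange(n_states)]      # |S|
    for _ in range(n_eval):
        V = R_pi + discount * (T_pi @ V)
    return V
```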

In the second part, you will program the Q-learning algorithm in Python. More specifically, fill in the functions in the skeleton code of the file RL.py. This file requires the file MDP.py that you programmed for Part I, so make sure to include it in the same directory. The file TestRL.py contains a simple RL problem to test your functions (i.e., the output of each function will be printed to the screen). You can verify that your code runs properly by running "python TestRL.py".
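The core of tabular Q-learning is the temporal-difference update paired with epsilon-greedy exploration. A minimal sketch of those two pieces (the actual function signatures are fixed by RL.py, so treat the names below as hypothetical):

```python
import random

def q_learning_update(Q, s, a, reward, s_next, alpha, discount):
    """One tabular Q-learning backup: move Q[s][a] toward the TD target."""
    td_target = reward + discount * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])

def epsilon_greedy(Q, s, epsilon, n_actions):
    """With probability epsilon explore uniformly; otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])
```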

**Submit the following material via Crowdmark:**

- Your Python code.
- Test your code with the maze problem described in TestRLmaze.py (same maze problem as in Part I). Produce a graph where the x-axis indicates the episode # (from 0 to 200) and the y-axis indicates the average (based on 100 trials) of the cumulative discounted rewards per episode (100 steps). The graph should contain 4 curves corresponding to the exploration probability epsilon=0.05, 0.1, 0.3 and 0.5. The initial state is 0 and the initial Q-function is 0 for all state-action pairs. Explain the impact of the exploration probability epsilon on the cumulative discounted rewards per episode earned during training as well as the resulting Q-values and policy.
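For the graph above, the quantity on the y-axis is the discounted return of an episode, sum over t of discount^t * r_t, averaged across trials. A small helper sketch (hypothetical names, not part of the skeleton):

```python
def discounted_return(rewards, discount):
    """Cumulative discounted reward of one episode: sum_t discount^t * r_t."""
    total = 0.0
    for r in reversed(rewards):  # fold backwards: G_t = r_t + discount * G_{t+1}
        total = r + discount * total
    return total

def average_per_episode(trials):
    """Average each episode's return across trials.

    trials: list of lists, trials[i][e] = return of episode e in trial i.
    """
    n_trials = len(trials)
    n_episodes = len(trials[0])
    return [sum(t[e] for t in trials) / n_trials for e in range(n_episodes)]
```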

In the third part, you will train a deep Q-network to solve the CartPole problem from OpenAI Gym. This problem has continuous states, which prevent the use of a tabular representation. Instead, you will use a neural network to represent the Q-function. Follow these steps to get started:

- Get familiar with the CartPole problem. Read a brief description of the CartPole problem from OpenAI Gym.
- For this part of the assignment, you will use the TF-Agents library from TensorFlow. Watch the following video for an introduction.
- Run the code in the following tutorial to solve the CartPole problem with a Deep Q-Network:
- Python script: cs885_dqn_tutorial.py
- Colab Notebook: cs885_dqn_tutorial.ipynb
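Before running the tutorial, it helps to see the target-network mechanic that the first question below probes: the target network is a frozen copy of the Q-network's parameters, refreshed every N iterations, so that the regression targets r + discount * max_a Q_target(s', a) stay fixed between refreshes. A framework-free sketch of that schedule (real DQN code would copy network tensors rather than a plain list):

```python
import copy

class TargetSync:
    """Keep a frozen copy of Q-network parameters, refreshed every `period` steps."""

    def __init__(self, params, period):
        self.period = period
        self.step_count = 0
        self.target_params = copy.deepcopy(params)

    def step(self, params):
        self.step_count += 1
        if self.step_count % self.period == 0:
            # Between syncs the targets are held fixed, which is what makes
            # each refresh loosely resemble one value-iteration backup.
            self.target_params = copy.deepcopy(params)
        return self.target_params
```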

**Submit the following material via Crowdmark:**

- Modify and run the code in the CartPole DQN tutorial to produce a graph where the y-axis is the average cumulative reward of 10 episodes every 1000 iterations and the x-axis is the # of iterations up to a maximum of 20000 iterations. The graph should contain 4 curves corresponding to updating the target network every 1, 10 (default), 30, 100 iteration(s). To reduce stochasticity in the results, feel free to report curves that are the average of multiple runs of the tutorial code. Based on the results, explain the impact of the target network and relate the target network to value iteration.
- Modify and run the code in the CartPole DQN tutorial to produce a graph where the y-axis is the average cumulative reward of 10 episodes every 1000 iterations and the x-axis is the # of iterations up to a maximum of 20000 iterations. The graph should contain 4 curves corresponding to sampling mini-batches of 1, 30 (default), 50 and 200 experience(s) from the replay buffer. To reduce stochasticity in the results, feel free to report curves that are the average of multiple runs of the tutorial code. Based on the results, explain the impact of the replay buffer and relate the replay buffer to exact gradient descent.

Assignment 2 has two parts.

In the first part, you will program three bandit algorithms (epsilon-greedy, Thompson sampling and UCB) and two RL algorithms (REINFORCE and model-based RL) in Python. More specifically, fill in the functions in the skeleton code of the file RL2.py. This file requires the file MDP.py that you programmed in Assignment 1, so make sure to include it in the same directory. The file TestRL2.py contains a simple bandit and a simple RL problem to test your functions (i.e., the output of each function will be printed to the screen). You can verify that your code runs properly by running "python TestRL2.py".

- Skeleton code: RL2.py (requires MDP.py from assignment 1)
- Simple RL problem to test your code: TestRL2.py
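As a reminder of the REINFORCE update you will implement: with a tabular softmax policy, each step's parameters move in the direction of the discounted return times the gradient of the log-policy. The sketch below is illustrative only (the real signatures are fixed by RL2.py), assuming theta[s][a] are softmax parameters:

```python
import math

def action_probs(theta_s):
    """Softmax action probabilities from one state's policy parameters."""
    m = max(theta_s)  # subtract max for numerical stability
    exps = [math.exp(t - m) for t in theta_s]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_update(theta, episode, discount, alpha):
    """One REINFORCE update from a list of (state, action, reward) steps.

    The gradient of log pi(a|s) w.r.t. theta[s][b] is (1[a == b] - pi(b|s)).
    """
    for t, (s, a, _) in enumerate(episode):
        # Discounted return G_t from time t onward
        G = sum(discount ** (k - t) * r
                for k, (_, _, r) in enumerate(episode) if k >= t)
        pi = action_probs(theta[s])
        for b in range(len(theta[s])):
            grad = (1.0 if b == a else 0.0) - pi[b]
            theta[s][b] += alpha * (discount ** t) * G * grad
```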

**Submit the following material via Crowdmark:**

- Your Python code.
- Test your code for REINFORCE and model-based RL with the maze problem described in TestRL2Maze.py (same maze problem as in Assignment 1). Produce a graph where the x-axis indicates the episode # (from 0 to 200) and the y-axis indicates the average (based on 100 trials) of the cumulative discounted rewards per episode (100 steps) when each episode starts in state 0. The graph should contain 3 curves corresponding to
- REINFORCE (alpha=0.01, the initial policy parameters are chosen uniformly at random for each state-action pair)
- Model-based RL (epsilon=0.05, the default transition function is uniform when a state-action pair has not been visited and the initial expected reward is 0 for all state-action pairs)
- Q-learning (epsilon=0.05 and the initial Q-function is 0 for all state-action pairs)

- Test your code for UCB, epsilon-greedy bandit and Thompson sampling with the simple multi-armed bandit problem described in TestBandit.py. Produce a graph where the x-axis indicates the iteration # (from 0 to 200) and the y-axis indicates the average (based on 1000 trials) of the reward earned at each iteration. The graph should contain 3 curves corresponding to
- UCB
- Epsilon-greedy bandit (epsilon = 1 / # iterations)
- Thompson sampling (k=1 and a prior consisting of Beta distributions with all hyperparameters set to 1)
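For intuition, Thompson sampling with the Beta(1, 1) priors above keeps per-arm (alpha, beta) counts, samples a mean from each posterior, and pulls the argmax. A sketch under the assumption of Bernoulli rewards (the class and method names are illustrative, not the RL2.py interface):

```python
import random

class ThompsonBandit:
    """Thompson sampling for Bernoulli arms with Beta(1, 1) priors."""

    def __init__(self, n_arms):
        self.alpha = [1.0] * n_arms  # 1 + number of successes
        self.beta = [1.0] * n_arms   # 1 + number of failures

    def select_arm(self):
        # Sample a plausible mean reward from each arm's posterior
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=lambda i: samples[i])

    def update(self, arm, reward):
        # reward assumed to be in {0, 1}
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
```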

In the second part of the assignment, you will program the Soft Q-Learning and Soft Actor Critic algorithms (see the module on Maximum Entropy Reinforcement Learning). You will test your implementation on the CartPole problem. Start by downloading and running the following implementation of DQN in PyTorch. Then modify this DQN implementation to obtain Soft Q-Learning and Soft Actor Critic.
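The key change from DQN is the backup: Soft Q-Learning replaces max_a Q(s', a) with the soft maximum lambda * log sum_a exp(Q(s', a) / lambda), and explores with the Boltzmann policy pi(a|s) proportional to exp(Q(s, a) / lambda). A framework-free sketch of these two pieces (your actual implementation stays in PyTorch on network outputs):

```python
import math

def soft_value(q_values, lam):
    """Soft maximum lam * log sum_a exp(q/lam); tends to max(q) as lam -> 0."""
    m = max(q_values)  # subtract max for numerical stability
    return m + lam * math.log(sum(math.exp((q - m) / lam) for q in q_values))

def softmax_policy(q_values, lam):
    """Boltzmann policy: pi(a) proportional to exp(Q(a) / lam)."""
    m = max(q_values)
    exps = [math.exp((q - m) / lam) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]
```

Note that soft_value is always at least as large as the hard max, and the gap grows with the temperature lambda, which is one lens for the temperature sweep requested below.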

**Submit the following material via Crowdmark:**

- Your Python code.
- Produce a graph that shows the performance of DQN on the CartPole problem with epsilon-greedy exploration. This will serve as a baseline against which to compare Soft Q-Learning and Soft Actor Critic. Run the code above without any modification; it should automatically produce a graph where the x-axis is the number of episodes (up to 2000 episodes) and the y-axis is the number of steps before the pole falls. There will be two curves: a) the number of steps before the pole falls in each episode, and b) the average number of steps before the pole falls over the past 100 episodes.
- Produce 4 graphs that show the performance of Soft Q-Learning on the CartPole problem for 4 different values of the temperature lambda: 0.02, 0.05, 0.1, 0.2. In your implementation of Soft Q-Learning, do exploration by sampling from the softmax policy. The four graphs should each have 2 curves that show the number of steps before the pole falls in each episode and the average number of steps before the pole falls over the last 100 episodes.
- Produce 4 graphs that show the performance of Soft Actor Critic on the CartPole problem for 4 different values of the temperature lambda: 0.02, 0.05, 0.1, 0.2. In your implementation of Soft Actor Critic, do exploration by sampling actions from the policy network, which encodes a stochastic actor. The four graphs should each have 2 curves that show the number of steps before the pole falls in each episode and the average number of steps before the pole falls over the last 100 episodes.
- Explain the results. Discuss how the different properties of each algorithm influence the number of steps before the pole falls. Discuss also the impact of the temperature lambda.
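Both soft algorithms above explore by sampling an action from a discrete distribution (the softmax policy or the actor network's output) rather than acting epsilon-greedily. In PyTorch this is typically one call to a categorical sampler; the pure-Python sketch below shows the underlying inverse-CDF step, with hypothetical names:

```python
import random

def sample_action(probs):
    """Draw an action index from a discrete distribution via inverse CDF."""
    u = random.random()
    cumulative = 0.0
    for a, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return a
    return len(probs) - 1  # guard against floating-point round-off
```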