
CS885 Spring 2020 - Reinforcement Learning

There will be three assignments, each worth 10% of the final grade. Assignments are done individually (i.e., no teams). Each assignment will have a programming part to be done in Python. Some assignments will make use of PyTorch, TensorFlow and OpenAI Gym. For GPU or TPU acceleration, feel free to use Google's Colaboratory environment. This is a free cloud service where you can run Python code (including TensorFlow and PyTorch, which are pre-installed) with GPU or TPU acceleration. A virtual machine with two CPUs and one GPU or TPU will run for up to 12 hours, after which it must be restarted. The following steps are recommended:
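Whatever setup steps you follow, a quick sanity check before training is to confirm that the accelerator is actually visible. A minimal sketch, assuming PyTorch (pre-installed on Colab) and a GPU runtime selected under Runtime > Change runtime type:

    # Confirms that PyTorch can see the GPU before you start training.
    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("Using device:", device)
    if device.type == "cuda":
        print("GPU:", torch.cuda.get_device_name(0))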

The approximate out and due dates are listed with each assignment below.

Your assignment should be submitted electronically via Crowdmark by the due date. You can make as many submissions as you wish up to the deadline. If you have made a submission before the deadline, you won't be able to make further submissions after the deadline. If you did not make any submission before the deadline, you can make one late submission after the deadline. A late submission will incur a 2% penalty for every rounded-up hour past the deadline. For example, an assignment submitted 5 hours and 15 min late will receive a penalty of ceiling(5.25) * 2% = 12%. Assignments submitted more than 50 hours late will not be marked.

Assignment 1: out May 18, due May 29 (11:59 pm)

This assignment has three parts.

Part I

In the first part, you will program value iteration, policy iteration and modified policy iteration for Markov decision processes in Python. More specifically, fill in the functions in the skeleton code of the file MDP.py. The file TestMDP.py contains the simple MDP example from Lecture 2a Slides 13-14. You can verify that your code runs properly by executing "python TestMDP.py". Add print statements to this file to verify that the output of each function makes sense.
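To give a sense of the computation involved, here is a minimal value iteration sketch. It is illustrative only: the function name, the assumed |A| x |S| x |S| transition array T, the |A| x |S| reward array R and the stopping tolerance are assumptions for this example; the MDP.py skeleton defines its own signatures, which you should follow instead.

    import numpy as np

    def value_iteration(T, R, discount, tol=1e-6, max_iter=1000):
        # T: |A| x |S| x |S| transition probabilities, R: |A| x |S| rewards
        nActions, nStates, _ = T.shape
        V = np.zeros(nStates)
        for _ in range(max_iter):
            # Q(a, s) = R(a, s) + discount * sum_s' T(a, s, s') V(s')
            Q = R + discount * (T @ V)      # shape: |A| x |S|
            V_new = Q.max(axis=0)           # greedy Bellman backup
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        policy = Q.argmax(axis=0)           # greedy policy w.r.t. the final Q
        return V, policy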

Submit the following material via Crowdmark:

Part II

In the second part, you will program the Q-learning algorithm in Python. More specifically, fill in the functions in the skeleton code of the file RL.py. This file requires the file MDP.py that you programmed for Part I, so make sure to include it in the same directory. The file TestRL.py contains a simple RL problem to test your functions (i.e., the output of each function will be printed to the screen). You can verify that your code runs properly by executing "python TestRL.py".
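The heart of the algorithm is the temporal-difference update sketched below. This is a hedged illustration: the function names, the |A| x |S| layout of the Q table and the epsilon-greedy helper are assumptions, not the interface expected by RL.py.

    import numpy as np

    def q_learning_update(Q, s, a, r, s_next, alpha, discount):
        # TD target uses the greedy value of the next state
        target = r + discount * np.max(Q[:, s_next])
        Q[a, s] += alpha * (target - Q[a, s])
        return Q

    def epsilon_greedy(Q, s, epsilon):
        # Explore with probability epsilon, otherwise act greedily
        if np.random.rand() < epsilon:
            return np.random.randint(Q.shape[0])
        return int(np.argmax(Q[:, s]))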

Submit the following material via Crowdmark:

Part III

In the third part, you will train a deep Q-network to solve the CartPole problem from OpenAI Gym. This problem has continuous states that prevent the use of a tabular representation. Instead, you will use a neural network to represent the Q-function. Follow these steps to get started:
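Independently of those steps, the sketch below shows what a small Q-network for CartPole's 4-dimensional state and 2 discrete actions might look like. It assumes PyTorch; the class name, layer sizes and the commented usage lines are illustrative assumptions, not a required architecture.

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        def __init__(self, state_dim=4, num_actions=2, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, hidden),
                nn.ReLU(),
                nn.Linear(hidden, num_actions),  # one Q-value per action
            )

        def forward(self, state):
            return self.net(state)

    # Example: greedy action for a single observation from Gym
    # state = torch.as_tensor(obs, dtype=torch.float32)
    # action = QNetwork()(state).argmax().item()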

Submit the following material via Crowdmark:

Assignment 2: out June 8, due June 24 (11:59 pm)

This assignment has two parts.

Part I

In the first part, you will program three bandit algorithms (epsilon-greedy, Thompson sampling and UCB) and two RL algorithms (REINFORCE and model-based RL) in Python. More specifically, fill in the functions in the skeleton code of the file RL2.py. This file requires the file MDP.py that you programmed in Assignment 1, so make sure to include it in the same directory. The file TestRL2.py contains a simple bandit and a simple RL problem to test your functions (i.e., the output of each function will be printed to the screen). You can verify that your code runs properly by executing "python TestRL2.py".
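As an illustration of the bandit side, here is a sketch of UCB1 action selection. The function name, the exploration constant c and the counts/value_estimates arrays are assumptions made for this example; RL2.py specifies the actual functions to fill in.

    import numpy as np

    def ucb_action(counts, value_estimates, t, c=2.0):
        # counts[a]: number of pulls of arm a
        # value_estimates[a]: empirical mean reward of arm a
        # Pull each arm once before applying the confidence bound.
        if np.any(counts == 0):
            return int(np.argmin(counts))
        bonus = np.sqrt(c * np.log(t) / counts)
        return int(np.argmax(value_estimates + bonus))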

Submit the following material via Crowdmark:

Part II

In the second part of the assignment, you will program the Soft Q-Learning and Soft Actor Critic algorithms (see the module on Maximum Entropy Reinforcement Learning). You will test your implementation on the CartPole problem. Start by downloading and running the following implementation of DQN in PyTorch. Then modify this DQN implementation to obtain Soft Q-Learning and Soft Actor Critic.
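The key change from DQN is that the hard max over next-state Q-values is replaced by a soft (log-sum-exp) value, and exploration samples from the corresponding softmax policy. The sketch below illustrates just those two pieces with temperature lam (the lambda in the questions below); the function names and the commented target line are assumptions about how you might wire them into the provided DQN code, not its actual interface.

    import torch
    import torch.nn.functional as F

    def soft_value(q_values, lam):
        # V(s) = lam * log sum_a exp(Q(s, a) / lam)   (soft maximum)
        return lam * torch.logsumexp(q_values / lam, dim=-1)

    def softmax_policy_sample(q_values, lam):
        # pi(a | s) proportional to exp(Q(s, a) / lam); sample it for exploration
        probs = F.softmax(q_values / lam, dim=-1)
        return int(torch.multinomial(probs, num_samples=1).item())

    # In the DQN target, max_a Q_target(s', a) would become:
    # target = r + gamma * (1 - done) * soft_value(q_target(s_next), lam)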

Submit the following material via Crowdmark:

  • Your Python code.
  • Produce a graph that shows the performance of DQN on the CartPole problem with epsilon-greedy exploration. This will serve as a baseline against which to compare Soft Q-Learning and Soft Actor Critic. Run the code above without any modification and it should automatically produce a graph where the x-axis is the number of episodes (up to 2000 episodes) and the y-axis is the number of steps before the pole falls. There will be two curves: a) the number of steps before the pole falls in each episode, b) the average number of steps before the pole falls over the past 100 episodes (see the plotting sketch after this list).
  • Produce 4 graphs that show the performance of Soft Q-Learning on the CartPole problem for 4 different values of the temperature lambda: 0.02, 0.05, 0.1, 0.2. In your implementation of Soft Q-Learning, do exploration by sampling from the softmax policy. The four graphs should each have 2 curves that show the number of steps before the pole falls in each episode and the average number of steps before the pole falls over the last 100 episodes.
  • Produce 4 graphs that show the performance of Soft Actor Critic on the CartPole problem for 4 different values of the temperature lambda: 0.02, 0.05, 0.1, 0.2. In your implementation of Soft Actor Critic, do exploration by sampling actions from the policy network, which encodes a stochastic actor. The four graphs should each have 2 curves that show the number of steps before the pole falls in each episode and the average number of steps before the pole falls over the last 100 episodes.
  • Explain the results. Discuss how different properties of each algorithm influence the number of steps before the pole falls. Discuss also the impact of the temperature lambda.
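If the provided code does not already produce these curves after your modifications, a hedged plotting sketch is below. It assumes your training loop records the per-episode step counts in a list called steps_per_episode; the function and variable names are illustrative.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_learning_curve(steps_per_episode, title):
        steps = np.asarray(steps_per_episode, dtype=float)
        # Average over the last 100 episodes (fewer at the very start)
        running_avg = np.array([steps[max(0, i - 99): i + 1].mean()
                                for i in range(len(steps))])
        plt.plot(steps, alpha=0.4, label="steps per episode")
        plt.plot(running_avg, label="average over last 100 episodes")
        plt.xlabel("episode")
        plt.ylabel("steps before the pole falls")
        plt.title(title)
        plt.legend()
        plt.show()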