
CS885 Winter 2025 - Reinforcement Learning

The schedule below includes two tables: one for concepts (material taught by Pascal) and one for applications (papers presented by students). Readings are complementary and optional.

Table of online modules

Week Module Topic Readings (textbooks)
Jan 7 1a Course introduction (slides) [SutBar] Chapter 1, [Sze] Chapter 1
1b Markov Decision Processes, Value Iteration (slides) [SutBar] Chap. 3, [Sze] Chap. 2, [RusNor] Sec. 15.1, 17.1-17.2, 17.4, [Put] Chap. 2, 4, 5
Jan 9 2a Convergence Properties (slides) [SutBar] Sec. 4.1, 4.4, [Sze] Sec. 2.2, 2.3, [Put] Sec. 6.1-6.3, [SigBuf] Chap. 1
2b Policy Iteration (slides) (annotated slides) [SutBar] Sec. 4.3, [Put] Sec. 6.4-6.5, [SigBuf] Sec. 1.6.2.3, [RusNor] Sec. 17.3
Jan 14 3a Intro to Reinforcement Learning, Q-learning (slides) (annotated slides) [SutBar] Sec. 5.1-5.3, 6.1-6.3, 6.5, [Sze] Sec. 3.1, 4.3, [SigBuf] Sec. 2.1-2.5, [RusNor] Sec. 21.1-21.3
3b Deep Q-networks (slides) (annotated slides) [GBC] Chap. 6, 7, 8, [SutBar] Sec. 9.4, 9.7, [Sze] Sec. 4.3.2
Jan 16 4a Policy gradient (slides) [SutBar] Sec. 13.1-13.3, 13.7 [SigBuf] Sec. 5.1-5.2, [RusNor] Sec. 21.5
4b Actor critic (slides) [SutBar] Sec. 13.4-13.5, [Sze] Sec. 4.4, [SigBuf] Sec. 5.3
Jan 21 5a Trust Regions and Proximal Policies (slides) Schulman, Levine, Moritz, Jordan, Abbeel (2015) Trust Region Policy Optimization, ICML.
Schulman, Wolski, Dhariwal, Radford, Klimov (2017) Proximal Policy Optimization Algorithms, arXiv.
5b Maximum entropy RL (slides) Haarnoja, Tang, Abbeel, Levine (2017) Reinforcement Learning with Deep Energy-Based Policies, ICML.
Haarnoja, Zhou, Abbeel, Levine (2018) Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, ICML.
Jan 23 6a Multi-armed bandits (slides) [SutBar] Sec. 2.1-2.7, [Sze] Sec. 4.2.1-4.2.2
6b Bayesian and Contextual bandits (slides) [SutBar] Sec. 2.9
Jan 24 Assignment 1 due (11:59 pm)
Jan 28 7 Offline RL (slides) Levine, Kumar, Tucker, Fu (2021) Offline reinforcement learning: Tutorial, review, and perspectives on open problems, arXiv.
Kumar, Zhou, Tucker, Levine (2020) Conservative Q-Learning for Offline Reinforcement Learning, NeurIPS.
Jan 30 8a Model-based RL (slides) [SutBar] Chap. 8
8b Partially observable RL, DRQN (slides) Hausknecht, M., & Stone, P. (2015). Deep recurrent Q-learning for partially observable MDPs. In AAAI Fall Symposium Series.
Feb 4 9a Distributional RL Bellemare, Dabney, Munos (2017). A distributional perspective on reinforcement learning. ICML.
Bellemare, Dabney, Rolland. Distributional Reinforcement Learning, MIT Press, 2023.
9b Risk-Sensitive RL
Feb 6 10 Constrained RL Ray, Achiam, Amodei (2019) Benchmarking Safe Exploration in Deep Reinforcement Learning.
Liu, Halev, Liu (2021) Policy Learning with Constraints in Model-free Reinforcement Learning: A Survey, IJCAI.
Feb 7 Assignment 2 due (11:59 pm)
Feb 11 11a Bayesian RL Michael O’Gordon Duff’s PhD Thesis (2002)
Vlassis, Ghavamzadeh, Mannor, Poupart, Bayesian Reinforcement Learning (Chapter in Reinforcement Learning: State-of-the-Art), Springer Verlag, 2012
11b Meta-RL
Feb 13 12a Imitation Learning Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. In NeurIPS (pp. 4565-4573).
Torabi, F., Warnell, G., & Stone, P. (2018). Behavioral cloning from observation. In IJCAI (pp. 4950-4957).
12b Inverse RL Ziebart, B. D., Bagnell, J. A., & Dey, A. K. (2010). Modeling interaction via the principle of maximum causal entropy. In ICML.
Finn, C., Levine, S., & Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. In ICML (pp. 49-58).
Feb 17-21 Reading break
Feb 24 Project proposal due (11:59 pm)
Feb 25 13a RL with Sequence Modeling Esslinger, Platt & Amato (2022). Deep Transformer Q-Networks for Partially Observable Reinforcement Learning. arXiv.
Chen et al. (2021). Decision transformer: Reinforcement learning via sequence modeling. NeurIPS, 34, 15084-15097.
Gu, Goel, & Ré (2022). Efficiently modeling long sequences with structured state spaces. ICLR.
Gu, Dao, Ermon, Rudra & Ré (2020). HiPPO: Recurrent memory with optimal polynomial projections. NeurIPS, 33, 1474-1487.
13b RL from human feedback
Feb 27 14a Multi-task RL Vithayathil Varghese, N., & Mahmoud, Q. H. (2020). A survey of multi-task deep reinforcement learning. Electronics, 9(9), 1363.
14b RL Foundation Models
Feb 28 Assignment 3 due (11:59 pm)
March 4 15a Game Theory
15b Multi-Agent RL

Table of paper presentations

Date Presenter Discussants Topic Papers
March 6
March 11
March 13
March 18
March 20
March 25
March 27
April 1
April 3
April 17 Project report due (11:59 pm)