The schedule below includes two tables: one for concepts (material taught by Pascal) and one for applications (papers presented by students). Readings are complementary and optional.
Table of online modules
Date | Module | Topic | Readings (textbooks and papers) |
---|---|---|---|
Jan 7 | 1a | Course introduction (slides) | [SutBar] Chapter 1, [Sze] Chapter 1 |
| 1b | Markov Decision Processes, Value Iteration (slides) | [SutBar] Chap. 3, [Sze] Chap. 2, [RusNor] Sec. 15.1, 17.1-17.2, 17.4, [Put] Chap. 2, 4, 5 |
Jan 9 | 2a | Convergence Properties (slides) | [SutBar] Sec. 4.1, 4.4, [Sze] Sec. 2.2, 2.3, [Put] Sec. 6.1-6.3, [SigBuf] Chap. 1 |
| 2b | Policy Iteration (slides) (annotated slides) | [SutBar] Sec. 4.3, [Put] Sec. 6.4-6.5, [SigBuf] Sec. 1.6.2.3, [RusNor] Sec. 17.3 |
Jan 14 | 3a | Intro to Reinforcement Learning, Q-learning (slides) (annotated slides) | [SutBar] Sec. 5.1-5.3, 6.1-6.3, 6.5, [Sze] Sec. 3.1, 4.3, [SigBuf] Sec. 2.1-2.5, [RusNor] Sec. 21.1-21.3 |
| 3b | Deep Q-networks (slides) (annotated slides) | [GBC] Chap. 6, 7, 8, [SutBar] Sec. 9.4, 9.7, [Sze] Sec. 4.3.2 |
Jan 16 | 4a | Policy gradient (slides) | [SutBar] Sec. 13.1-13.3, 13.7, [SigBuf] Sec. 5.1-5.2, [RusNor] Sec. 21.5 |
| 4b | Actor critic (slides) | [SutBar] Sec. 13.4-13.5, [Sze] Sec. 4.4, [SigBuf] Sec. 5.3 |
Jan 21 | 5a | Trust Regions and Proximal Policies (slides) | Schulman, Levine, Moritz, Jordan, Abbeel (2015) Trust Region Policy Optimization, ICML. Schulman, Wolski, Dhariwal, Radford, Klimov (2017) Proximal Policy Optimization Algorithms, arXiv. |
| 5b | Maximum entropy RL (slides) | Haarnoja, Tang, Abbeel, Levine (2017) Reinforcement Learning with Deep Energy-Based Policies, ICML. Haarnoja, Zhou, Abbeel, Levine (2018) Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, ICML. |
Jan 23 | 6a | Multi-armed bandits (slides) | [SutBar] Sec. 2.1-2.7, [Sze] Sec. 4.2.1-4.2.2 |
| 6b | Bayesian and Contextual bandits (slides) | [SutBar] Sec. 2.9 |
Jan 24 | | Assignment 1 due (11:59 pm) | |
Jan 28 | 7 | Offline RL (slides) | Levine, Kumar, Tucker, Fu (2020) Offline reinforcement learning: Tutorial, review, and perspectives on open problems, arXiv. Kumar, Zhou, Tucker, Levine (2020) Conservative Q-Learning for Offline Reinforcement Learning, NeurIPS. |
Jan 30 | 8a | Model-based RL (slides) | [SutBar] Chap. 8 |
| 8b | Partially observable RL, DRQN (slides) | Hausknecht, M., & Stone, P. Deep recurrent Q-learning for partially observable MDPs. In 2015 AAAI Fall Symposium Series. |
Feb 4 | 9a | Distributional RL | Bellemare, Dabney, Munos. A distributional perspective on reinforcement learning. ICML, 2017. Bellemare, Dabney, Rowland. Distributional Reinforcement Learning, MIT Press, 2023. |
| 9b | Risk-Sensitive RL | |
Feb 6 | 10 | Constrained RL | Ray, Achiam, Amodei, Benchmarking Safe Exploration in Deep Reinforcement Learning. Liu, Halev, Liu, Policy Learning with Constraints in Model-free Reinforcement Learning: A Survey, IJCAI, 2021. |
Feb 7 | | Assignment 2 due (11:59 pm) | |
Feb 11 | 11a | Bayesian RL | Michael O’Gordon Duff’s PhD Thesis (2002). Vlassis, Ghavamzadeh, Mannor, Poupart, Bayesian Reinforcement Learning (Chapter in Reinforcement Learning: State-of-the-Art), Springer Verlag, 2012. |
| 11b | Meta-RL | |
Feb 13 | 12a | Imitation Learning | Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. In NeurIPS (pp. 4565-4573). Torabi, F., Warnell, G., & Stone, P. (2018). Behavioral cloning from observation. In IJCAI (pp. 4950-4957). |
| 12b | Inverse RL | Ziebart, B. D., Bagnell, J. A., & Dey, A. K. (2010). Modeling interaction via the principle of maximum causal entropy. In ICML. Finn, C., Levine, S., & Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. In ICML (pp. 49-58). |
Feb 17-21 | | Reading break | |
Feb 24 | | Project proposal due (11:59 pm) | |
Feb 25 | 13a | RL with Sequence Modeling | Esslinger, Platt & Amato (2022). Deep Transformer Q-Networks for Partially Observable Reinforcement Learning. arXiv. Chen et al. (2021). Decision transformer: Reinforcement learning via sequence modeling. NeurIPS, 34, 15084-15097. Gu, Goel, & Ré (2022). Efficiently modeling long sequences with structured state spaces. ICLR. Gu, Dao, Ermon, Rudra & Ré (2020). HiPPO: Recurrent memory with optimal polynomial projections. NeurIPS, 33, 1474-1487. |
| 13b | RL from human feedback | |
Feb 27 | 14a | Multi-task RL | Vithayathil Varghese, N., & Mahmoud, Q. H. (2020). A survey of multi-task deep reinforcement learning. Electronics, 9(9), 1363. |
| 14b | RL Foundation Models | |
Feb 28 | | Assignment 3 due (11:59 pm) | |
March 4 | 15a | Game Theory | |
| 15b | Multi-Agent RL | |
Table of paper presentations
Date | Presenter | Discussants | Topic | Papers |
---|---|---|---|---|
March 6 | ||||
March 11 | ||||
March 13 | ||||
March 18 | ||||
March 20 | ||||
March 25 | ||||
March 27 | ||||
April 1 | ||||
April 3 | ||||
April 17 | | | Project report due (11:59 pm) | |