The schedule below includes two tables: one for concepts (material taught by Pascal) and one for applications (papers presented by students). Readings are complementary and optional.
Table of online modules
Week | Module | Topic | Readings (textbooks) |
---|---|---|---|
Jan 7 | 1a | Course introduction (slides) | [SutBar] Chapter 1, [Sze] Chapter 1 |
Jan 7 | 1b | Markov Decision Processes, Value Iteration (slides) | [SutBar] Chap. 3, [Sze] Chap. 2, [RusNor] Sec. 15.1, 17.1-17.2, 17.4, [Put] Chap. 2, 4, 5
Jan 9 | 2a | Convergence Properties (slides) | [SutBar] Sec. 4.1, 4.4, [Sze] Sec. 2.2, 2.3, [Put] Sec. 6.1-6.3, [SigBuf] Chap. 1 |
Jan 9 | 2b | Policy Iteration (slides) (annotated slides) | [SutBar] Sec. 4.3, [Put] Sec. 6.4-6.5, [SigBuf] Sec. 1.6.2.3, [RusNor] Sec. 17.3
Jan 14 | 3a | Intro to Reinforcement Learning, Q-learning (slides) (annotated slides) | [SutBar] Sec. 5.1-5.3, 6.1-6.3, 6.5, [Sze] Sec. 3.1, 4.3, [SigBuf] Sec. 2.1-2.5, [RusNor] Sec. 21.1-21.3 |
Jan 14 | 3b | Deep Q-networks (slides) (annotated slides) | [GBC] Chap. 6, 7, 8, [SutBar] Sec. 9.4, 9.7, [Sze] Sec. 4.3.2
Jan 16 | 4a | Policy gradient (slides) | [SutBar] Sec. 13.1-13.3, 13.7, [SigBuf] Sec. 5.1-5.2, [RusNor] Sec. 21.5
Jan 16 | 4b | Actor-critic (slides) | [SutBar] Sec. 13.4-13.5, [Sze] Sec. 4.4, [SigBuf] Sec. 5.3
Jan 21 | 5a | Trust Regions and Proximal Policies (slides) | Schulman, Levine, Moritz, Jordan, Abbeel (2015) Trust Region Policy Optimization, ICML. Schulman, Wolski, Dhariwal, Radford, Klimov (2017) Proximal Policy Optimization Algorithms, arXiv.
Jan 21 | 5b | Maximum entropy RL (slides) | Haarnoja, Tang, Abbeel, Levine (2017) Reinforcement Learning with Deep Energy-Based Policies, ICML. Haarnoja, Zhou, Abbeel, Levine (2018) Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, ICML.
Jan 23 | 6a | Multi-armed bandits (slides) | [SutBar] Sec. 2.1-2.7, [Sze] Sec. 4.2.1-4.2.2 |
Jan 23 | 6b | Bayesian and contextual bandits (slides) | [SutBar] Sec. 2.9
Jan 24 | Assignment 1 due (11:59 pm) | | |
Jan 28 | 7 | Offline RL (slides) | Levine, Kumar, Tucker, Fu (2021) Offline reinforcement learning: Tutorial, review, and perspectives on open problems, arXiv. Kumar, Zhou, Tucker, Levine (2020) Conservative Q-Learning for Offline Reinforcement Learning, NeurIPS.
Jan 30 | 8a | Model-based RL (slides) | [SutBar] Chap. 8 |
Jan 30 | 8b | Partially observable RL, DRQN (slides) | Hausknecht, M., & Stone, P. (2015). Deep recurrent Q-learning for partially observable MDPs. In AAAI Fall Symposium Series.
Feb 4 | 9 | Distributional RL (slides) | Bellemare, Dabney, Munos. A distributional perspective on reinforcement learning. ICML, 2017. Bellemare, Dabney, Rowland. Distributional Reinforcement Learning, MIT Press, 2023.
Feb 6 | 10 | Constrained RL (slides) | Ray, Achiam, Amodei, Benchmarking Safe Exploration in Deep Reinforcement Learning. Liu, Halev, Liu, Policy Learning with Constraints in Model-free Reinforcement Learning: A Survey, IJCAI, 2021.
Feb 7 | Assignment 2 due (11:59 pm) | | |
Feb 11 | 11a | Game Theory (slides) (annotated slides) | [RusNor] Sec. 21.1-21.3
Feb 11 | 11b | Multi-Agent RL (slides) | Caroline Claus and Craig Boutilier (1998) The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems, AAAI. Michael Littman (1994) Markov games as a framework for multi-agent reinforcement learning, Machine Learning Proceedings. Junling Hu and Michael P. Wellman (2003) Nash Q-learning for General-Sum Stochastic Games, JMLR.
Feb 13 | 12a | Imitation Learning (slides) | Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. In NeurIPS (pp. 4565-4573). Torabi, F., Warnell, G., & Stone, P. (2018). Behavioral cloning from observation. In IJCAI (pp. 4950-4957). |
Feb 13 | 12b | Inverse RL (slides) | Ziebart, B. D., Bagnell, J. A., & Dey, A. K. (2010). Modeling interaction via the principle of maximum causal entropy. In ICML. Finn, C., Levine, S., & Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. In ICML (pp. 49-58).
Feb 14 | Paper presentation preferences due (11:59 pm) | | |
Feb 17-21 | Reading break | | |
Feb 24 | Project proposal due (11:59 pm) | | |
Feb 25 | 13 | RL with Sequence Modeling (slides) | Esslinger, Platt & Amato (2022). Deep Transformer Q-Networks for Partially Observable Reinforcement Learning. arXiv. Chen et al. (2021). Decision transformer: Reinforcement learning via sequence modeling. NeurIPS, 34, 15084-15097. Gu, Goel, & Ré (2022). Efficiently modeling long sequences with structured state spaces. ICLR. Gu, Dao, Ermon, Rudra & Ré (2020). HiPPO: Recurrent memory with optimal polynomial projections. NeurIPS, 33, 1474-1487. Gu & Dao (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. First Conference on Language Modeling. Cao et al. (2024). Mamba as Decision Maker: Exploring Multi-scale Sequence Modeling in Offline Reinforcement Learning. CoRR.
Feb 27 | 14 | RL from human feedback (slides) | Stiennon, Ouyang, Wu, Ziegler, Lowe, Voss, Radford, Amodei, Christiano (2020) Learning to summarize from human feedback, NeurIPS. Ouyang, Wu, Jiang, Wainwright, et al. (2022) Training language models to follow instructions with human feedback, NeurIPS. Holtzman, Buys, Du, Forbes, Choi (2019) The Curious Case of Neural Text Degeneration, arXiv. Rafailov, Sharma, Mitchell, Ermon, Manning, Finn (2023) Direct Preference Optimization: Your Language Model is Secretly a Reward Model, NeurIPS. Rashid, Wu, Fan, Li, Kristiadi, Poupart (2025) Towards Cost-Effective Reward Guided Text Generation, arXiv.
Feb 28 | Assignment 3 due (11:59 pm) | | |
March 4 | 15 | Multi-task RL (slides) | Vithayathil Varghese, N., & Mahmoud, Q. H. (2020). A survey of multi-task deep reinforcement learning. Electronics, 9(9), 1363. Deng, Gu, Zheng, Chen, Stevens, Wang, Sun, Su (2023) Mind2Web: Towards a Generalist Agent for the Web, NeurIPS. Ghosh et al. (2024) Octo: An Open-Source Generalist Robot Policy, NeurIPS.
Table of paper presentations
Date | Presenter | Discussants | Topic | Papers |
---|---|---|---|---|
March 6 | Arun Cheriakara Joseph | Hanwen Ju, Andy Zheng, Han Zhou, Jaffer Iqbal, Anurag Chakraborty, Yuhua Xiang, Yuxuan Li, Jiayun Zhu, Mojtaba Moodi, Zijian Lu, Dongfu Jiang, Zhiheng Jerry Lyu, Ruotian Wu, Wentao Zhang, Zhengyuan Dong, Luke Rivard | RL+LLMs | Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., ... & He, Y. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
March 6 | Ambrose Lee | Yuansheng Ni | RLHF | Dai, J., Pan, X., Sun, R., Ji, J., Xu, X., Liu, M., ... & Yang, Y. Safe RLHF: Safe Reinforcement Learning from Human Feedback. In The Twelfth International Conference on Learning Representations.
March 11 | Arezoo Alipanahv | | RLHF | Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., ... & Darrell, T. (2023). Aligning large multimodal models with factually augmented RLHF. arXiv preprint arXiv:2309.14525.
March 11 | Da Saem Lee | | RL+diffusion | Black, K., Janner, M., Du, Y., Kostrikov, I., & Levine, S. Training Diffusion Models with Reinforcement Learning. In The Twelfth International Conference on Learning Representations.
March 13 | Zhiheng Jerry Lyu | | RL+LLMs | Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2024). Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36.
March 13 | Wentao Zhang | | RL+LLMs | Carta, T., Romac, C., Wolf, T., Lamprier, S., Sigaud, O., & Oudeyer, P. Y. (2023, July). Grounding large language models in interactive environments with online reinforcement learning. In International Conference on Machine Learning (pp. 3676-3713). PMLR.
March 18 | Amy Tai | | RL for Health | Barata, C., Rotemberg, V., Codella, N. C., Tschandl, P., Rinner, C., Akay, B. N., ... & Kittler, H. (2023). A reinforcement learning model for AI-based decision support in skin cancer. Nature Medicine, 29(8), 1941-1946.
March 18 | Jakub Horenin | | RL for Health | Yang, J., Soltan, A. A., Eyre, D. W., & Clifton, D. A. (2023). Algorithmic fairness and bias mitigation for clinical machine learning with deep reinforcement learning. Nature Machine Intelligence, 5(8), 884-894.
March 20 | Yuxuan Li | | RL for Robotics | Kalashnikov, D., Varley, J., Chebotar, Y., Swanson, B., Jonschkowski, R., Finn, C., ... & Hausman, K. (2022, January). Scaling up multi-task robotic reinforcement learning. In Conference on Robot Learning (pp. 557-575). PMLR.
March 20 | Alex Starr | | RL for Robotics | Margolis, G. B., Yang, G., Paigwar, K., Chen, T., & Agrawal, P. (2024). Rapid locomotion via reinforcement learning. The International Journal of Robotics Research, 43(4), 572-587.
March 25 | Peiyi Zheng | | Multiagent RL | Yang, J., Li, A., Farajtabar, M., Sunehag, P., Hughes, E., & Zha, H. (2020). Learning to incentivize other learning agents. Advances in Neural Information Processing Systems, 33, 15208-15219.
March 25 | Qingyang Zhou | | Multiagent RL | Wen, M., Kuba, J., Lin, R., Zhang, W., Wen, Y., Wang, J., & Yang, Y. (2022). Multi-agent reinforcement learning is a sequence modeling problem. Advances in Neural Information Processing Systems, 35, 16509-16521.
March 27 | Jovin Bains | | RL for Finance | Huang, Y., Zhou, C., Cui, K., & Lu, X. (2024). A multi-agent reinforcement learning framework for optimizing financial trading strategies based on TimesNet. Expert Systems with Applications, 237, 121502.
March 27 | Lucas Noritomi-Hartwig | | RL for Finance | Coache, A., & Jaimungal, S. (2024). Reinforcement learning with dynamic convex risk measures. Mathematical Finance, 34(2), 557-587.
April 1 | Mohammad Hossein Ebtehaj | | RL for Games | Kim, J., Lee, Y. J., Kwak, M., Park, Y. J., & Kim, S. B. (2024). DynaSTI: Dynamics Modeling with Sequential Temporal Information for Reinforcement Learning in Atari. Knowledge-Based Systems, 112103.
April 1 | Mahdi Rahmani | | RL for Autonomous Driving | Huang, Z., Liu, H., Wu, J., & Lv, C. (2023). Conditional predictive behavior planning with inverse reinforcement learning for human-like autonomous driving. IEEE Transactions on Intelligent Transportation Systems, 24(7), 7244-7258.
April 3 | David Awosoga | | RL for Databases | Gu, T., Feng, K., Cong, G., Long, C., Wang, Z., & Wang, S. (2023). The RLR-Tree: A reinforcement learning based R-tree for spatial data. Proceedings of the ACM on Management of Data, 1(1), 1-26.
April 3 | Ruotian Wu | | In-Context RL | Lee, J., Xie, A., Pacchiano, A., Chandak, Y., Finn, C., Nachum, O., & Brunskill, E. (2024). Supervised pretraining can learn in-context reinforcement learning. Advances in Neural Information Processing Systems, 36.
April 17 | Project report due (11:59 pm) | | | |