The schedule below includes two tables: one for concepts (material taught by Pascal) and one for applications (papers presented by students). Readings are complementary and optional.
Table of online modules
Week | Module | Topic | Readings (textbooks) |
---|---|---|---|
Jan 7 | 1a | Course introduction (slides) | [SutBar] Chapter 1, [Sze] Chapter 1 |
Jan 7 | 1b | Markov Decision Processes, Value Iteration (slides) | [SutBar] Chap. 3, [Sze] Chap. 2, [RusNor] Sec. 15.1, 17.1-17.2, 17.4, [Put] Chap. 2, 4, 5
Jan 9 | 2a | Convergence Properties (slides) | [SutBar] Sec. 4.1, 4.4, [Sze] Sec. 2.2, 2.3, [Put] Sec. 6.1-6.3, [SigBuf] Chap. 1 |
Jan 9 | 2b | Policy Iteration (slides) (annotated slides) | [SutBar] Sec. 4.3, [Put] Sec. 6.4-6.5, [SigBuf] Sec. 1.6.2.3, [RusNor] Sec. 17.3
Jan 14 | 3a | Intro to Reinforcement Learning, Q-learning (slides) (annotated slides) | [SutBar] Sec. 5.1-5.3, 6.1-6.3, 6.5, [Sze] Sec. 3.1, 4.3, [SigBuf] Sec. 2.1-2.5, [RusNor] Sec. 21.1-21.3 |
Jan 14 | 3b | Deep Q-networks (slides) (annotated slides) | [GBC] Chap. 6, 7, 8, [SutBar] Sec. 9.4, 9.7, [Sze] Sec. 4.3.2
Jan 16 | 4a | Policy gradient (slides) | [SutBar] Sec. 13.1-13.3, 13.7, [SigBuf] Sec. 5.1-5.2, [RusNor] Sec. 21.5
Jan 16 | 4b | Actor-critic (slides) | [SutBar] Sec. 13.4-13.5, [Sze] Sec. 4.4, [SigBuf] Sec. 5.3
Jan 21 | 5a | Trust Regions and Proximal Policies (slides) | Schulman, Levine, Moritz, Jordan, Abbeel (2015) Trust Region Policy Optimization, ICML. Schulman, Wolski, Dhariwal, Radford, Klimov (2017) Proximal Policy Optimization Algorithms, arXiv.
Jan 21 | 5b | Maximum entropy RL (slides) | Haarnoja, Tang, Abbeel, Levine (2017) Reinforcement Learning with Deep Energy-Based Policies, ICML. Haarnoja, Zhou, Abbeel, Levine (2018) Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, ICML.
Jan 23 | 6a | Multi-armed bandits (slides) | [SutBar] Sec. 2.1-2.7, [Sze] Sec. 4.2.1-4.2.2 |
Jan 23 | 6b | Bayesian and contextual bandits (slides) | [SutBar] Sec. 2.9
Jan 24 | Assignment 1 due (11:59 pm) | | |
Jan 28 | 7 | Offline RL (slides) | Levine, Kumar, Tucker, Fu (2021) Offline reinforcement learning: Tutorial, review, and perspectives on open problems, arXiv. Kumar, Zhou, Tucker, Levine (2020) Conservative Q-Learning for Offline Reinforcement Learning, NeurIPS.
Jan 30 | 8a | Model-based RL (slides) | [SutBar] Chap. 8 |
Jan 30 | 8b | Partially observable RL, DRQN (slides) | Hausknecht, M., & Stone, P. (2015). Deep recurrent Q-learning for partially observable MDPs. In AAAI Fall Symposium Series.
Feb 4 | 9 | Distributional RL (slides) | Bellemare, Dabney, Munos. A distributional perspective on reinforcement learning. ICML, 2017. Bellemare, Dabney, Rowland. Distributional Reinforcement Learning, MIT Press, 2023.
Feb 6 | 10 | Constrained RL (slides) | Ray, Achiam, Amodei, Benchmarking Safe Exploration in Deep Reinforcement Learning. Liu, Halev, Liu, Policy Learning with Constraints in Model-free Reinforcement Learning: A Survey, IJCAI, 2021.
Feb 7 | Assignment 2 due (11:59 pm) | | |
Feb 11 | 11a | Game Theory (slides) (annotated slides) | [RusNor] Sec. 21.1-21.3
Feb 11 | 11b | Multi-Agent RL (slides) | Caroline Claus and Craig Boutilier (1998) The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems, AAAI. Michael Littman (1994) Markov games as a framework for multi-agent reinforcement learning, Machine Learning Proceedings. Junling Hu and Michael P. Wellman (2003) Nash Q-learning for General-Sum Stochastic Games, JMLR.
Feb 13 | 12a | Imitation Learning (slides) | Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. In NeurIPS (pp. 4565-4573). Torabi, F., Warnell, G., & Stone, P. (2018). Behavioral cloning from observation. In IJCAI (pp. 4950-4957). |
Feb 13 | 12b | Inverse RL (slides) | Ziebart, B. D., Bagnell, J. A., & Dey, A. K. (2010). Modeling interaction via the principle of maximum causal entropy. In ICML. Finn, C., Levine, S., & Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. In ICML (pp. 49-58).
Feb 14 | Paper presentation preferences due (11:59 pm) | | |
Feb 17-21 | Reading break | | |
Feb 24 | Project proposal due (11:59 pm) | | |
Feb 25 | 13 | RL with Sequence Modeling (slides) | Esslinger, Platt & Amato (2022). Deep Transformer Q-Networks for Partially Observable Reinforcement Learning. arXiv. Chen et al. (2021). Decision transformer: Reinforcement learning via sequence modeling. NeurIPS, 34, 15084-15097. Gu, Goel, & Ré (2022). Efficiently modeling long sequences with structured state spaces. ICLR. Gu, Dao, Ermon, Rudra & Ré (2020). HiPPO: Recurrent memory with optimal polynomial projections. NeurIPS, 33, 1474-1487. Gu & Dao (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. First Conference on Language Modeling. Cao et al. (2024). Mamba as Decision Maker: Exploring Multi-scale Sequence Modeling in Offline Reinforcement Learning. CoRR.
Feb 27 | 14 | RL from human feedback (slides) | Stiennon, Ouyang, Wu, Ziegler, Lowe, Voss, Radford, Amodei, Christiano (2020) Learning to summarize from human feedback, NeurIPS. Ouyang, Wu, Jiang, Wainwright, et al. (2022) Training language models to follow instructions with human feedback, NeurIPS. Holtzman, Buys, Du, Forbes, Choi (2019) The Curious Case of Neural Text Degeneration, arXiv. Rafailov, Sharma, Mitchell, Ermon, Manning, Finn (2023) Direct Preference Optimization: Your Language Model is Secretly a Reward Model, NeurIPS. Rashid, Wu, Fan, Li, Kristiadi, Poupart (2025) Towards Cost-Effective Reward Guided Text Generation, arXiv.
Feb 28 | Assignment 3 due (11:59 pm) | | |
March 4 | 15 | Multi-task RL (slides) | Vithayathil Varghese, N., & Mahmoud, Q. H. (2020). A survey of multi-task deep reinforcement learning. Electronics, 9(9), 1363. Deng, Gu, Zheng, Chen, Stevens, Wang, Sun, Su (2023) Mind2Web: Towards a Generalist Agent for the Web, NeurIPS. Ghosh et al. (2024) Octo: An Open-Source Generalist Robot Policy, NeurIPS.
Table of paper presentations
Date | Presenter | Discussants | Topic | Papers |
---|---|---|---|---|
March 6 | Arun Cheriakara Joseph | Hanwen Ju, Andy Zheng, Han Zhou, Jaffer Iqbal, Anurag Chakraborty, Yuhua Xiang, Yuxuan Li, Jiayun Zhu, Mojtaba Moodi, Zijian Lu, Dongfu Jiang, Zhiheng Jerry Lyu, Ruotian Wu, Wentao Zhang, Zhengyuan Dong, Luke Rivard | RL+LLMs | Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., ... & He, Y. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
March 6 | Ambrose Lee | Yuansheng Ni | RLHF | Dai, J., Pan, X., Sun, R., Ji, J., Xu, X., Liu, M., ... & Yang, Y. Safe RLHF: Safe Reinforcement Learning from Human Feedback. In The Twelfth International Conference on Learning Representations.
March 11 | Arezoo Alipanahv | | RLHF | Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., ... & Darrell, T. (2023). Aligning large multimodal models with factually augmented RLHF. arXiv preprint arXiv:2309.14525.
March 11 | Da Saem Lee | | RL+diffusion | Black, K., Janner, M., Du, Y., Kostrikov, I., & Levine, S. Training Diffusion Models with Reinforcement Learning. In The Twelfth International Conference on Learning Representations.
March 13 | Zhiheng Jerry Lyu | | RL+LLMs | Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2024). Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36.
March 13 | Wentao Zhang | | RL+LLMs | Carta, T., Romac, C., Wolf, T., Lamprier, S., Sigaud, O., & Oudeyer, P. Y. (2023, July). Grounding large language models in interactive environments with online reinforcement learning. In International Conference on Machine Learning (pp. 3676-3713). PMLR.
March 18 | Amy Tai | | RL for Health | Barata, C., Rotemberg, V., Codella, N. C., Tschandl, P., Rinner, C., Akay, B. N., ... & Kittler, H. (2023). A reinforcement learning model for AI-based decision support in skin cancer. Nature Medicine, 29(8), 1941-1946.
March 18 | Jakub Horenin | | RL for Health | Yang, J., Soltan, A. A., Eyre, D. W., & Clifton, D. A. (2023). Algorithmic fairness and bias mitigation for clinical machine learning with deep reinforcement learning. Nature Machine Intelligence, 5(8), 884-894.
March 20 | Yuxuan Li | | RL for Robotics | Kalashnikov, D., Varley, J., Chebotar, Y., Swanson, B., Jonschkowski, R., Finn, C., ... & Hausman, K. (2022, January). Scaling up multi-task robotic reinforcement learning. In Conference on Robot Learning (pp. 557-575). PMLR.
March 20 | Alex Starr | | RL for Robotics | Margolis, G. B., Yang, G., Paigwar, K., Chen, T., & Agrawal, P. (2024). Rapid locomotion via reinforcement learning. The International Journal of Robotics Research, 43(4), 572-587.
March 25 | Peiyi Zheng | | Multiagent RL | Yang, J., Li, A., Farajtabar, M., Sunehag, P., Hughes, E., & Zha, H. (2020). Learning to incentivize other learning agents. Advances in Neural Information Processing Systems, 33, 15208-15219.
March 25 | Qingyang Zhou | | Multiagent RL | Wen, M., Kuba, J., Lin, R., Zhang, W., Wen, Y., Wang, J., & Yang, Y. (2022). Multi-agent reinforcement learning is a sequence modeling problem. Advances in Neural Information Processing Systems, 35, 16509-16521.
March 27 | Jovin Bains | | RL for Finance | Huang, Y., Zhou, C., Cui, K., & Lu, X. (2024). A multi-agent reinforcement learning framework for optimizing financial trading strategies based on TimesNet. Expert Systems with Applications, 237, 121502.
March 27 | Lucas Noritomi-Hartwig | | RL for Finance | Coache, A., & Jaimungal, S. (2024). Reinforcement learning with dynamic convex risk measures. Mathematical Finance, 34(2), 557-587.
April 1 | Mohammad Hossein Ebtehaj | | RL for Games | Kim, J., Lee, Y. J., Kwak, M., Park, Y. J., & Kim, S. B. (2024). DynaSTI: Dynamics Modeling with Sequential Temporal Information for Reinforcement Learning in Atari. Knowledge-Based Systems, 112103.
April 1 | Mahdi Rahmani | | RL for Autonomous Driving | Huang, Z., Liu, H., Wu, J., & Lv, C. (2023). Conditional predictive behavior planning with inverse reinforcement learning for human-like autonomous driving. IEEE Transactions on Intelligent Transportation Systems, 24(7), 7244-7258.
April 3 | David Awosoga | | RL for Databases | Gu, T., Feng, K., Cong, G., Long, C., Wang, Z., & Wang, S. (2023). The RLR-Tree: A reinforcement learning based R-tree for spatial data. Proceedings of the ACM on Management of Data, 1(1), 1-26.
April 3 | Ruotian Wu | | In-Context RL | Lee, J., Xie, A., Pacchiano, A., Chandak, Y., Finn, C., Nachum, O., & Brunskill, E. (2024). Supervised pretraining can learn in-context reinforcement learning. Advances in Neural Information Processing Systems, 36.
April 17 | Project report due (11:59 pm) | | | |