PhD Defence • Artificial Intelligence | Reinforcement Learning • Policy Learning under Uncertainty and Risk | Cheriton School of Computer Science

Tuesday, August 6, 2024 1:00 pm - 4:00 pm EDT (GMT -04:00)

Please note: This PhD defence will take place in DC 2310 and online.

Yudong Luo, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Pascal Poupart

Recent years have seen a rapid growth of reinforcement learning (RL) research. In year 2015, deep RL achieved superhuman performance in Atari video games. In year 2016, the Alpha Go developed by Google DeepMind beat Lee Sedol, one of the top Go players in South Korea. In year 2022, OpenAI released ChatGPT 3.5, a powerful large language model, which is fine-tuned by RL algorithms. Traditional RL considers the problem that an agent interacts with an environment to acquire a good policy. The performance of the policy is usually evaluated by the expected value of total discounted rewards (or called return) collected in the environment. However, the mostly studied domains (including the three mentioned above) are usually deterministic or contain less randomness. In many real world applications, the domains are highly stochastic, thus agents need to perform decision making under uncertainty. Due to the randomness of the environment, another natural consideration is to minimize the risk, since only maximizing the expected return may not be sufficient. For instance, we want to avoid huge financial loss in portfolio management, which motivates the mean variance trade off.

In this thesis, we focus on the problem of policy learning under uncertainty and risk. This requires the agent to quantify the intrinsic uncertainty of the environment and be risk-averse in specific cases, instead of only caring for the mean of the return.

To quantify the intrinsic uncertainty, in this thesis, we stick to the distributional RL method. Due to the stochasticity of the environment dynamic and also stochastic policies, the future return that an agent can get at a state is naturally a random variable. Distributional RL aims to learn the full value distribution of this random variable. Usually, the value distribution is represented by its quantile function. However, the quantile functions learned by existing algorithms suffer from limited representation ability or quantile crossing issue, which is shown to hinder policy learning and exploration. We propose a new learning algorithm to directly learn a monotonic, smooth, and continuous quantile representation, which provides much flexibility for value distribution learning in distributional RL.

For risk-averse policy learning, we study two common types of risk measure, i.e., measure of variability, e.g., variance, and tail risk measure, e.g., conditional value at risk (CVaR). 1) Mean variance trade off is a classic yet popular problem in RL. Traditional methods directly restrict the total return variance. Recent methods restrict the per-step reward variance as a proxy. We thoroughly examine the limitations of these variance-based methods in the policy gradient approach, and propose to use an alternative measure of variability, Gini deviation, as a substitute. We study various properties of this new risk measure and derive a policy gradient algorithm to minimize it. 2) CVaR is another popular risk measure for risk-averse RL. However, RL algorithms utilizing policy gradients to optimize CVaR face significant challenges with sample inefficiency, hindering their practical applications. This inefficiency stems from two main facts: a focus on tail-end performance that overlooks many sampled trajectories, and the potential of gradient vanishing when the lower tail of the return distribution is overly flat. To address these challenges, we start from an insight that in many scenarios, the risk-averse behavior is only required in a subset of states, and propose a simple mixture policy parameterization. This method integrates a risk-neutral policy with an adjustable policy to form a risk-averse policy. By employing this strategy, all collected trajectories can be utilized for policy updating, and the issue of vanishing gradients is counteracted by stimulating higher returns through the risk-neutral component, thus the sample efficiency is significantly improved.

To attend this PhD defence in person, please go to DC 2310. You can also attend virtually using Zoom.

Location Information

Location Address: DC - William G. Davis Computer Research Centre
200 University Ave West
Hybrid: DC 2310 | Online PhD defence
Waterloo, ON, CA N2L 3G1

Location coordinates: