Cheat Sheet

Lecture 1: Course Introduction

Some Definitions

  • Systems that think like humans
  • Systems that think rationally
  • Systems that act like humans
  • Systems that act rationally

Topics we will cover

Lecture 2: Uninformed Search Techniques

Problem-solving agents

Sample Questions

Common Characteristics

Generic Search Algorithm

  1. Initialize search algorithm with initial state of the problem
  2. Repeat
    1. If no candidate nodes can be expanded, return failure
    2. Choose leaf node for expansion, according to search strategy
    3. If node contains a goal state, return solution
    4. Otherwise, expand the node, by applying legal operators to the state within the node. Add resulting nodes to the tree
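
A minimal Python sketch of this generic loop, where the choice made in step 2.2 is the only thing that distinguishes strategies; the `goal_test`/`successors` interface and the toy usage example are illustrative assumptions, and cycle checking is omitted.

```python
from collections import deque

def generic_search(initial_state, goal_test, successors, strategy="bfs"):
    """Generic search loop; only the order in which leaves are chosen (step 2.2)
    changes between strategies (FIFO for BFS, LIFO for DFS)."""
    frontier = deque([initial_state])
    while frontier:                                   # 2.1: nothing left to expand -> failure
        state = frontier.popleft() if strategy == "bfs" else frontier.pop()  # 2.2
        if goal_test(state):                          # 2.3: goal found -> return solution
            return state
        frontier.extend(successors(state))            # 2.4: expand, add children to the tree
    return None                                       # failure

# Tiny usage example: states are integers, successors i -> 2i, 2i+1, goal is 5.
# generic_search(1, lambda s: s == 5, lambda s: [2*s, 2*s + 1] if s < 8 else [])
```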

Evaluating search algorithms

b Branching factor
d Depth of shallowest goal node
m Maximum length of any path in the state space

Breadth-first search

All nodes on a given level are expanded before any node on the next level is expanded.

Implemented with a FIFO queue

Judging BFS

In the worst case, all uninformed search methods have exponential time complexity

Uniform Cost Search

A variation of breadth-first search that expands the node with the lowest path cost $g(n)$ instead of the shallowest node

Properties of Uniform Cost Search

$C^*$ is cost of optimal solution, $\epsilon$ is minimum action cost

Depth-first search

The deepest node in the current fringe of the search tree is expanded first.

Implemented with a stack (LIFO queue)

Judging DFS

Do not use DFS if you suspect a large tree depth

Depth-limited search

Iterative-deepening

General strategy that repeatedly does depth-limited search, but increases the limit each time

Complete, Optimal, $O(b^d)$ time, $O(bd)$ space
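
A minimal sketch of depth-limited search wrapped in an increasing-limit loop, using the same hypothetical `goal_test`/`successors` interface as the earlier search sketch; the `max_depth` cap is an assumption for illustration.

```python
def depth_limited_search(state, goal_test, successors, limit):
    """Recursive depth-limited search; returns a goal state or None."""
    if goal_test(state):
        return state
    if limit == 0:
        return None
    for child in successors(state):
        result = depth_limited_search(child, goal_test, successors, limit - 1)
        if result is not None:
            return result
    return None

def iterative_deepening_search(start, goal_test, successors, max_depth=50):
    """Repeat depth-limited search with an increasing limit (0, 1, 2, ...)."""
    for limit in range(max_depth + 1):
        result = depth_limited_search(start, goal_test, successors, limit)
        if result is not None:
            return result
    return None
```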

Summary

|          | BFS      | Uniform Cost                        | DFS      | DLS      | IDS      |
|----------|----------|-------------------------------------|----------|----------|----------|
| Complete | Yes      | Yes                                 | No       | No       | Yes      |
| Time     | $O(b^d)$ | $O(b^{\lceil C^*/\epsilon \rceil})$ | $O(b^m)$ | $O(b^l)$ | $O(b^d)$ |
| Space    | $O(b^d)$ | $O(b^{\lceil C^*/\epsilon \rceil})$ | $O(bm)$  | $O(bl)$  | $O(bd)$  |
| Optimal  | Yes      | Yes                                 | No       | No       | Yes      |

Lecture 3: Informed Search Techniques

Informed Search

Our knowledge is often about the merit of nodes, i.e., the value of being at a node

Different notions of merit

We will focus on cost of solution.

We need to develop a domain specific heuristic function, $h(n)$

$h(n)$ guesses the cost of reaching the goal from node $n$

We often have some information about the problem that can be used in forming a heuristic function (i.e., heuristics are domain specific)

If $h(n_1) < h(n_2)$ then we guess that it is cheaper to reach the goal from $n_1$ than it is from $n_2$

We require $h(n) \geq 0$ for all nodes $n$, and $h(n) = 0$ when $n$ is a goal node

Greedy best-first search

Use the heuristic function, $h(n)$, to rank the nodes in the fringe

Search strategy: Expand node with lowest $h$-value

Greedily trying to find the least-cost solution

Properties of greedy search

A* Search

Greedy best-first search is too greedy. It does not take into account the cost of the path so far!

Define $f(n) = g(n) + h(n)$ where $g(n)$ is the cost of the path to node $n$, $h(n)$ is the heuristic estimate of cost of reaching goal from node $n$.

A* search: Expand node in fringe (queue) with lowest $f$ value

When should A* terminate?

A* terminates only when goal state is popped from the queue
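
A minimal A* sketch with a priority queue, assuming a hypothetical `successors(state)` that yields `(next_state, step_cost)` pairs; note the goal test is applied when a node is popped (as stated above), and repeated states are pruned by keeping the cheapest $g$ seen so far.

```python
import heapq, itertools

def a_star(start, goal_test, successors, h):
    """A*: expand the frontier node with the lowest f = g + h."""
    counter = itertools.count()                  # tie-breaker so heapq never compares states
    frontier = [(h(start), next(counter), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        f, _, g, state, path = heapq.heappop(frontier)
        if goal_test(state):                     # terminate only when a goal is popped
            return path, g
        for nxt, cost in successors(state):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):   # graph search: keep the cheapest g seen
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), next(counter), g2, nxt, path + [nxt]))
    return None, float("inf")
```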

Is A* Optimal?

Not in general: with an arbitrary heuristic, A* may return a suboptimal solution.

Admissible heuristics

Let $h^*(n)$ denote the true minimal cost to the goal from node $n$

A heuristic, $h$, is admissible if $h(n) \leq h^*(n)$ for all $n$

Admissible heuristics never overestimate the cost to the goal; in other words, they are optimistic

If the heuristic is admissible then A* with tree-search is optimal.

To guarantee optimality with graph search, we need something stronger than admissibility: consistency (monotonicity)

A* graph-search with a consistent heuristic is optimal

Properties of A*

Memory-bounded heuristic search

A* keeps most generated nodes in memory. On many problems A* will run out of memory

Heuristic Functions

A good heuristic function can make all the difference!

One approach is to think of an easier problem and let $h(n)$ be the cost of reaching the goal in the easier problem

Lecture 4: Constraint Satisfaction

CSP Definition

A constraint satisfaction problem (CSP) is defined by $\{V,D,C\}$ where $V$ is a set of variables, $D$ is the set of their domains, and $C$ is a set of constraints over subsets of the variables

Properties of CSPs

CSPs and commutativity

CSPs are commutative!

All CSP search algorithms generate successors by considering possible assignments for only a single variable at each node in the search tree

Backtracking Search

Check the algorithm in the textbook/slides.
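
Since the algorithm itself is deferred to the textbook, here is a minimal backtracking sketch under assumed interfaces: `variables`, a `domains` dict, and a `consistent(assignment)` check are all hypothetical names, and the ordering heuristics below could be plugged in at the marked lines.

```python
def backtracking_search(variables, domains, consistent, assignment=None):
    """Plain backtracking: assign one variable per level of the search tree.
    `consistent(assignment)` is a hypothetical check of all constraints whose
    variables are already assigned."""
    if assignment is None:
        assignment = {}
    if len(assignment) == len(variables):
        return assignment                      # complete, consistent assignment
    var = next(v for v in variables if v not in assignment)   # (could use most-constrained variable here)
    for value in domains[var]:                 # (could order by least-constraining value here)
        assignment[var] = value
        if consistent(assignment):
            result = backtracking_search(variables, domains, consistent, assignment)
            if result is not None:
                return result
        del assignment[var]                    # undo and backtrack
    return None
```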

Backtracking and efficiency

Most constrained variable

Most constraining variable

Least-constraining value

Forward checking

Lecture 5: Uncertainty

Terminology

Probability distribution
A specification of a probability for each event in our sample space. Probabilities must sum to 1
Joint probability distribution
Specification of probabilities for all combinations of events

Joint distribution

Given two random variables $A$ and $B$, the joint distribution is defined as: $Pr(A = a \wedge B = b)$ for all $a, b$

Marginalisation (sum-out rule): $Pr(A = a) = \sum_b Pr(A = a \wedge B = b)$

Conditional Probability

$Pr(A|B)$: fraction of worlds in which $B$ is true that also have $A$ true

Bayes Rule

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

More General Forms of Bayes Rule

$$P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A)+P(B|\neg A)P(\neg A)}$$ $$P(A | B \wedge X) = \frac{P(B|A \wedge X) P(A|X)}{P(B|X)}$$ $$P(A = v_i | B) = \frac{P(B|A = v_i) P(A = v_i)}{\sum_{j=1}^n P(B|A = v_j) P(A = v_j)}$$
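
A small numeric illustration of the first form above, with made-up numbers for a noisy test (sensitivity 0.9, false-positive rate 0.2, prior 0.1):

```python
# P(A | B) via the first general form of Bayes rule, with made-up numbers.
p_b_given_a, p_b_given_not_a, p_a = 0.9, 0.2, 0.1

posterior = (p_b_given_a * p_a) / (p_b_given_a * p_a + p_b_given_not_a * (1 - p_a))
print(round(posterior, 3))   # 0.09 / (0.09 + 0.18) = 0.333...
```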

Probabilistic Inference

Given a prior distribution $Pr(X)$ over variables $X$ of interest, representing degrees of belief, and given new evidence $E = e$ for some variable $E$, revise your degrees of belief: posterior $Pr(X|E = e)$.

Issue

How do we specify the full joint distribution over a set of random variables $X_1, X_2, \dots, X_n$?

Inference is frightfully slow

Independence

X and Y are independent iff $Pr(x \wedge y) = Pr(x)Pr(y)$ for all $x \in dom(X), y \in dom(Y)$ (equivalently, $Pr(x|y) = Pr(x)$)

Intuitively, learning the value of $Y$ doesn't influence our beliefs about $X$ and vice versa.

Conditional Independence

Two variables $X$ and $Y$ are conditionally independent given variable $Z$ iff $Pr(x \wedge y | z) = Pr(x|z)Pr(y|z)$ for all $x \in dom(X), y \in dom(Y), z \in dom(Z)$ (equivalently, $Pr(x | y \wedge z) = Pr(x|z)$)

If you know the value of $Z$ (whatever it is), nothing you learn about $Y$ will influence your beliefs about $X$

What good is independence?

Lecture 6: Bayesian Networks

Bayesian Networks (BN)

Graphical representation of the direct dependencies over a set of variables + a set of conditional probability tables (CPTs) quantifying the strength of those influences.

A BN over variables $\{X_1, X_2, \dots, X_n\}$ consists of:

Also known as

Key notions

Semantics of a Bayes Net

The structure of the BN means: every $X_i$ is conditionally independent of all of its non-descendants given its parents:

$Pr(X_i | S \cup Par(X_i)) = Pr(X_i | Par(X_i))$ for any subset $S \subseteq NonDescendants(X_i)$

The chain rule: $$\begin{aligned} Pr(X_1, X_2, \dots, X_n) &= Pr(X_1)Pr(X_2 | X_1)Pr(X_3 | X_1, X_2) \dots Pr(X_n | X_1, X_2, \dots, X_{n-1}) \\ &= \prod_{i=1}^n Pr(X_i | Parents(X_i)) \end{aligned}$$

The joint is recoverable using the parameters (CPTs) specified in an arbitrary BN
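
A minimal sketch of recovering a joint entry from CPTs via the product over $Pr(X_i | Parents(X_i))$, using a hypothetical two-node network $A \rightarrow B$ with made-up numbers:

```python
# Recovering a joint probability from CPTs (hypothetical 2-node network A -> B).
# Pr(A, B) = Pr(A) * Pr(B | A), i.e. the product over Pr(X_i | Parents(X_i)).
pr_a = {True: 0.3, False: 0.7}                       # CPT for A (no parents)
pr_b_given_a = {True: {True: 0.8, False: 0.2},       # CPT for B given A
                False: {True: 0.1, False: 0.9}}

def joint(a, b):
    return pr_a[a] * pr_b_given_a[a][b]

print(joint(True, True))    # 0.3 * 0.8, about 0.24
print(sum(joint(a, b) for a in (True, False) for b in (True, False)))  # about 1.0
```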

Constructing a Bayes Net

Given any distribution over variables $X_1, X_2, \dots , X_n$, we can construct a Bayes net that faithfully represents that distribution.

Take any ordering of the variables (say, the order given), and go through the following procedure for $X_n$ down to $X_1$.

In the end, a DAG is produced and the BN semantics must hold by construction.

Causal Intuitions

The construction of a BN is simple

Testing Independence

Given a BN, how can we test whether two variables $X$ and $Y$ are independent (given evidence $E$)?

D-separation

A set of variables $E$ d-separates $X$ and $Y$ if it blocks every undirected path in the BN between $X$ and $Y$.

$X$ and $Y$ are conditionally independent given evidence $E$ if $E$ d-separates $X$ and $Y$

d-Separation Without Tears

Variable Elimination

The algorithm, variable elimination, simply applies the summing out rule repeatedly.

Factors

A function $f(X_1, X_2, \dots, X_k)$ is also called a factor. We can view this as a table of numbers, one for each instantiation of the variables $X_1, X_2, \dots, X_k$.

Notation: $f(\mathbf{X},\mathbf{Y})$ denotes a factor over the variables $\mathbf{X} \cup \mathbf{Y}$. (Here $\mathbf{X}, \mathbf{Y}$ are sets of variables.)

The Product of Two Factors

Summing a Variable Out of a Factor

Restricting a Factor
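
A minimal sketch of the three factor operations just listed (product, summing out, restriction), representing a factor as a pair `(variables, table)` where the table maps assignment tuples to numbers; this representation and the complete-table assumption are illustrative choices, not the course's.

```python
from itertools import product as cartesian

def multiply(f, g):
    """Product of two factors; a factor is (variables, {assignment-tuple: value}).
    Assumes both tables are complete over their variables' domains."""
    f_vars, f_tab = f
    g_vars, g_tab = g
    out_vars = list(f_vars) + [v for v in g_vars if v not in f_vars]
    domains = {}
    for vars_, tab in (f, g):                 # collect each variable's observed domain
        for row in tab:
            for v, x in zip(vars_, row):
                domains.setdefault(v, set()).add(x)
    out_tab = {}
    for row in cartesian(*[sorted(domains[v]) for v in out_vars]):
        asg = dict(zip(out_vars, row))
        out_tab[row] = (f_tab[tuple(asg[v] for v in f_vars)] *
                        g_tab[tuple(asg[v] for v in g_vars)])
    return out_vars, out_tab

def sum_out(var, f):
    """Sum a variable out of a factor."""
    f_vars, f_tab = f
    i = f_vars.index(var)
    out_vars = [v for v in f_vars if v != var]
    out_tab = {}
    for row, val in f_tab.items():
        key = row[:i] + row[i + 1:]
        out_tab[key] = out_tab.get(key, 0.0) + val
    return out_vars, out_tab

def restrict(var, value, f):
    """Restrict a factor to var = value (and drop that variable)."""
    f_vars, f_tab = f
    i = f_vars.index(var)
    out_vars = [v for v in f_vars if v != var]
    out_tab = {row[:i] + row[i + 1:]: val for row, val in f_tab.items() if row[i] == value}
    return out_vars, out_tab

# e.g. f_a = (["A"], {(True,): 0.3, (False,): 0.7});  sum_out("A", multiply(f_a, f_ab))
```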

Relevance: A Sound Approximation

We can restrict attention to relevant variables: given query $Q$ and evidence $\mathbf E$, it is sound to evaluate the query on the subnetwork comprising only the relevant variables

Lecture 8: Causal Inference

Causality

Causality is the study of how things influence one another, how causes lead to effects.

Causal dependence: $X$ causes $Y$ iff changes to $X$ induce changes to $Y$

Causal and Non-Causal Correlations

A joint distribution $P(X, Y)$ captures correlations between $X$ and $Y$, but does not indicate whether a causal relation exists between $X$ and $Y$ nor the direction of the causal relation when it exists.

A conditional distribution $P(Y|X)$ does not necessarily indicate $X$ causes $Y$.

Causal Bayesian Network

Definition: Bayesian network where all edges indicate direct causal effects.

Causal Inference

Intervention: What is the effect of an action?

Causal networks can easily support intervention queries, but non-causal networks do not.

Inference with Do Operator

$$P(X | do(Y = y), Z = z)$$

In a causal graph:

  1. Remove edges pointing to $Y$ and $P(Y|parents(Y))$
  2. Perform variable elimination on remaining graph:
    1. Restrict factors to evidence: $Y = y$ and $Z = z$
    2. Eliminate variables
    3. Multiply remaining factors and normalize
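
A small enumeration sketch of this procedure on a hypothetical confounded network $A \rightarrow B$, $A \rightarrow C$, $B \rightarrow C$ with made-up CPTs: intervening on $B$ drops $P(B|A)$ and keeps $P(A)$ unchanged, so $P(C | do(B = b))$ differs from the observational $P(C | B = b)$.

```python
# Hypothetical network: A -> B, A -> C, B -> C (A confounds B and C). All CPTs made up.
p_a = {True: 0.4, False: 0.6}
p_b_given_a = {True: 0.9, False: 0.2}                     # P(B=True | A)
p_c_given_ab = {(True, True): 0.8, (True, False): 0.5,
                (False, True): 0.6, (False, False): 0.1}  # P(C=True | A, B)

def p_c_do_b(b):
    """P(C=True | do(B=b)): drop P(B|A), keep P(A) as-is, sum out A."""
    return sum(p_a[a] * p_c_given_ab[(a, b)] for a in (True, False))

def p_c_given_b(b):
    """Observational P(C=True | B=b) for contrast: conditioning on B updates A."""
    num = sum(p_a[a] * (p_b_given_a[a] if b else 1 - p_b_given_a[a]) * p_c_given_ab[(a, b)]
              for a in (True, False))
    den = sum(p_a[a] * (p_b_given_a[a] if b else 1 - p_b_given_a[a]) for a in (True, False))
    return num / den

print(p_c_do_b(True), p_c_given_b(True))   # 0.68 vs 0.75: they differ because A confounds B and C
```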

Counterfactual Analysis

Counterfactual analysis (or counterfactual thinking): explores outcomes that did not actually occur, but which could have occurred under different conditions. It is a kind of "what if?" analysis and a useful way of testing cause-and-effect relationships.

Structural Causal Models

Idea: separate causal relations from noise

A Structural Causal Model contains: a set of exogenous (noise) variables $U$, a set of endogenous variables $V$, and a set of structural equations $F$ that determine each endogenous variable as a function of its parents and its noise variable

Conversion

Structural Causal Models (SCMs) can be converted into an equivalent Causal Bayesian Network, but not the other way around

SCMs separate causal relations from the noise and therefore provide more information

Counterfactual Analysis

Three steps: abduction (infer the noise variables from the observed evidence), action (apply the intervention), and prediction (compute the counterfactual query in the modified model). (Reference: "A First Look at Causal Inference (7): Counterfactuals, Part 1".)

Lecture 9: Reasoning over Time

Dynamic Inference

Need to reason over time

Stochastic Process

Definition

Can be viewed as a Bayes net with one random variable per time slice

Problems

Solutions

K-order Markov Process

Assumption: last $k$ states sufficient

Advantage

Can specify entire process with finitely many time slices.

Hidden Markov Models

Problems

First-order Hidden Markov Model

Definition: a hidden state $s_t$ at each time step, an observation $o_t$, a transition model $Pr(s_t | s_{t-1})$, an observation model $Pr(o_t | s_t)$, and a prior $Pr(s_0)$

Inference in temporal models

Monitoring

$Pr(s_t | o_t, \dots, o_1)$: distribution over current state given observations

Forward algorithm
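
A minimal forward-algorithm sketch for monitoring $Pr(s_t | o_t, \dots, o_1)$, with a made-up two-state transition matrix and observation model; the matrix shapes and orientation are assumptions spelled out in the docstring.

```python
import numpy as np

def forward(prior, transition, obs_model, observations):
    """Forward algorithm: returns Pr(s_t | o_1..t) as a normalized vector.
    prior[i] = Pr(s_0 = i); transition[i, j] = Pr(s_t = j | s_{t-1} = i);
    obs_model[i, o] = Pr(o_t = o | s_t = i)."""
    belief = prior.copy()
    for o in observations:
        belief = obs_model[:, o] * (transition.T @ belief)   # predict, then update on o
        belief /= belief.sum()                               # normalize
    return belief

# Made-up 2-state example (e.g. rain / no-rain with an umbrella observation):
prior = np.array([0.5, 0.5])
transition = np.array([[0.7, 0.3], [0.3, 0.7]])
obs_model = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward(prior, transition, obs_model, observations=[0, 0, 1]))
```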

Prediction

$Pr(s_{t+k} | o_t, \dots, o_1)$: distribution over future state given observations

Forward algorithm

Hindsight

$Pr(s_k | o_t, \dots, o_1)$ for $k < t$: distribution over a past state given observations

Forward-backward algorithm

Most likely explanation

$\arg\max_{s_0, \dots, s_t} Pr(s_0, \dots, s_t | o_t, \dots, o_1)$: most likely state sequence given observations

Viterbi algorithm

Complexity of temporal inference

Dynamic Bayesian Networks

DBN complexity

Non-Stationary Process

What if the process is not stationary?

Solution: add new state components until dynamics are stationary

Non-Markovian Process

What if the process is not Markovian?

Solution: add new state components until dynamics are Markovian

Markovian Stationary Process

Problem: adding components to the state description to force a process to be Markovian and stationary may significantly increase computational complexity

Solution: try to find the smallest state description that is self-sufficient (i.e., Markovian and stationary)

Lecture 10: Decision Tree Learning

What is Machine Learning?

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. [Tom Mitchell, 1997]

Inductive learning (also known as concept learning)

Hypothesis Space

Hypothesis space $H$

Objective

Generalization

A good hypothesis will generalize well (i.e., predict unseen examples correctly)

Usually, any hypothesis h found to approximate the target function f well over a sufficiently large set of training examples will also approximate the target function well over any unobserved examples

Ockham's razor

Prefer the simplest hypothesis consistent with data

Inductive learning

Decision trees

Decision tree classification

Decision trees can represent disjunctions of conjunctions of constraints on attribute values

Decision tree representation

Inducing a decision tree

Choosing attribute tests

Using information theory

Measure uncertainty (Entropy): $$I(P(v_1), \dots, P(v_n)) = \sum_{i = 1}^n -P(v_i) \log_2 P(v_i)$$ For a training set containing $p$ positive examples and $n$ negative examples: $$I(\frac{p}{p+n}, \frac{n}{p+n}) = -\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}$$

Information gain

A chosen attribute $A$ divides the training set $E$ into subsets $E_1, \dots , E_v$ according to their values for $A$, where $A$ has $v$ distinct values. $$remainder(A) = \sum_{i = 1}^v \frac{p_i + n_i}{p + n} I\left( \frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i} \right)$$ Information Gain (IG) or reduction in uncertainty from the attribute test: $$IG(A) = I\left( \frac{p}{p + n}, \frac{n}{p + n} \right) - remainder(A)$$ Choose the attribute with the largest IG
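
A small sketch computing the entropy and information-gain formulas above from counts; the example counts are made up.

```python
from math import log2

def binary_entropy(p, n):
    """I(p/(p+n), n/(p+n)) for p positive and n negative examples."""
    total = p + n
    h = 0.0
    for count in (p, n):
        if count:                          # 0 * log2(0) is taken as 0
            h -= (count / total) * log2(count / total)
    return h

def information_gain(p, n, splits):
    """IG(A) = I(...) - remainder(A); `splits` is a list of (p_i, n_i), one per value of A."""
    remainder = sum((pi + ni) / (p + n) * binary_entropy(pi, ni) for pi, ni in splits)
    return binary_entropy(p, n) - remainder

# Example: 6 positive / 6 negative examples, attribute splitting them as (4,0), (2,2), (0,4):
print(information_gain(6, 6, [(4, 0), (2, 2), (0, 4)]))   # 1 - 1/3, about 0.667 bits
```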

Performance of a learning algorithm

Overfitting

Given a hypothesis space $H$, a hypothesis $h \in H$ is said to overfit the training data if there exists some alternative hypothesis $h' \in H$ such that $h$ has smaller error than $h'$ over the training examples but $h'$ has smaller error than $h$ over the entire distribution of instances

Overfitting can decrease accuracy of decision trees by 10-25%

Avoiding overfitting

Two popular techniques:

  1. Prune statistically irrelevant nodes
  2. Ideally: stop growing tree when test set performance decreases

Choosing Tree Size

Robust validation

Cross-Validation

Lecture 11: Statistical Learning

Statistical Learning

Bayesian Learning

Bayesian Prediction

Bayesian learning properties

Bayesian learning tradeoff

Maximum a posteriori (MAP)

MAP properties

MAP computation

Maximum Likelihood (ML)

ML properties

Statistical Learning

Laplace Smoothing

Bayesian Network Parameter Learning (Max Likelihood)

Lecture 12: Case Studies

Omitted.

Lecture 13: Neural Networks

Artificial Neural Networks

ANN Unit

For each unit i:

Activation Function

Network Structures

Perceptron

Threshold Perceptron Hypothesis Space

Linear Separability

Sigmoid Perceptron

Multilayer Networks

Common activation functions \(h\)

Weight training

Least squared error

Error function \[E(\mathbf{W}) = \frac{1}{2} \sum_{n} E_n(\mathbf{W}) = \frac{1}{2} \sum_{n} || f(\boldsymbol{x_n}, \mathbf{W}) - y_n||_2^2 \] where \(\mathbf{x_n}\) is the input of the \(n^{th}\) example, \(y_n\) is the label of the \(n^{th}\) example, \(f(\boldsymbol{x_n}, \mathbf{W})\) is the output of the neural net.

Sequential Gradient Descent

Backpropagation Algorithm

Forward phase

Backward phase

Simple Example Using \(\tanh(x)\)

Please refer to the notes...
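
Since the worked example is deferred to the notes, here is a minimal numpy sketch of one forward and backward pass for a one-hidden-layer \(\tanh\) network with squared error; the layer sizes, random weights, and learning rate are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # one input example (3 features)
y = np.array([1.0])               # its label
W1 = rng.normal(size=(4, 3))      # hidden layer weights (4 tanh units)
W2 = rng.normal(size=(1, 4))      # output layer weights (linear output)
lr = 0.1

# Forward phase: compute and store the activations.
a1 = W1 @ x                       # hidden pre-activations
z1 = np.tanh(a1)                  # hidden activations
out = W2 @ z1                     # network output f(x, W)

# Backward phase: propagate the error, using tanh'(a) = 1 - tanh(a)^2.
delta_out = out - y                             # dE/d(out) for E = 1/2 ||out - y||^2
delta_hidden = (W2.T @ delta_out) * (1 - z1 ** 2)

# Gradient descent step on both weight matrices.
W2 -= lr * np.outer(delta_out, z1)
W1 -= lr * np.outer(delta_hidden, x)
```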

Non-linear regression examples

Analysis

Lecture 14: Deep Neural Networks

Deep Neural Networks

Expressiveness

Vanishing Gradients

Avoiding Vanishing Gradients

Rectified Linear Units

Overfitting

Dropout

Intuition

Lecture 15: Decision Networks

Preferences

Axioms

Why Impose These Conditions?

Utilities

Expected Utility

Decision Networks

Chance Nodes

Decision Nodes

Value node

Assumptions

Policies

Value of a Policy

Optimal Policies

Optimizing Policies: Key Points

Lecture 16: Markov Decision Processes

Markov Decision Processes

Formal Definition

Current Assumptions

Transition Model

Reward Model

Discounted/Average Rewards

If the process is infinite, isn't \(\sum_t R(s_t, a_t)\) infinite?

Policy

Policy Optimization

Value Iteration

Please check the course notes for the pseudocode and the matrix form.
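
A minimal value-iteration sketch (not the course pseudocode) for a small MDP given as arrays; the transition and reward numbers are made up.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Value iteration. P[a, s, s2] = Pr(s2 | s, a); R[s, a] = expected reward.
    Returns the optimal value function and a greedy policy."""
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * (P @ V).T          # Q[s, a] = R[s, a] + gamma * sum_s2 P[a, s, s2] V[s2]
        V_new = Q.max(axis=1)              # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Made-up 2-state, 2-action MDP:
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # transitions for action 0
              [[0.5, 0.5], [0.0, 1.0]]])   # transitions for action 1
R = np.array([[1.0, 0.0],                  # rewards in state 0
              [0.0, 2.0]])                 # rewards in state 1
print(value_iteration(P, R))
```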

Horizon Effect

Infinite Horizon

Policy Iteration

Alternate between two steps: policy evaluation (compute \(V^\pi\) for the current policy) and policy improvement (update the policy greedily with respect to \(V^\pi\))

Please check the course notes for the pseudocode.

Complexity

Lecture 17: Reinforcement Learning

Reinforcement Learning

Same as a Markov decision process, except that the transition model and reward model are unknown.

Policy Optimization

Current Assumptions

Important Components in RL

RL agents may or may not include the following components:

Model Free Evaluation

Temporal Difference (TD) Evaluation

Please check course notes for pseudocode.

Comparison

Model Free Control

Bellman's Equation

Monte Carlo Control

Temporal Difference Control

Q-Learning

Please check the course notes for pseudocode.
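
A minimal tabular Q-learning sketch with \(\epsilon\)-greedy exploration (not the course pseudocode); the gym-style `env.reset()` / `env.step(a)` interface and the `env.n_actions` attribute are assumptions.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.
    Assumes a gym-like env: reset() -> state, step(a) -> (next_state, reward, done)."""
    Q = defaultdict(float)                       # Q[(state, action)], default 0
    actions = list(range(env.n_actions))         # assumed attribute
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:        # explore
                a = random.choice(actions)
            else:                                # exploit
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            target = r if done else r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])    # TD update toward the target
            s = s2
    return Q
```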

Exploration vs Exploitation

Common Exploration Methods

Exploration and Q-learning

Lecture 18: Deep Reinforcement Learning

Function to be Approximated

Q-function Approximation

Gradient Q-learning

Please check the course notes for pseudocode.
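
A minimal sketch of one gradient (semi-gradient) Q-learning step with a linear approximator \(Q(s, a) \approx \mathbf w \cdot \phi(s, a)\); the feature function `phi` and the step size are hypothetical.

```python
import numpy as np

def gradient_q_update(w, phi, s, a, r, s2, actions, alpha=0.01, gamma=0.99, done=False):
    """One gradient Q-learning step for a linear approximator Q(s, a) = w . phi(s, a).
    phi(s, a) is a hypothetical feature function returning a numpy vector."""
    q_sa = w @ phi(s, a)
    target = r if done else r + gamma * max(w @ phi(s2, a2) for a2 in actions)
    # Semi-gradient update: the target is treated as a constant.
    w += alpha * (target - q_sa) * phi(s, a)
    return w
```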

Convergence of Linear Function Approximation Q-Learning

Linear Q-Learning converges under the same conditions as Tabular Q-learning.

Divergence of Non-linear Gradient Q-learning

Mitigating divergence

Experience Replay

Target Network

Similar to value iteration.

Check the course notes for formula.

Deep Q-network (DQN)

Check the course notes for pseudocode.

Lecture 19: Model-based Reinforcement Learning

Model-free Online RL

Model-based Online RL

Model-based RL

Model-based RL (with Value Iteration)

Please check the course notes for pseudocode.

Partial Planning

Model-based RL (with Q-learning)

Please check the course notes for pseudocode.

Partial Planning vs Replay Buffer

Dyna

Dyna-Q

Please check the course notes for pseudocode.
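
A minimal Dyna-Q sketch (not the course pseudocode): real Q-learning updates are interleaved with planning backups drawn from a learned deterministic model; the env interface and hyperparameters are assumptions.

```python
import random
from collections import defaultdict

def dyna_q(env, n_episodes=200, n_planning=10, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Dyna-Q: learn Q from real experience, learn a model (s, a) -> (r, s', done),
    and perform extra planning backups from the model after every real step."""
    Q = defaultdict(float)
    model = {}                                   # deterministic model: (s, a) -> (r, s2, done)
    actions = list(range(env.n_actions))

    def choose(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def backup(s, a, r, s2, done):
        target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = choose(s)
            s2, r, done = env.step(a)
            backup(s, a, r, s2, done)            # learn from the real transition
            model[(s, a)] = (r, s2, done)        # update the model
            for _ in range(n_planning):          # planning: replay simulated transitions
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                backup(ps, pa, pr, ps2, pdone)
            s = s2
    return Q
```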

Planning from Current State

Tractable Tree Search

Monte Carlo Tree Search (with upper confidence bound)

Please check the course notes for pseudocode.

Monte Carlo Tree Search (continued)

Please check the course notes for pseudocode.

Monte Carlo Tree Search (continued)

Please check the course notes for pseudocode.

AlphaGo

Four steps:

  1. Supervised Learning of Policy Networks
  2. Policy gradient with Policy Networks
  3. Value gradient with Value Networks
  4. Searching with Policy and Value Networks

Search Tree

Simulation

  • At each node, select edge \(a^*\) that maximizes \[a^* = \arg \max_a Q(s, a) + u(s, a)\]
  • where \(u(s, a) \propto \frac{\pi(a | s)}{1 + n(s, a)}\) is an exploration bonus \[Q(s, a) = \frac{1}{n(s, a)} \sum_{i} \mathbb 1_i(s, a) \left[ \lambda V_w(s) + (1 - \lambda) G_i \right] \] \[\mathbb 1_i(s, a) = \begin{cases} 1 & \text{if } \text{simulation } i \text{ visits } (s, a) \\ 0 & \text{otherwise} \end{cases}\]
Lecture 20: Multi-Armed Bandits

Exploration/Exploitation Tradeoff

Stochastic Bandits

Simple Heuristics

Regret

Theoretical Guarantees

  • When \(\epsilon\) is constant, the probability of selecting the optimal arm does not converge to 1 and the expected cumulative regret grows linearly in \(t\)
  • When \(\epsilon_t \propto 1/t\), the probability of selecting the optimal arm converges to 1 and the expected cumulative regret grows logarithmically in \(t\)

Empirical Mean

Optimism in the Face of Uncertainty

Convergence

  • Theorem: An optimistic strategy that always selects \(\arg \max_a UB_n(a)\) will converge to \(a^*\)
  • Proof by contradiction:

Probabilistic Upper Bound

  • Problem: We can't compute an upper bound with certainty since we are sampling
  • However, we can obtain measures that are upper bounds most of the time
  • i.e., \(Pr(R(a) \leq f(a)) \geq 1 - \delta\)
  • Example: Hoeffding's inequality \[Pr\left( R(a) \leq \tilde R(a) + \sqrt{\frac{\log(1/\delta)}{2n_a}} \right) \geq 1 - \delta\] where \(n_a\) is the number of trials of arm \(a\)

Upper Confidence Bound (UCB)

Please check the course notes for pseudocode.
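
A minimal UCB sketch in the spirit of the Hoeffding-style bound above (not the course pseudocode); the exploration constant `c`, the `pull(a)` reward function, and the Bernoulli example arms are assumptions.

```python
import math, random

def ucb_bandit(pull, n_arms, horizon=1000, c=2.0):
    """UCB: pull the arm maximizing empirical mean + sqrt(c * log(t) / n_a).
    `pull(a)` is a hypothetical function returning a reward in [0, 1]."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1                                    # try each arm once first
        else:
            a = max(range(n_arms),
                    key=lambda i: means[i] + math.sqrt(c * math.log(t) / counts[i]))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]           # incremental empirical mean
    return means, counts

# Example with Bernoulli arms of (made-up) success probabilities 0.2, 0.5, 0.7:
probs = [0.2, 0.5, 0.7]
print(ucb_bandit(lambda a: float(random.random() < probs[a]), n_arms=3))
```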

UCB Convergence

Multi-Armed Bandits

Bayesian Learning

Distributional Information

Coin Example

Bernoulli Variables

Beta Distribution

Belief Update

Thompson Sampling

Thompson Sampling Algorithm (Bernoulli Rewards)

Please check the course notes for pseudocode.
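
A minimal Thompson sampling sketch for Bernoulli rewards with Beta(1, 1) priors (not the course pseudocode); the arm probabilities in the example are made up.

```python
import random

def thompson_sampling(pull, n_arms, horizon=1000):
    """Thompson sampling with Beta(1, 1) priors for Bernoulli rewards:
    sample a mean for each arm from its posterior, play the best sample, update."""
    alpha = [1] * n_arms     # successes + 1
    beta = [1] * n_arms      # failures + 1
    for _ in range(horizon):
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        a = samples.index(max(samples))
        if pull(a):                       # Bernoulli reward
            alpha[a] += 1
        else:
            beta[a] += 1
    return alpha, beta

# Example with (made-up) success probabilities 0.2, 0.5, 0.7:
probs = [0.2, 0.5, 0.7]
print(thompson_sampling(lambda a: random.random() < probs[a], n_arms=3))
```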

Comparison

Please check the course notes for pseudocode.

Sample Size

Analysis

Lecture 21: Game Theory

Multi-agent Decision Making

Game

Game: any set of circumstances in which the outcomes depend on the actions of two or more rational, self-interested players

Game Theory

Game Theory: mathematical model of strategic interactions among two or more rational agents in a game

Learning in multi-agent frameworks

Based on utility function

Games can be

Normal Form Games

Payoff Matrices

Check notes for the picture.

Playing a Normal-Form Game

Terminologies

Dominant Strategy Equilibrium

Please check the course notes for the Alice, Bob example.

Iterative elimination of dominated strategies

Best Response

  • Given a strategy profile \(\{\sigma_i, \sigma_{-i}\}\), agent \(i\)'s strategy \(\sigma_i\) is a best response to the other agents' strategies \(\sigma_{-i}\) if and only if \[u_i(\sigma_i, \sigma_{-i}) \geq u_i (\sigma'_i, \sigma_{-i}), \forall \sigma'_i \neq \sigma_i\]
  • A rational agent will always play a best response

Nash Equilibrium

Solving for Nash equilibrium - I

Solving for Nash equilibrium - II

  • Another idea: fix the strategy of one player and find the best response of the other
  • If Bob goes for Soccer, the best response for Alice is Soccer
  • If Bob goes for Baseball, the best response for Alice is Baseball
  • So there are two pure-strategy NE: {Soccer, Soccer} and {Baseball, Baseball}
Lecture 22: Game Theory II

Pareto dominance

An outcome \(o\) Pareto dominates another outcome \(o'\) if and only if every player is weakly better off in \(o\) and at least one player is strictly better off in \(o\): \[u_i(o) \geq u_i(o'), \forall i\] \[u_i(o) > u_i(o'), \exists i\]

Pareto Optimality

An outcome \(o\) is Pareto optimal if and only if no other outcome \(o'\) Pareto dominates \(o\).

Mixed-strategy NE

Nash Theorem

Every finite game has at least one Nash equilibrium, possibly in mixed strategies

When Mixed-strategy NE

Three possibilities:

Mixed-strategy NE

Lecture 23: Multi-agent Reinforcement Learning

Stochastic Games

  • (Simultaneously moving) Stochastic Game (\(N\)-agent MDP)
  • Goal: Find optimal policy \(\pi^* = \{\pi_1^*, \dots, \pi_N^*\}\), where \(\pi_i^* = \arg \max_{\pi^i} \sum_{t = 0}^h \gamma^t \mathbb E_{\pi} [r_t^i (s, a)]\), where \(a := \{a^1, \dots, a^N\}\) and \(\pi := \{\pi^1, \dots, \pi^N\}\)
  • The environment changes in response to the players' choices!

Playing a stochastic game

Optimal Policy

Independent learning

Cooperative Stochastic Games

  • (Simultaneously moving) Stochastic Game (\(N\)-agent MDP)
  • Goal: Find optimal policy \(\pi^* = \{\pi_1^*, \dots, \pi_N^*\}\), where \(\pi_i^* = \arg \max_{\pi^i} \sum_{t = 0}^h \gamma^t \mathbb E_{\pi} [r_t^i (s, a)]\), where \(a := \{a^1, \dots, a^N\}\) and \(\pi := \{\pi^1, \dots, \pi^N\}\)
  • The environment changes in response to the players' choices!

Optimal Policy

Opponent Modelling

  • Note that an agent's best response requires knowledge of the other agents' actions
  • This is a simultaneous-move game where each agent does not know what the other agents will do
  • So each agent should maintain a belief over the other agents' actions at the current state
  • This process of maintaining and updating a belief over the next actions of other agents is called opponent modelling
  • Types of Opponent Modelling:
  • Fictitious Play

Learning in cooperative stochastic games

  • Algorithm: Joint action learner (JAL) or Joint Q learning (JQL)
  • Challenge: Respond to the environment as well as the opponent(s)
  • Same as Q learning, but agents also include the opponent action in Q-updates
  • Each agent would update its Q-values using the Bellman update: \[Q^j(s, a^j, a^{-j}) \leftarrow Q^j(s, a^j, a^{-j}) + \alpha (r^j + \gamma \max_{a'^j} Q^j(s', a'^j, a'^{-j}) - Q^j(s, a^j, a^{-j}))\]
  • Need to balance the exploration vs exploitation tradeoff
  • Objective for agent: Find the optimal policy for best response
  • Objective for system: Find the NE of the stochastic game (or Nash Q function for the game)
  • Nash Q function: Agent's immediate reward and discounted future rewards when all agents follow the NE policy \[Q^i_*(s, a) = r^i(s, a) + \gamma \sum_{s' \in S} P(s' | s, a)v^i(s', \pi_*^1, \dots, \pi_*^n)\]

Joint Q learning

Please check the course notes for pseudocode.

Convergence of joint Q learning

Joint Q learning

Please check the course notes for pseudocode.

Common exploration methods

Competitive Stochastic Games

  • (Simultaneously moving) Stochastic Game (\(N\)-agent MDP)

Optimal Policy

Learning in competitive stochastic games

Minimax Q learning

Please check the course notes for pseudocode.

Opponent Modelling

Convergence of Minimax Q learning

Exploration vs Exploitation Tradeoff

(Mixed) Stochastic Games / General-sum Stochastic Games

Please check the notes for details...

Optimal Policy

Please check the course notes.

Learning in General-sum stochastic games

Please check the course notes.

Nash Q learning

Please check the course notes for pseudocode.

Opponent Modelling

Please check the course notes.

Convergence of Nash Q-learning

Please check the course notes.

Exploration vs Exploitation Tradeoff

Please check the course notes.

Alternative approaches

Please check the course notes.