Deep Reinforcement Learning: Overview

Referenced from a lecture by Professor Ernest Ryu

1. MDP Basics
2. Imitation Learning and Behavior Cloning
3. MDP Objective and Value Functions
4. Optimal Policies and Value Functions
5. Policy Evaluation vs. Optimization
6. Deep Policy Gradient Methods
7. MDP Basics II: Value Iteration
- 7.1. Value Iteration (VI)
- 7.2. Accelerated Value Iteration
8. Policy Iteration
- 8.1. Policy Iteration Algorithm
- 8.2. Policy Improvement Theorem
9. Challenges with Undiscounted MDPs ($\gamma=1$)
10. Why PPO for RL-LLM?
11. 2-Player Zero-Sum Games
12. Recent Developments in LLMs
13. Richard Sutton Wins 2025 Turing Award
14. AlphaGo, Test-Time Compute, and Expert Iteration
15. Obituary: Daniel Kahneman

1. MDP Basics

1.1. Markov Decision Process Definition

Reinforcement Learning (RL) considers sequential decision making within a Markov Decision Process (MDP):

Time $t=0,1,...,T$

$s_{t}\longrightarrow a_{t}\rightarrow(r_{t},s_{t+1})$

Trajectory $\tau=(s_{0},a_{0},r_{0},s_{1},a_{1},r_{1},\cdot\cdot\cdot,s_{T-1},a_{T-1},r_{T-1},s_{T})$

State $s_{t}\in\mathcal{S}$ Define $\delta^{+}=\delta\cup\{\langle term>\}$. (Always assume $\delta\ne\phi$.)
Action $a_{t}\in\mathcal{A}$ (Always assume $\mathcal{A}\ne\emptyset$.)
Reward $r_{t}\in\mathbb{R}$ We will choose policy to maximize sum of reward.
T is terminal time / stopping time. Defined by $s_{T}=$ <term> = terminal state.
$T=\infty$ possible. If probability of transition to $s=$ <term> is zero, then necessarily $T=\infty$. If $T=\infty$, the MDP is said to be a continuing task and otherwise an episodic task.
Initial distribution $s_{0}\sim p_{0}$. Often $s_{0}$ is fixed.
Transition probability $p(r,s^{\prime}|s,a)$ given by environment (usually not precisely known) and $(r_{t},s_{t+1})\sim p(\cdot,\cdot|s_{t},a_{t}).$

1.2. MDP Base Definitions

$r_{t}$ is sometimes a fully deterministic function of $(s_{t},a_{t})$. If so, write $r_{t}=r(s_{t},a_{t})$.
$S_{t+1}$ is sometimes a fully deterministic function of $(s_{t},a_{t})$.
For now, assume the dynamics is stationary. In general, $(r_{t},s_{t+1})\sim p_{t}(\cdot,\cdot|s_{t},a_{t})$. Stationarity means $p_{t}(r,s^{\prime}|s,a)=p(r,s^{\prime}|s,a)$.
$a_{t}$ is chosen by the agent via a policy $\pi$, given $S_{t}$.
If $\pi$ is stochastic, then $\pi(a|s)$ is a probability distribution and $a_{t}\sim\pi(\cdot|s_{t})$.
If $\pi$ is deterministic, then $a_{t}=\pi(s_{t})$.
Often, $\pi=\pi_{\theta}$, i.e., $\pi$ will be a neural network parameterized by $\theta$.

1.3. State vs. Observation

Tiger Behind Tree

In general, an agent may not be able to observe the full state. The system may not be Markovian with respect to the observation. E.g., if you see a tiger hide behind a tree, the past observation is relevant, and the fact that you can no longer see the tiger does not mean you are safe.

This leads to the Partially Observable Markov Decision Process (POMDP) formulation, which is much more challenging than MDPs.

For us, assume the agent has fully observes the state $S_{t}$.

In LLMs, this is not an issue since the language model has the full conversational history.

1.4. MDP Generalizations

There are many generalizations of the standard MDP. We won't cover them in this course.

Non-stationary dynamics: $p_{0},p_{1},...$
The terminal time T may be pre-determined. This is an example of non-stationary dynamics, since $p_{T-1}(s_{T}=\langle term>|s_{T-1},a_{T-1})=1$, while $p_{t}(s_{t+1}=\langle term>|s_{t},a_{t})=0$ for any $t=0,...,T-2$.
The policy $\pi_{t}$ can be time-dependent. (Time-dependent policy is necessary only if dynamics is non-stationary.)
Set of possible actions may depend on $S_{t}$. If so, $a_{t}\in\mathcal{A}(s_{t})$.

In practice, the MDP you work with will incorporate some of these variations. You must understand the principles, and adapt the base theory to the particular MDP at hand.

1.5. Terminal Time Notation

Note, terminal time T is a random variable. Therefore, $\Sigma_{t=0}^{T}$ requires caution to work with. For example, $\mathbb{E}[\Sigma_{t=0}^{T}\cdot]\ne\Sigma_{t=0}^{T}\mathbb{E}[\cdot]$ and the RHS makes no sense, since the random variable T is outside of the expectation.

For notational and theoretical convenience, define an equivalent MDP that never stops by making the terminal state an absorbing state.

The MDP nominally never stops, i.e., terminal time is T = $\infty$.
State $s_{t}\in S \cup \{\langle term>\}$, and we view <term> as a normal non-terminal state.
The transition probability $p(r,s^{\prime}|s,a)$ is defined such that if $S_{t}=$ <term>, then $r_{t}=0$ and $s_{t+1}=$ <term> with probability 1, regardless of the choice of $a_{t}$.
- I.e., Once $s_{t}=$ <term>, we no longer collect rewards and never escape <term>. Therefore, the policy at terminal state $\pi(\cdot|<\text{term}>)$ is irrelevant.
- No matter what action you take, we never move from $s_{t}=$ <term>.

2. Imitation Learning and Behavior Cloning

2.1. Behavior Cloning

In imitation learning or behavior cloning, we train a $\pi_{\theta}\approx\pi_{expert}$, where $\pi_{expert}$ is the policy of an expert agent by observing actions made by $\pi_{expert}$ (usually human demonstrations).

Behavior cloning:

Step 1: Sample trajectories $\tau=(s_{0},a_{0},s_{1},a_{1},\cdot\cdot\cdot,s_{T-1},a_{T-1},s_{T})\sim(p_{0},\pi_{expert},p)$ and form a dataset of state-action pairs: $\mathcal{D}_{expert}=\{(s_{i},a_{i})\}$. (No rewards.)

Step 2: Train the model by solving

$$ \min_{\theta} \sum_{(s,a)\in\mathcal{D}_{expert}}l(\pi_{\theta}(s),a) $$

where $l$ is the cross-entropy loss. This is basically supervised learning.

Observation: Next token prediction in LLMs can be thought of as behavior cloning.

Behavior cloning may serve as a good starting point (initialization) as we will see with AlphaGo and LLM pre-training.

In many RL settings, the requirement of expert demonstrations can be onerous. In the LLM setting, however, pre-training data is plentiful.

However, behavior cloning by itself does not work very well.

2.2. Distribution Shift in Behavior Cloning

In supervised learning, a model trained on one distribution will perform when tested on the same distribution, but not on another distribution. A change of distribution from training to test is called distribution shift, distribution mismatch, covariate shift, concept drift etc.

Behavior cloning fails due to covariate shift. At test time, the trained model observes trajectories $\tau\sim(p_{0},\pi_{\theta},p)$. But recall, training was done on trajectories $\tau\sim(p_{0},\pi_{expert},p).$

As a result, $\pi_{\theta}$ will encounter states it was not trained on. Maybe $\pi_{expert}$ is very good and never drives the MDP to a "bad" state, but $\pi_{\theta}$ may be imperfect and may drift into such bad states. Although $\pi_{expert}$ may know how to recover from such a bad state, $\pi_{\theta}$ was never taught this knowledge.

2.3. Off-policy vs. On-policy Learning

Key insight: In reinforcement learning, the training data depends on the policy. This is true for both imitation learning or reward-based learning.

This observation closely relates to the distinction between off-policy and on-policy RL. (More on this later.)

Behavior cloning is off-policy learning, since it trains on data generated by another policy.

2.4. DAgger (Dataset Aggregation)

Dataset Aggregation (DAgger) is more on-policy imitation learning. In each round, the current policy is used to generate trajectories and the expert provides labels for the states. By retraining on this aggregated dataset, DAgger mitigates distribution shift.

Sample trajectory $\tau \sim p_{0}, \pi_{\text{expert}}, p$ and form $\mathcal{D} \leftarrow \mathcal{D}_{\text{expert}} = \{s_{i}, a_{i}\}$.
Train $\pi_{\theta}$ from demonstration data $\mathcal{D}$.
Sample trajectory $\tau \sim p_{0}, \pi_{\theta}, p$ and form $\mathcal{D}_{\pi_{\theta}} = \{s_{i}\}$. (No actions, no rewards.)
Ask human to label $s_{i} \in \mathcal{D}_{\pi_{\theta}}$ with actions $a_{i}$. Then let $\mathcal{D}_{\pi_{\theta}} = \{s_{i}, a_{i}\}$.
Aggregate $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_{\pi_{\theta}}$. Loop back to Step 1.

Asking human experts to take an action on states generated by $\pi_{\theta}$ is often very unnatural. Not really something you can do with LLMs anyway.

Reference: S. Ross, G. Gordon, and D. Bagnell, A reduction of imitation learning and structured prediction to no-regret online learning, AISTATS, 2011.

2.5. Causal Confusion

Another problematic effect of distributional shift in imitation learning is causal confusion.

In supervised learning, spurious correlations can be learned, but this may be not a problem so long as the training and test distributions are the same. E.g. If all cats are looking to the left and all dogs are looking to the right, the neural net may differentiate them with the direction of their heads. A distributional shift altering the distribution of the directions the animals face can ruin the performance of this model. If the correct causal relation were learned, the model would be robust to distributional shifts.

When the agent (LLM) act itself (on-policy learning), causal confusion is more easily revealed. This can then be corrected through negative rewards or expert supervision.

Reference: P. de Haan, D. Jayaraman, and S. Levine, Causal confusion in imitation learning, NeurIPS, 2019.

3. MDP Objective and Value Functions

3.1. Maximizing Expected Discounted Return

So far, we have described the dynamics of the MDP. The goal/objective of an MDP is to maximize expected discounted return:

Cumulative discounted return $G_{t} = r_{t} + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k}$
$G_0$ is (cumulative) return, $r_{t}$ is (instantaneous) reward.
Discount factor $\gamma \in (0, 1]$.
When $T = \infty$ possible and reward is bounded, use $\gamma < 1$ to ensure return is finite.
$\mathbb{E}_{\pi}$ means expectation over $a_{t} \sim \pi(\cdot | s_{t})$, $r_{t}, s_{t+1} \sim p(\cdot,\cdot | s_{t}, a_{t})$.
Define the return from time $t$ as $G_{t}$.

3.2. State Value Function $V^\pi$ and State-Action Value Function $Q^\pi$

Define the (state) value function (V-value function)

$$V^{\pi}(s) = \mathbb{E}_{\pi} [G_0 | s_0 = s]$$

as the expected return starting at state $s$ following policy $\pi$.
Note, $V^{\pi}(\text{}) = 0$.

Define the state-action value function (Q-value function)

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi} [G_0 | s_0 = s, a_0 = a]$$

as the expected return starting at state $s$ taking action $a$ following policy $\pi$ thereafter.
Note, $Q^{\pi}(\text{}, a) = 0$ for all $a \in \mathcal{A}$.

Basic properties of $V^\pi$ and $Q^\pi$
Since $\mathbb{E}_{\pi}$ denotes expectation over $a_{t} \sim \pi(\cdot | s_{t})$, $r_{t}, s_{t+1} \sim p(\cdot,\cdot | s_{t}, a_{t})$, we have

$$V^{\pi}(s) = \mathbb{E}_{\pi} [r_t + \gamma G_{t+1} | s_t = s]$$

and

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi} [r_t + \gamma G_{t+1} | s_t = s, a_t = a]$$

By stationarity,

$$V^{\pi}(s) = \mathbb{E}_{\pi} [r_t + \gamma V^{\pi}(s_{t+1}) | s_t = s]$$

and

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi} [r_t + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) | s_t = s, a_t = a]$$

1-step transition property of $V^\pi$
by the Markovian property:

$$V^{\pi}(s) = \sum_a \pi(a|s) \sum_{r,s'} p(r,s'|s,a) [r + \gamma V^{\pi}(s')]$$

1-step transition property of $Q^\pi$
by the Markovian property:

$$Q^{\pi}(s, a) = \sum_{r,s'} p(r,s'|s,a) [r + \gamma \sum_{a'} \pi(a'|s') Q^{\pi}(s', a')]$$

3.3. Preliminaries: Banach Fixed Point Theorem

Let $\mathcal{X}$ be a metric space with metric $d$. Then, we say $\mathcal{T} : \mathcal{X} \to \mathcal{X}$ is $\gamma$-contractive if

$$d(\mathcal{T}(x), \mathcal{T}(y)) \le \gamma d(x, y) \quad \forall x, y \in \mathcal{X}$$

Theorem) Let $\mathcal{X}$ be a complete metric space with metric $d$. Let $\mathcal{T}: \mathcal{X} \to \mathcal{X}$ be a $\gamma$-contractive mapping with $\gamma < 1$. Then $\mathcal{T}$ has a fixed point and the fixed point is unique. Furthermore, for any $x \in \mathcal{X}$,

$$\mathcal{T}^k (x) \to x^\star \quad \text{as } k \to \infty$$

where $x^\star$ is the unique fixed point.
($\mathbb{R}^n$ is a complete metric space with any norm.)

3.4. Bellman Equation for $V^\pi$

Theorem) Let $\pi$ be a policy. Assume $\gamma \in (0,1)$, $\mathcal{S} < \infty$, and $r \le R < \infty$ almost surely. Then $V^{\pi} : \mathcal{S}^+ \to \mathbb{R}$, the value function of $\pi$, exists, and it satisfies the Bellman equation:

$$V^{\pi}(s) = \sum_a \pi(a|s) \sum_{r,s'} p(r,s'|s,a) [r + \gamma V^{\pi}(s')]$$

Conversely, if a function $V : \mathcal{S}^+ \to \mathbb{R}$ satisfies the Bellman equation, then $V = V^{\pi}$.

Proof) Existence follows from
$|G_t| \le \sum_{k=0}^{T-t-1} \gamma^k |r_{t+k}| \le R \sum_{k=0}^\infty \gamma^k = \frac{R}{1-\gamma}$
so the expectation is well defined.

Let $\mathcal{B}^{\pi}$ be the Bellman operator (for $V$) mapping from a function to a function:

$$(\mathcal{B}^{\pi} V)(s) = \sum_a \pi(a|s) \sum_{r,s'} p(r,s'|s,a) [r + \gamma V(s')]$$

So, $\mathcal{B}^{\pi} V^{\pi} = V^{\pi}$ by the 1-step transition property, i.e., $V^{\pi}$ is a fixed point of $\mathcal{B}^{\pi}$.

Let $||V||_\infty = \max_{s \in \mathcal{S}^+} |V(s)|$. Then $||\cdot||_\infty$ is a norm on the space of functions from $\mathcal{S}^+ \text{ to } \mathbb{R}$.

If $\mathcal{B}^{\pi}$ is a strict contraction with respect to $||\cdot||_\infty$, then $V^{\pi}$ is the unique fixed-point by Banach.

Finally, we show $\mathcal{B}^{\pi}$ is a $\gamma$-contraction in the $||\cdot||_\infty$-norm:

$$|(\mathcal{B}^{\pi} U)(s) - (\mathcal{B}^{\pi} V)(s)| = \left| \sum_a \pi(a|s) \sum_{r,s'} p(r,s'|s,a) [r + \gamma U(s')] - \sum_a \pi(a|s) \sum_{r,s'} p(r,s'|s,a) [r + \gamma V(s')] \right|$$ $$ = \left| \sum_a \pi(a|s) \sum_{r,s'} p(r,s'|s,a) [\gamma U(s') - \gamma V(s')] \right|$$ $$ = \gamma \left| \sum_a \pi(a|s) \sum_{r,s'} p(r,s'|s,a) [U(s') - V(s')] \right|$$ $$ \le \gamma \sum_a \pi(a|s) \sum_{r,s'} p(r,s'|s,a) |U(s') - V(s')|$$ $$ \le \gamma \sum_a \pi(a|s) \sum_{r,s'} p(r,s'|s,a) ||U - V||_\infty$$ $$ = \gamma ||U - V||_\infty$$

Thus, $||\mathcal{B}^{\pi} U - \mathcal{B}^{\pi} V||_\infty \le \gamma ||U - V||_\infty$.
$\square$

3.5. Bellman Equation for $Q^\pi$

Theorem) Let $\pi$ be a policy. Assume $\gamma \in (0,1)$, $\mathcal{S} < \infty$, $\mathcal{A} < \infty$, and $r \le R < \infty$ almost surely. Then the state-action value function $Q^{\pi}$ exists, and it satisfies the Bellman equation

$$Q^{\pi}(s, a) = \sum_{r,s'} p(r,s'|s,a) [r + \gamma \sum_{a'} \pi(a'|s') Q^{\pi}(s', a')]$$

Conversely, if a function $Q : \mathcal{S}^+ \times \mathcal{A} \to \mathbb{R}$ satisfies the Bellman equation, then $Q = Q^{\pi}$.

Proof) First, $Q^{\pi}$ is well defined since $r$ is almost surely bounded.

Let $\mathcal{B}^{\pi}$ be the Bellman operator (for $Q$) mapping from a function to a function:

$$(\mathcal{B}^{\pi} Q)(s, a) = \sum_{r,s'} p(r,s'|s,a) [r + \gamma \sum_{a'} \pi(a'|s') Q(s', a')]$$

Clearly, $\mathcal{B}^{\pi} Q^{\pi} = Q^{\pi}$ by the 1-step transition property, i.e., $Q^{\pi}$ is a fixed point of $\mathcal{B}^{\pi}$.

Let $||Q||_\infty = \max_{s \in \mathcal{S}^+, a \in \mathcal{A}} |Q(s, a)|$.
Then, $||\cdot||_\infty$ is a norm on the space of functions from $\mathcal{S}^+ \times \mathcal{A} \text{ to } \mathbb{R}$.

If $\mathcal{B}^{\pi}$ is a strict contraction with respect to $||\cdot||_\infty$, then $Q^{\pi}$ is the unique fixed-point by Banach. In the hw, you will prove that $\mathcal{B}^{\pi}$ is a strict contraction.
$\square$

4. Optimal Policies and Value Functions

4.1. Optimal Policy and Value Functions

We say a policy $\pi^\star$ is optimal if

$$V^{\pi^\star}(s) \ge V^{\pi}(s) \quad \forall s \in \mathcal{S}^+, \forall \text{ policy } \pi.$$

(Optimal policy does not depend on starting state $s$.)

Write $V^\star = V^{\pi^\star}$ for the optimal value function.
Write $Q^\star = Q^{\pi^\star}$ for the optimal Q-value function.

As we soon establish, the optimal value functions $V^\star$ and $Q^\star$ are unique, but optimal policy $\pi^\star$ is not unique. However, all optimal policies yield the same value functions.

4.2. Bellman Optimality Equation for $V^\star$

Theorem) Assume $\gamma \in (0,1)$, $\mathcal{S} < \infty$, $\mathcal{A} < \infty$, and $r \le R < \infty$ almost surely. Then the optimal value function $V^\star : \mathcal{S}^+ \to \mathbb{R}$ exists, and it satisfies the Bellman optimality equation:

$$V^\star(s) = \max_a \sum_{r,s'} p(r,s'|s,a) [r + \gamma V^\star(s')]$$

Conversely, if a function $V : \mathcal{S}^+ \to \mathbb{R}$ satisfies the Bellman optimality equation, then $V = V^\star$.
Finally,

$$\pi^\star(s) = \arg\max_a \sum_{r,s'} p(r,s'|s,a) [r + \gamma V^\star(s')]$$

is an optimal deterministic policy.

Quick lemma
Lemma) For any $u(s)$ and $v(s)$,

$$\max_a (f(a) + g(a)) - \max_a f(a) \le \max_a g(a)$$

Proof)
Let $a_u = \arg\max_a (f(a) + g(a))$ and $a_f = \arg\max_a f(a)$.
Then,

$$(f(a_u) + g(a_u)) - f(a_f) \le (f(a_u) + g(a_u)) - f(a_u) = g(a_u) \le \max_a g(a)$$

and symmetrically,

$$(f(a_f) + g(a_f)) - f(a_u) \le (f(a_f) + g(a_f)) - f(a_f) = g(a_f) \le \max_a g(a)$$

We conclude with

$$|\max_a (f(a) + g(a)) - \max_a f(a)| \le \max_a |g(a)|$$

$\square$

Proof) Let $\mathcal{B}^\star$ be the Bellman optimality operator (for $V$) mapping from a function to a function:

$$(\mathcal{B}^\star V)(s) = \max_a \sum_{r,s'} p(r,s'|s,a) [r + \gamma V(s')]$$

By the following reasoning, $\mathcal{B}^\star$ is a strict contraction with respect to $||\cdot||_\infty$.

$$|(\mathcal{B}^\star U)(s) - (\mathcal{B}^\star V)(s)| = |\max_a \sum_{r,s'} p(r,s'|s,a) [r + \gamma U(s')] - \max_a \sum_{r,s'} p(r,s'|s,a) [r + \gamma V(s')]|$$

Let $f(a) = \sum_{r,s'} p(r,s'|s,a) r$ and $g(a) = \sum_{r,s'} p(r,s'|s,a) \gamma U(s') - \sum_{r,s'} p(r,s'|s,a) \gamma V(s')$.
Then, $|g(a)| \le \gamma ||U-V||_\infty$.
So $\mathcal{B}^\star$ has a unique fixed-point that we denote as $V^\star$.
(We don’t yet know if $V^\star$ is the optimal value function.)

Next, define $\pi^\star$ to be a deterministic policy defined by

$$\pi^\star(s) = \arg\max_a \sum_{r,s'} p(r,s'|s,a) [r + \gamma V^\star(s')]$$

where ties in the argmax are broken arbitrarily. (We don’t yet know if $\pi^\star$ is the optimal policy.)
Then, $V^\star = V^{\pi^\star}$ by

$$V^\star(s) = \sum_a \pi^\star(a|s) \sum_{r,s'} p(r,s'|s,a) [r + \gamma V^\star(s')]$$

(This is the Bellman equation for $V^{\pi^\star}$, so $V^\star = V^{\pi^\star}$ by the uniqueness of the fixed point.)

Lemma) Let $\pi$ be a policy. Let $\mathcal{B}^{\pi}$ be the Bellman operator and $\mathcal{B}^\star$ the Bellman optimality operator. For any $V : \mathcal{S}^+ \to \mathbb{R}$, we have

$$\mathcal{B}^{\pi} V \le \mathcal{B}^\star V$$

Moreover, for any $U : \mathcal{S}^+ \to \mathbb{R}$ and $V : \mathcal{S}^+ \to \mathbb{R}$ satisfying $U \le V$, we have

$$\mathcal{B}^\star U \le \mathcal{B}^\star V$$

(Here, $\le$ denotes pointwise inequality. So $U \le V$ means $U(s) \le V(s)$ for all $s \in \mathcal{S}^+$.)
Proof) Homework.

We now show that $\pi^\star$ is optimal. Let $\pi$ be any policy. Then,

$$V^{\pi} = \mathcal{B}^{\pi} V^{\pi} \le \mathcal{B}^\star V^{\pi} \le (\mathcal{B}^\star)^2 V^{\pi} \le \cdots \le (\mathcal{B}^\star)^k V^{\pi} \to V^\star = V^{\pi^\star}$$

($V^{\pi} = \mathcal{B}^{\pi} V^{\pi}$ follows from the Bellman equation theorem. $\mathcal{B}^{\pi} V^{\pi} \le \mathcal{B}^\star V^{\pi}$ follows from the previous lemma. Since $V^{\pi} \le \mathcal{B}^\star V^{\pi}$, we can apply $\mathcal{B}^\star$ to both sides to get $\mathcal{B}^\star V^{\pi} \le (\mathcal{B}^\star)^2 V^{\pi}$. Convergence is due to Banach.)
So $V^{\pi} \le V^{\pi^\star}$ for any policy $\pi$. So $\pi^\star$ is optimal and $V^\star$ is the optimal value function.
Finally, since $V^\star = V^{\pi^\star}$, we have

$$V^\star(s) = \max_a \sum_{r,s'} p(r,s'|s,a) [r + \gamma V^\star(s')]$$

$\square$

4.3. Bellman Optimality Equation for $Q^\star$

Theorem) Assume $\gamma \in (0,1)$, $\mathcal{S} < \infty$, $\mathcal{A} < \infty$, and $r \le R < \infty$ almost surely. Then the optimal state-action value function $Q^\star : \mathcal{S}^+ \times \mathcal{A} \to \mathbb{R}$ exists, and it satisfies the Bellman optimality equation:

$$Q^\star(s, a) = \sum_{r,s'} p(r,s'|s,a) [r + \gamma \max_{a'} Q^\star(s', a')]$$

Conversely, if a function $Q : \mathcal{S}^+ \times \mathcal{A} \to \mathbb{R}$ satisfies the Bellman optimality equation, then $Q = Q^\star$. Finally,

$$\pi^\star(s) = \arg\max_a Q^\star(s, a)$$

is an optimal deterministic policy.

Proof) Let $\mathcal{B}^\star$ be the Bellman optimality operator (for $Q$) mapping from a function to a function:

$$(\mathcal{B}^\star Q)(s, a) = \sum_{r,s'} p(r,s'|s,a) [r + \gamma \max_{a'} Q(s', a')]$$

In the homework assignment, you will show $\mathcal{B}^\star$ is a strict contraction with respect to $||\cdot||_\infty$. So $\mathcal{B}^\star$ has a unique fixed-point that we denote as $Q^\star$.

It remains to show that $Q^\star$ is the optimal state-value function.
We have $\max_{a \in \mathcal{A}} Q^\star(s, a) = V^\star(s)$ since

$$V^\star(s) = \max_a \sum_{r,s'} p(r,s'|s,a) [r + \gamma V^\star(s')]$$ $$ = \max_a \sum_{r,s'} p(r,s'|s,a) [r + \gamma \max_{a''} Q^\star(s', a'')]$$ $$ = \max_a Q^\star(s, a)$$

By our previous theorem, an optimal policy $\pi^\star$ exists and $V^\star = V^{\pi^\star}$.
We conclude $Q^\star = Q^{\pi^\star}$ with

$$Q^{\pi^\star}(s, a) = \sum_{r,s'} p(r,s'|s,a) [r + \gamma V^{\pi^\star}(s')]$$ $$ = \sum_{r,s'} p(r,s'|s,a) [r + \gamma \max_{a'} Q^{\pi^\star}(s', a')]$$

$\square$

5. Policy Evaluation vs. Optimization

The goal of RL is to find a good policy $\pi$.
We start by studying how to evaluate a given policy $\pi$.

Called policy evaluation.

We later talk about how to optimize/improve the policy $\pi$.

Called policy optimization, policy improvement, control.

5.1. Policy Evaluation: Monte Carlo (MC)

First consider approximating the value function $V^{\pi}$ of a given policy $\pi$.
For a given $s \in \mathcal{S}$, Monte Carlo (MC) method is

$$V^{\pi}(s) \approx \frac{1}{N} \sum_{i=1}^N G_0^{(i)}$$

where $G_0^{(i)}$ is the return from the $i$-th trajectory, and $N$ independent trajectories with $s_0^{(i)} = s$.
If the size of $\mathcal{S}$ is small, we could do this for all $s \in \mathcal{S}$ to approximate $V^{\pi}(\cdot)$.

$V_\phi \approx V^\pi$ with MC
When the size of $\mathcal{S}$ is large or infinite, we must approximate $V^{\pi}$ via a neural network $V_\phi$:

$$\mathcal{L}(\phi) = \mathbb{E}_{\tau \sim (p_0, \pi, p)} \left[ \sum_{t=0}^{T-1} (G_t - V_\phi(s_t))^2 \right]$$

Since $V^{\pi}(\text{}) = 0$, we usually let $V_\phi : \mathcal{S} \to \mathbb{R}$, i.e., we don’t allow $V_\phi(\text{})$ and hard-code the case when $V^{\pi}(\text{}) = 0$ is needed.

Gradient of $\mathcal{L}$:

$$\nabla_\phi \mathcal{L}(\phi) = \mathbb{E}_{\tau \sim (p_0, \pi, p)} \left[ \sum_{t=0}^{T-1} 2(G_t - V_\phi(s_t)) (-\nabla_\phi V_\phi(s_t)) \right]$$ $$ = \mathbb{E}_{\tau \sim (p_0, \pi, p)} \left[ \sum_{t=0}^{T-1} 2(V_\phi(s_t) - G_t) \nabla_\phi V_\phi(s_t) \right]$$

$V_\phi \approx V^\pi$ with MC + SGD
The stochastic gradient is $g = \sum_{t=0}^{T-1} 2(V_\phi(s_t) - G_t) \nabla_\phi V_\phi(s_t)$, where $\tau$ is a single trajectory.
The update is $\phi \leftarrow \phi - \alpha g$.

5.2. Policy Evaluation: Temporal Difference (TD)

In Temporal Difference (TD) learning, we use the reward of one transition.

$$\mathcal{L}(\phi) = \mathbb{E}_{\tau \sim (p_0, \pi, p)} \left[ \sum_{t=0}^{T-1} (r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t))^2 \right]$$

Exactly actionable only if $V^{\pi}$ is known.
$N$ independent 1-step transitions with $s_0^{(i)} = s$.

$V_\phi \approx V^\pi$ with TD + (apx) SGD
Gradient of $\mathcal{L}$:

$$\nabla_\phi \mathcal{L}(\phi) = \mathbb{E}_{\tau \sim (p_0, \pi, p)} \left[ \sum_{t=0}^{T-1} 2(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)) (\gamma \nabla_\phi V_\phi(s_{t+1}) - \nabla_\phi V_\phi(s_t)) \right]$$

5.3. Stop-Gradient Operator

How should we compute the approximate stochastic gradient?
Option 1. Compute $(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t))$ (a scalar) and $\nabla_\phi V_\phi(s_t)$ and multiply the two.
This is correct, but it is cumbersome.

Option 2. Forward-evaluate $r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ and backprop to compute $\nabla_\phi (r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t))^2$.
This is incorrect! If you apply the chain rule, you don’t get the same $g$.

Option 3. Forward-evaluate $r_t + \gamma V_\phi(s_{t+1})$, forward-evaluate $V_\phi(s_t)$ and backprop to compute $\nabla_\phi (r_t + \gamma \overline{V_\phi(s_{t+1})} - V_\phi(s_t))^2$, where $\overline{\cdot}$ denotes the stop-gradient operator.
(In PyTorch, use `.detach()` to perform stop-gradient.)
The stop-gradient operator treats its input as a constant, and stops backpropagation, i.e., the chain rule is not invoked on the input of $\overline{\cdot}$.

$V_\phi \approx V^\pi$ with TD + (apx) SGD
Using the stop-gradient operator, express TD with approximate SGD as

$$\phi \leftarrow \phi - \alpha (r_t + \gamma \overline{V_\phi(s_{t+1})} - V_\phi(s_t)) \nabla_\phi V_\phi(s_t)$$

If $s_{t+1} == \text{}$, then $V_\phi(s_{t+1}) = 0$.

5.4. Semi-Gradient Method

TD + apx SGD with $\overline{V_\phi(s_{t+1})}$ is provably not an instance of gradient descent.
These methods are called semi-gradient methods, in the sense that it kind of resembles a gradient method.

One can replace the gradient computation with $\nabla_\phi (r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t))^2$, i.e., try to directly minimize the Bellman error.
This has been explored under the name gradient TD (GTD), but GTD tends to perform worse than the semi-gradient TD.

References:
Appendix I, E. Barnard, Temporal-difference methods and Markov models, IEEE Transactions on Systems, Man, and Cybernetics, 1993.
R. S. Sutton, H. Maei, C. Szepesvári, A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation, NeurIPS, 2008.
R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora, Fast gradient-descent methods for temporal-difference learning with linear function approximation, ICML, 2009.

5.5. k-step TD Policy Evaluation

k-step TD interpolates between MC and (1-step) TD using the k-step transition property:

$$G_t^{(k)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V^{\pi}(s_{t+k})$$

where $k \wedge T = \min(k, T)$.

k-step transition property

$$V^{\pi}(s) = \mathbb{E}_{\pi} [G_t^{(k)} | s_t = s]$$

Exactly actionable only if $V^{\pi}$ is known.
$N$ independent partial trajectories with $s_0^{(i)} = s$.

$V^{\pi} \approx V_\phi$ with k-step TD
Gradient of $\mathcal{L}$:

$$\nabla_\phi \mathcal{L}(\phi) = \mathbb{E}_{\tau \sim (p_0, \pi, p)} \left[ \sum_{t=0}^{T-1} 2(G_t^{(k)} - V_\phi(s_t)) (-\nabla_\phi V_\phi(s_t)) \right]$$ $$ = \mathbb{E}_{\tau \sim (p_0, \pi, p)} \left[ \sum_{t=0}^{T-1} 2(V_\phi(s_t) - G_t^{(k)}) \nabla_\phi V_\phi(s_t) \right]$$

$V^{\pi} \approx V_\phi$ with k-step TD + (apx) SGD
Approximate SGD implementation ($g$ is no longer an unbiased estimate of $\nabla\mathcal{L}$)

$$\phi \leftarrow \phi - \alpha (G_t^{(k)} - V_\phi(s_t)) \nabla_\phi V_\phi(s_t)$$

If $s_{k \wedge T} == \text{}$, then $V_\phi(s_{k \wedge T}) = 0$.

Using the stop-gradient operator, express k-step TD with approximate SGD as

$$\phi \leftarrow \phi - \alpha (\overline{G_t^{(k)}} - V_\phi(s_t)) \nabla_\phi V_\phi(s_t)$$

If $s_{k \wedge T} == \text{}$, then $V_\phi(s_{k \wedge T}) = 0$.
(Here, the random sampling of $r_t$ does not depend on $\phi$, so $\partial r_t / \partial \phi = 0$. This will change when we perform policy optimization.)

5.6. MC vs. TD (Bootstrapping)

The goal is to train $V_\phi \approx V^{\pi}$. When we need to use $V^{\pi}$, we replace it with $V_\phi$.
This is called bootstrapping.

MC evaluation updates $V_\phi$ by waiting for the episode to terminate.
TD evaluation updates $V_\phi$ without waiting for the episode to terminate; they instead bootstrap and use $V_\phi$.
Initially, when $V_\phi$ is inaccurate, bootstrapping may cause instabilities.
Later, when $V_\phi$ is accurate, waiting for the episode to terminate may be wasteful.
As a rule of thumb, an intermediate value of $k = 5$ is a good compromise.

5.7. $Q_\phi \approx Q^\pi$ with MC

Policy evaluation for $Q^{\pi}$ follows the same principles.
For MC,

$$Q^{\pi}(s, a) \approx \frac{1}{N} \sum_{i=1}^N G_0^{(i)}$$

where $G_0^{(i)}$ is the return from the $i$-th trajectory, and $N$ independent trajectories with $s_0^{(i)} = s$ and $a_0^{(i)} = a$.

$Q_\phi \approx Q^\pi$ with MC

$$\mathcal{L}(\phi) = \mathbb{E}_{\tau \sim (p_0, \pi, p)} \left[ \sum_{t=0}^{T-1} (G_t - Q_\phi(s_t, a_t))^2 \right]$$

Gradient of $\mathcal{L}$:

$$\nabla_\phi \mathcal{L}(\phi) = \mathbb{E}_{\tau \sim (p_0, \pi, p)} \left[ \sum_{t=0}^{T-1} 2(Q_\phi(s_t, a_t) - G_t) \nabla_\phi Q_\phi(s_t, a_t) \right]$$

$Q_\phi \approx Q^\pi$ with MC + SGD
The stochastic gradient is $g = \sum_{t=0}^{T-1} 2(Q_\phi(s_t, a_t) - G_t) \nabla_\phi Q_\phi(s_t, a_t)$, where $\tau$ is a single trajectory.
The update is $\phi \leftarrow \phi - \alpha g$.

5.8. $Q_\phi \approx Q^\pi$ with TD + (Approx) SGD

1-step TD works with the same principles.

$$\mathcal{L}(\phi) = \mathbb{E}_{\tau \sim (p_0, \pi, p)} \left[ \sum_{t=0}^{T-1} (r_t + \gamma Q_\phi(s_{t+1}, a_{t+1}) - Q_\phi(s_t, a_t))^2 \right]$$

Exactly actionable only if $Q^{\pi}$ is known.
$N$ independent 1-step transitions and action $a_{t+1}^{(i)}$ with $s_t^{(i)} = s$ and $a_t^{(i)} = a$.

$Q_\phi \approx Q^\pi$ with TD + (apx) SGD
Gradient of $\mathcal{L}$:

$$\nabla_\phi \mathcal{L}(\phi) = \mathbb{E}_{\tau \sim (p_0, \pi, p)} \left[ \sum_{t=0}^{T-1} 2(r_t + \gamma Q_\phi(s_{t+1}, a_{t+1}) - Q_\phi(s_t, a_t)) (\gamma \nabla_\phi Q_\phi(s_{t+1}, a_{t+1}) - \nabla_\phi Q_\phi(s_t, a_t)) \right]$$

Approximate SGD implementation ($g$ is no longer an unbiased estimate of $\nabla\mathcal{L}$)

$$\phi \leftarrow \phi - \alpha (r_t + \gamma Q_\phi(s_{t+1}, a_{t+1}) - Q_\phi(s_t, a_t)) \nabla_\phi Q_\phi(s_t, a_t)$$

If $s_{t+1} == \text{}$, then $Q_\phi(s_{t+1}, a_{t+1}) = 0$.

Using the stop-gradient operator, express TD with approximate SGD as

$$\phi \leftarrow \phi - \alpha (r_t + \gamma \overline{Q_\phi(s_{t+1}, a_{t+1})} - Q_\phi(s_t, a_t)) \nabla_\phi Q_\phi(s_t, a_t)$$

If $s_{t+1} == \text{}$, then $Q_\phi(s_{t+1}, a_{t+1}) = 0$.

5.9. $Q_\phi \approx Q^\pi$ with k-step TD + (Approx) SGD

k-step TD works with the same principles.

$$G_t^{(k)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k Q^{\pi}(s_{t+k}, a_{t+k})$$

Exactly actionable only if $Q^{\pi}$ is known.
$N$ independent partial trajectories with $s_0^{(i)} = s$ and $a_0^{(i)} = a$.

$Q_\phi \approx Q^\pi$ with k-step TD + SGD
Gradient of $\mathcal{L}$:

$$\nabla_\phi \mathcal{L}(\phi) = \mathbb{E}_{\tau \sim (p_0, \pi, p)} \left[ \sum_{t=0}^{T-1} 2(G_t^{(k)} - Q_\phi(s_t, a_t)) (-\nabla_\phi Q_\phi(s_t, a_t)) \right]$$

Approximate SGD implementation ($g$ is no longer an unbiased estimate of $\nabla\mathcal{L}$)

$$\phi \leftarrow \phi - \alpha (G_t^{(k)} - Q_\phi(s_t, a_t)) \nabla_\phi Q_\phi(s_t, a_t)$$

If $k < T$, then $Q_\phi(s_k, a_k)$ is used.

Using the stop-gradient operator, express k-step TD with approximate SGD as

$$\phi \leftarrow \phi - \alpha (\overline{G_t^{(k)}} - Q_\phi(s_t, a_t)) \nabla_\phi Q_\phi(s_t, a_t)$$

If $k < T$, then $Q_\phi(s_k, a_k)$ is used.

6. Deep Policy Gradient Methods

6.1. Policy Optimization

So far, we talked about policy evaluation: approximating $V^{\pi}$ and $Q^{\pi}$ given a policy $\pi$.
Next, let’s talk about policy optimization: solving

$$\max_{\theta} \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim (p_0, \pi_\theta, p)} [G_0]$$

where $\pi_\theta$ is represented by a neural network with parameter $\theta$.

6.2. Policy Gradient Fundamentals

For policy gradient methods, Assume $T < \infty$ with probability 1.
Write $\tau$ for the trajectory and use the notation

$$\mathcal{J}(\theta) = \mathbb{E}_{\tau \sim (p_0, \pi_\theta, p)} [G_0(\tau)]$$

The probability distribution of $\tau$ can be written as

$$p(\tau | \theta) = p_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t) p(r_t, s_{t+1}|s_t, a_t)$$

Policy after stopping
For notational convenience, let $\tilde{a} \in \mathcal{A}$ be a certain action and let

$$\pi_\theta(\tilde{a} | \text{}) = 1$$

I.e., once we reach the terminal state $s = \text{}$, always take the action $\tilde{a}$.
(Remember, actions are irrelevant once we reach <term>.)
Then,

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim (p_0, \pi_\theta, p)} [G_0(\tau) \nabla_\theta \log p(\tau | \theta)]$$ $$ = \mathbb{E}_{\tau \sim (p_0, \pi_\theta, p)} [G_0(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)]$$

6.3. Stochastic Gradient and Variance Reduction

So $g(\tau)$ with $\tau \sim (p_0, \pi_\theta, p)$ is an unbiased estimator of $\nabla\mathcal{J}(\theta)$.
However, current $g(\tau)$ has large variance, so we must reduce it.

Enhancement #1: Removing past rewards
We can reduce the variance by removing past rewards until time $t-1$:

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim (p_0, \pi_\theta, p)} \left[ \sum_{t=0}^{T-1} G_t(\tau) \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

Justification:
Let $X_t = G_t(\tau) \nabla_\theta \log \pi_\theta(a_t|s_t)$. Then

$$\mathbb{E} \left[ \sum_{t=0}^{T-1} G_0(\tau) \nabla_\theta \log \pi_\theta(a_t|s_t) \right] = \mathbb{E} \left[ \sum_{t=0}^{T-1} ( \sum_{k=0}^{t-1} \gamma^k r_k + \gamma^t G_t(\tau) ) \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$ $$ = \mathbb{E} \left[ \sum_{t=0}^{T-1} \gamma^t G_t(\tau) \nabla_\theta \log \pi_\theta(a_t|s_t) \right] + \mathbb{E} \left[ \sum_{t=0}^{T-1} \sum_{k=0}^{t-1} \gamma^k r_k \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

Here, $\nabla_\theta$ here examines how the probabilities for $a_t$ should change. However, a change of $a_t$ does not affect the past rewards $r_1, \ldots, r_{t-1}$. Past rewards only contribute to unnecessary variance.
By the tower property, $\mathbb{E} \left[ \sum_{k=0}^{t-1} \gamma^k r_k \nabla_\theta \log \pi_\theta(a_t|s_t) | s_t \right]$ has expectation 0 (with respect to $a_t$), so

$$\mathbb{E} \left[ \sum_{t=0}^{T-1} \sum_{k=0}^{t-1} \gamma^k r_k \nabla_\theta \log \pi_\theta(a_t|s_t) \right] = 0$$

Let us further reduce the variance.
$\square$

6.4. Enhancement #2: State-Dependent Baseline

Let $b : \mathcal{S} \to \mathbb{R}$ be a state-dependent baseline.
Then,

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim (p_0, \pi_\theta, p)} \left[ \sum_{t=0}^{T-1} (G_t(\tau) - b(s_t)) \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

is an unbiased estimator of $\nabla\mathcal{J}(\theta)$.

Why unbiased?
Let $X_t = (G_t(\tau) - b(s_t)) \nabla_\theta \log \pi_\theta(a_t|s_t)$. Then

$$\mathbb{E}[X_t | s_t] = \mathbb{E}[G_t(\tau) \nabla_\theta \log \pi_\theta(a_t|s_t) | s_t] - \mathbb{E}[b(s_t) \nabla_\theta \log \pi_\theta(a_t|s_t) | s_t]$$

By the tower property, the following is a sum of zero-mean random variables

$$\mathbb{E}[b(s_t) \nabla_\theta \log \pi_\theta(a_t|s_t) | s_t] = b(s_t) \sum_{a_t} \pi_\theta(a_t|s_t) \frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\pi_\theta(a_t|s_t)} = b(s_t) \nabla_\theta \sum_{a_t} \pi_\theta(a_t|s_t) = b(s_t) \nabla_\theta (1) = 0$$

6.5. Enhancement #3: Q-Estimates

Theorem) Let $\hat{Q}_t^{T-1}$ be a random variable such that $\mathbb{E}[\hat{Q}_t | s_t, a_t] = Q^{\pi_\theta}(s_t, a_t)$.
Let $b(s)$ be any (measurable) deterministic function of $s \in \mathcal{S}$. Then

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim (p_0, \pi_\theta, p)} \left[ \sum_{t=0}^{T-1} (\hat{Q}_t - b(s_t)) \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

Proof) We already established

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim (p_0, \pi_\theta, p)} \left[ \sum_{t=0}^{T-1} G_t(\tau) \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

Next,

$$\mathbb{E}_{\tau \sim (p_0, \pi_\theta, p)} \left[ \sum_{t=0}^{T-1} (\hat{Q}_t - G_t) \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$ $$ = \sum_{t=0}^{T-1} \mathbb{E}_{\tau \sim (p_0, \pi_\theta, p)} \left[ (\hat{Q}_t - G_t) \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

follows from

$$\mathbb{E}[(\hat{Q}_t - G_t) \nabla_\theta \log \pi_\theta(a_t|s_t) | s_t] = \mathbb{E}[\hat{Q}_t \nabla_\theta \log \pi_\theta(a_t|s_t) | s_t] - \mathbb{E}[G_t \nabla_\theta \log \pi_\theta(a_t|s_t) | s_t]$$ $$ = \sum_{a_t} \pi_\theta(a_t|s_t) \frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\pi_\theta(a_t|s_t)} \mathbb{E}[\hat{Q}_t | s_t, a_t] - \sum_{a_t} \pi_\theta(a_t|s_t) \frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\pi_\theta(a_t|s_t)} \mathbb{E}[G_t | s_t, a_t]$$ $$ = \sum_{a_t} \nabla_\theta \pi_\theta(a_t|s_t) Q^{\pi_\theta}(s_t, a_t) - \sum_{a_t} \nabla_\theta \pi_\theta(a_t|s_t) Q^{\pi_\theta}(s_t, a_t) = 0$$

$\square$

With the choice $\hat{Q}_t = Q^{\pi_\theta}$ and $b = V_\phi$

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim (p_0, \pi_\theta, p)} \left[ \sum_{t=0}^{T-1} (Q^{\pi_\theta}(s_t, a_t) - V_\phi(s_t)) \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

is an unbiased estimator of $\nabla\mathcal{J}(\theta)$ by the policy gradient theorem. Here, $\tau \sim (p_0, \pi_\theta, p)$ and $V_\phi$ is a neural network approximating $V^{\pi_\theta}$.
(This is an unbiased estimate regardless of the accuracy of $V_\phi \approx V^{\pi_\theta}$. However, we must use the exact $Q^{\pi_\theta}$ to have exact unbiasedness.)
This choice $\hat{Q}_t = Q^{\pi_\theta}$ and $b = V_\phi$ leads to small (but not optimally small) variance. Why?

6.6. Minimum-Variance Conditional Estimator

Rao–Blackwell Theorem
Theorem) Let $X$ and $Y$ be random variables. Let $\hat{I}_1(X, Y)$ be an unbiased estimator of $I$, i.e., $\mathbb{E}[\hat{I}_1(X, Y)] = I$.
Let $\hat{I}_2(Y) = \mathbb{E}[\hat{I}_1(X, Y) | Y]$. Then $\hat{I}_2(Y)$ is also an unbiased estimator of $I$ and $\text{Var}(\hat{I}_2(Y)) \le \text{Var}(\hat{I}_1(X, Y))$.
$\hat{I}_2(Y)$ is called a Rao–Blackwellized estimator of $\hat{I}_1(X, Y)$.

This motivates $\hat{Q}_t = Q^{\pi_\theta}$ with $Y = (s_t, a_t)$ and $X = (r_{t+1}, s_{t+1}, a_{t+1}, \ldots)$, i.e., $Q^{\pi_\theta}(s_t, a_t)$ is a good choice as it has all variance beyond $(s_t, a_t)$ removed through conditional expectation and therefore has low variance.

Proof) Unbiasedness of $\hat{I}_2$ follows from the tower property of expectation.
Variance bound follows from Jensen’s inequality:

$$\text{Var}(\hat{I}_1) = \mathbb{E}[\text{Var}(\hat{I}_1 | Y)] + \text{Var}(\mathbb{E}[\hat{I}_1 | Y]) = \mathbb{E}[\text{Var}(\hat{I}_1 | Y)] + \text{Var}(\hat{I}_2) \ge \text{Var}(\hat{I}_2)$$

$\square$

Minimum-variance conditional estimator lemma
Lemma) Let $s$ and $a$ be random variables. Let $w(s, a)$, $Q(s, a)$, and $b(s)$ be functions. Then

$$\mathbb{E}_{s,a} [w(s, a) (Q(s, a) - b(s))^2] \ge \mathbb{E}_{s,a} [w(s, a) (Q(s, a) - b^\star(s))^2]$$

with

$$b^\star(s) = \frac{\mathbb{E}_a [w(s, a) Q(s, a) | s]}{\mathbb{E}_a [w(s, a) | s]}$$

(When $w(s, a) = 1$, this lemma reduces to a standard result in Bayesian statistics: Conditional mean is the minimum mean squared error estimator. In fact, if $\rho(s, a)$ is a probability density function of $s, a$, then $b^\star(s)$ is the conditional mean of $Q(s, a)$ with respect to the probability distribution proportional to $w^2\rho$.)

Proof)

$$\mathbb{E}_{s,a} [w(s, a) (Q(s, a) - b(s))^2] = \mathbb{E}_s \left[ \mathbb{E}_a [w(s, a) (Q(s, a) - b(s))^2 | s] \right]$$ $$\mathbb{E}_a [w(s, a) (Q(s, a) - b(s))^2 | s] = \mathbb{E}_a [w(s, a) Q(s, a)^2 - 2w(s, a) Q(s, a) b(s) + w(s, a) b(s)^2 | s]$$ $$ = \mathbb{E}_a [w(s, a) Q(s, a)^2 | s] - 2b(s) \mathbb{E}_a [w(s, a) Q(s, a) | s] + b(s)^2 \mathbb{E}_a [w(s, a) | s]$$

This is a quadratic in $b(s)$, minimized when $b(s) = b^\star(s)$, with equality attained at $b(s) = b^\star(s)$. The step follows from the definition of $b^\star(s)$:

$$ \frac{d}{db(s)} \left( -2b(s) \mathbb{E}_a [w(s, a) Q(s, a) | s] + b(s)^2 \mathbb{E}_a [w(s, a) | s] \right) = 0$$ $$ -2 \mathbb{E}_a [w(s, a) Q(s, a) | s] + 2b(s) \mathbb{E}_a [w(s, a) | s] = 0$$ $$ b(s) = \frac{\mathbb{E}_a [w(s, a) Q(s, a) | s]}{\mathbb{E}_a [w(s, a) | s]} $$

$\square$

The minimum-variance conditional estimator lemma suggests the choice

$$b^\star(s_t) = \frac{\mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t)} [Q^{\pi_\theta}(s_t, a_t)]}{\mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t)} [1]} = \mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t)} [Q^{\pi_\theta}(s_t, a_t)] = V^{\pi_\theta}(s_t)$$

However, this choice is cumbersome as it is an entirely new quantity not used later in the estimation of $V^{\pi_\theta}$. Therefore, use the simplified surrogate

$$b(s_t) = V_\phi(s_t)$$

The choice $b = V_\phi \approx V^{\pi_\theta}$ is not optimal, but it is a reasonable proxy of the optimal choice.

6.7. Interpretation via Advantage Estimation

So, we have

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim (p_0, \pi_\theta, p)} \left[ \sum_{t=0}^{T-1} (Q^{\pi_\theta}(s_t, a_t) - V_\phi(s_t)) \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

$A^{\pi_\theta}(s_t, a_t) = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)$ is called the advantage of $a_t$ at $s_t$.

If $A^{\pi_\theta}(s_t, a_t) > 0$, then $a_t$ is a good action, better than the average action of $\pi_\theta$.
If $A^{\pi_\theta}(s_t, a_t) < 0$, then $a_t$ is a bad action, worse than the average action of $\pi_\theta$.

This gradient estimate uses $Q^{\pi_\theta}(s_t, a_t) - V_\phi(s_t)$, an approximation of the advantage.

If $a_t$ is good, $\nabla_\theta$ points in a direction to make $a_t$ more likely.
If $a_t$ is bad, $\nabla_\theta$ points in a direction to make $a_t$ less likely.

Aside: For an optimal policy $\pi^\star$,

$$A^{\pi^\star}(s_t, a_t) = Q^{\pi^\star}(s_t, a_t) - V^{\pi^\star}(s_t) \le 0$$

Proof) Hw assignment. $\square$

Without the baseline, the sign of $Q^{\pi_\theta}(s_t, a_t)$ is not directly correlated with whether $a_t$ is good or bad. (In fact, many MDPs only have positive rewards and thus have $Q^{\pi_\theta}(s_t, a_t) \ge 0$.) Rather, $\nabla_\theta$ will adjust the probability of $a_t$, and $a_t$ becomes more likely by being pushed up harder than the other actions due to the normalization of $\pi_\theta(\cdot|s_t)$.
With the baseline, $a_t$ should be push up only when $a_t$ is good.

6.8. Interpretation as an Actor-Critic Method

Loosely speaking actor-critic methods have an “actor” (i.e., policy $\pi_\theta$) and a separate “critic” that evaluates the actor’s action.
In policy gradient methods, the learned state-value function $V_\phi$ serves as the critic.

6.9. Approximating Q-Values and k-step TD

Replacing $Q^{\pi_\theta}$ with $Q_\phi$?
Should we replace $Q^{\pi_\theta}$ with $Q_\phi \approx Q^{\pi_\theta}$?
Then $g$ is a biased estimator of $\nabla_\theta\mathcal{L}(\theta)$. This is possible, but a few problems:

We must learn both $Q_\phi$ and $V_\phi$.
Replacing $G_t$ with $Q_\phi(s_t, a_t)$ will reduce the variance, but the bias will initially be large since $Q_\phi \not\approx Q^{\pi_\theta}$ initially.

Enhancement #4: k-step TD
Use k-step TD

$$\hat{Q}_t^{(k)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V_\phi(s_{t+k})$$

So with $\hat{Q}_t^{(k)}$

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim (p_0, \pi_\theta, p)} \left[ \sum_{t=0}^{T-1} (\hat{Q}_t^{(k)} - V_\phi(s_t)) \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

is an unbiased estimator of $\nabla\mathcal{J}(\theta)$, but is not implementable without $V_\phi$.
is a biased estimator of $\nabla\mathcal{J}(\theta)$.

Now, we only need to learn $V_\phi \approx V^{\pi_\theta}$. No need to separately learn $Q_\phi$.
Even when $V_\phi$ is initially wrong, then

$$\hat{Q}_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V_\phi(s_{t+k})$$

is still reasonably informative since the rewards $r_t, \ldots, r_{t+k-1}$ provide informative unbiased information of the quality of the action $a_t$, even if $V_\phi(s_{t+k})$ has large bias and is completely non-informative. (Especially so when $\gamma < 1$.)

6.10. Policy Gradient Algorithm #1

Initialize $\theta$ and $\phi$.
For $k=0, 1, \ldots$:
1. Sample a trajectory $\tau = (s_0, a_0, r_0, s_1, \ldots, s_T)$ using $\pi_\theta$.
2. For $t=0, \ldots, T-1$:
  1. Compute $\hat{Q}_t^{(k)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V_\phi(s_{t+k})$.
  2. Update $\theta \leftarrow \theta + \alpha_\theta (\hat{Q}_t^{(k)} - V_\phi(s_t)) \nabla_\theta \log \pi_\theta(a_t|s_t)$.
  3. Update $\phi \leftarrow \phi - \alpha_\phi (\hat{Q}_t^{(k)} - V_\phi(s_t)) \nabla_\phi V_\phi(s_t)$.

6.11. Stop-gradient Operator for $g_\theta$

For ease of practical implementation, use the stop-gradient operator for $g_\theta$:

$$g_\theta = (\overline{\hat{Q}_t^{(k)}} - \overline{V_\phi(s_t)}) \nabla_\theta \log \pi_\theta(a_t|s_t)$$

Note that sampling of $r_0, \ldots, r_{T-1}$ and $s_1, \ldots, s_T$ does depend on $\pi_\theta$. Therefore, it is dubious to claim $\partial\hat{Q}/\partial\theta = 0$ and $\partial V_\phi(s_t)/\partial\theta = 0$. It is best to avoid these derivatives by through the stop-gradient operator.

6.12. The $\tilde{\gamma}$-trick

When $\gamma = 1$,

$$\sum_{t=0}^{T-1} G_t(\tau) \nabla_\theta \log \pi_\theta(a_t|s_t) = \sum_{t=0}^{T-1} (G_T(\tau) - b(s_t)) \nabla_\theta \log \pi_\theta(a_t|s_t)$$

where $G_T(\tau)$ is the total undiscounted return from the trajectory.
Let $\tilde{\gamma} < 1$ but $\tilde{\gamma} \approx 1$. Let $\hat{Q}_t^{\tilde{\gamma}}$ be a random variable such that

$$\hat{Q}_t^{\tilde{\gamma}} = r_t + \tilde{\gamma} r_{t+1} + \cdots + \tilde{\gamma}^{T-t-1} r_{T-1}$$

Then

$$\nabla_\theta \mathcal{J}(\theta) \approx \mathbb{E}_{\tau \sim (p_0, \pi_\theta, p)} \left[ \sum_{t=0}^{T-1} (\hat{Q}_t^{\tilde{\gamma}} - V_\phi(s_t)) \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

Introducing the artificial discount factor $\tilde{\gamma}$ (not part of the MDP) introduces bias in the gradient estimates but can reduce the variance.
Most deep RL setups consider the undiscounted problem (so $\gamma = 1$) but introduces the artificial discount factor (so $\tilde{\gamma} < 1$).
For notational simplicity, we will not distinguish $\gamma$, the discount factor of the MDP, from $\tilde{\gamma}$, the artificial discount factor, and write $\gamma = \tilde{\gamma}$.

6.13. Policy Gradient Algorithm #2

Initialize $\theta$ and $\phi$.
For $k=0, 1, \ldots$:
1. Sample a trajectory $\tau = (s_0, a_0, r_0, s_1, \ldots, s_T)$ using $\pi_\theta$.
2. For $t=0, \ldots, T-1$:
  1. Compute $\hat{Q}_t^{(k)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V_\phi(s_{t+k})$.
  2. Update $\theta \leftarrow \theta + \alpha_\theta (\overline{\hat{Q}_t^{(k)}} - \overline{V_\phi(s_t)}) \nabla_\theta \log \pi_\theta(a_t|s_t)$.
  3. Update $\phi \leftarrow \phi - \alpha_\phi (\overline{\hat{Q}_t^{(k)}} - V_\phi(s_t)) \nabla_\phi V_\phi(s_t)$.

(no $\gamma$ factor here)

6.14. SGD with Non-Uniform Selection Rules

Let

$$g_i = \sum_{t=0}^{T_i-1} (G_t^{(i)} - V_\phi(s_t^{(i)})) \nabla_\phi V_\phi(s_t^{(i)})$$

where $N$ is fixed (non-random). If $i_k \sim \text{Uniform}(1, \ldots, N)$, then $g_{i_k}$ is a stochastic gradient (up to a factor of $1/N$) and

$$\phi_{k+1} = \phi_k - \alpha_k g_{i_k}$$

is an instance of SGD with unbiased stochastic gradients.
However, cyclic SGD

$$\theta_{k+1} = \theta_k - \alpha_k g_{\text{mod}(k,N)+1}$$

which gradient selection $g_1, \ldots, g_N, g_1, \ldots, g_N, \ldots$ is not an instance SGD with unbiased stochastic gradients. Nevertheless, cyclic SGD is commonly used in practice.

In the context of MDPs, we have $\tau \sim (p_0, \pi_\theta, p)$ with a random $T$. Then

$$\theta_{k+1} = \theta_k - \alpha_k (G_0(\tau) - b(s_0)) \nabla_\theta \log \pi_\theta(a_0|s_0)$$

is an instance of SGD with unbiased stochastic gradients, but it is an inefficient algorithm as only one of $g_0, \ldots, g_{T-1}$ is used in the update.

We can use all of $g_1, \ldots, g_T$ in the update:

$$\theta_{k+1} = \theta_k - \alpha_k \sum_{t=0}^{T-1} (G_t(\tau) - b(s_t)) \nabla_\theta \log \pi_\theta(a_t|s_t)$$

This is an instance of SGD with unbiased stochastic gradients, but it is inefficient as it updates infrequently. (In supervised learning, full-batch GD is less efficient than SGD.)

Finally,

$$\theta_{k+1} = \theta_k - \alpha_k (G_t(\tau) - b(s_t)) \nabla_\theta \log \pi_\theta(a_t|s_t)$$

is not an instance of SGD, but it updates the policy frequently and therefore is efficient despite using biased gradients.

6.15. Advantage Actor-Critic (A2C)

A2C Diagram

Mnih et al. published the asynchronous advantage actor-critic (A3C) method. A2C is a version of A3C without the multiple CPU cores performing gradient updates asynchronously.

Action space $\mathcal{A}$ may be discrete or continuous. Further details through examples.

Reference: V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, ICML, 2016.

6.16. A2C in Atari 2600

Atari 2600

Atari 2600 is a video game console released in 1977.
Bellemare et al. created an emulator environment to train and evaluate RL models on 55 different Atari games.
(Technically, using these Atari games for RL research is against copyright law as US Code Title 17, Chapter 3, Sec. 302, stipulates that the copyright of these games apply 70 years after the author's death.)

Reference: M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 2013.

Atari 2600 action space
18 possible actions:
$\uparrow, \leftarrow, \cdot, \rightarrow, \downarrow, \text{no press, button press}$
(Not all games use all 18 actions.)
60 frames per second, so up to 60 actions per second.
Standard trick: Agent repeats the same action for 4 frames, i.e., 15 actions per second.

Atari 2600 preprocessing
Remove flickering (artifact caused by the limited number of sprites Atari 2600 can display at once), make black and white, and down-scale the image resolution.
Stack 4 most recent frames. This allows the information to contain current velocity.
Note that Atari games are actually partially observable Markov decision processes (POMDP) and the game screen does not fully capture the game state. Moreover, the screen preprocessing means the agent (policy) receives only partial information.
These issues are not addressed (ignored) in the initial A3C or DQN papers.

6.17. A2C Architecture

Architecture of Mnih et al.:

Frames preprocessed into: 4x84x84
16 channel, 8x8 conv, stride 4, ReLU: 16x20x20
23 channel, 4x4 conv, stride 2, ReLU: 23x9x9
FC 256, ReLU
FC $\mathcal{A}$, softmax
$\mathcal{A}$ is the set of valid actions. Between 4 and 18 for the games.
Softmax makes the output probabilities over actions.

In finite-action-space deep RL, deep NN usually has an output for each $a \in \mathcal{A}$ (as opposed to action $a \in \mathcal{A}$ being an input to the NN).

Reference: V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, ICML, 2016.

6.18. A2C for Discrete Action Spaces

Form a neural network $f_\theta : \mathcal{S} \to \Delta_{\mathcal{A}}$, where
$\Delta_d = \{ p \in \mathbb{R}^d | p_1, \ldots, p_d \ge 0, p_1 + \cdots + p_d = 1 \}$
is the probability simplex of dimension $d$. For notational simplicify, assume actions $a_1, \ldots, a_{\mathcal{A}} \in \mathcal{A}$ are all integers. There are 2 steps in A2C to clarify.

Step 1: Evaluate $f_\theta(s_t)$ and sample $a_j$ with probability $f_\theta(s_t)_j$ for $j = 1, \ldots, \mathcal{A}$.
Step 2: Backprop on $\log f_\theta(s_t)_{a_t}$.

6.19. A2C for Continuous Action Spaces (MuJoCo)

MuJoCo (Multi-Joint dynamics with Contact) is a general purpose physics engine that simulates articulated structures.

State is the generalized coordinates and velocity of the joints.
Action represents forces exerted on the joints and is continuous.
In robotics applications, it is common for an action to correspond to a controller, and the controller has maximum and minimum inputs bounds.

Reference: E. Todorov, T. Erez, and Y. Tassa, MuJoCo: A physics engine for model-based control, IROS, 2012.

A2C applies to the setup with continuous actions. There are 2 steps in A2C to clarify.
As a concrete instance, let $\mathcal{A} \subseteq [-1, +1]$ be a continuous action space. Form a neural network $f_\theta : \mathcal{S} \to \mathbb{R}^2$ and write
$f_\theta(s) = (\mu_\theta(s), \tau_\theta(s))$

Step 1: Sample $a_t = \tanh(z_t)$ with $z_t \sim \mathcal{N}(\mu_\theta(s_t), \text{variance} = e^{2\tau_\theta(s_t)})$.
Step 2: Backpropagate on

$$\log \pi_\theta(a_t|s_t) = -\tau_\theta(s) - \frac{(\tanh^{-1}(a_t) - \mu_\theta(s_t))^2}{2e^{2\tau_\theta(s)}} + C$$

(Derivation in Hw. Follows from change of variable formula for probability density functions.)

6.20. MuJoCo Architecture

For a MuJoCo environment, $\mathcal{S} \subseteq \mathbb{R}^m$ with $m$ ranging from 4 to 100 and $\mathcal{A} \subseteq \mathbb{R}^n$ with $n$ being in a similar range.
The neural network
$f_\theta : \mathcal{S} \to \mathbb{R}^{2d}$
can be a simple ReLU MLP with $\sim 4$ layers.

6.21. Sample Efficiency

In ML, sample efficiency refers to the method’s ability to learn with fewer data points.
For RL in simulated environments (e.g. Atari 2600) sample efficiency is not a concern and only compute efficiency matters.
In many RL setups, however, samples (full trajectories or $(s_t, a_t)$ pairs) are obtained by interacting with the environment, and this can be expensive. In such cases, we prefer methods with good sample efficiency.
E.g. having a physical robot take certain actions.
E.g. an LLM writing a completion and having a human provide feedback on whether the completion is good.

6.22. Surrogate Objectives and Trust Regions

Can we learn more from an episode?
A2C uses an episode $\tau$ to perform $T$ SGD updates. Can we learn more from $\tau$?
Can we be more sample efficient?
We do so via the surrogate objective.
Let $\theta$ and $\theta_0$ be the policy parameters. Consider $\mathcal{J}(\theta)$ relative to $\mathcal{J}(\theta_0)$:

$$\mathcal{J}(\theta) - \mathcal{J}(\theta_0) = \mathbb{E}_{\tau \sim (p_0, \pi_\theta, p)} [G_0(\tau)] - \mathbb{E}_{\tau \sim (p_0, \pi_{\theta_0}, p)} [G_0(\tau)]$$ $$ = \mathbb{E}_{\tau \sim (p_0, \pi_{\theta_0}, p)} \left[ \frac{p(\tau|\theta)}{p(\tau|\theta_0)} G_0(\tau) \right] - \mathbb{E}_{\tau \sim (p_0, \pi_{\theta_0}, p)} [G_0(\tau)]$$ $$ = \mathbb{E}_{\tau \sim (p_0, \pi_{\theta_0}, p)} \left[ \left( \prod_{t=0}^{T-1} \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_0}(a_t|s_t)} - 1 \right) G_0(\tau) \right]$$

Surrogate objective derivation
Then,

$$\mathcal{J}(\theta) - \mathcal{J}(\theta_0) = \mathbb{E}_{\tau \sim (p_0, \pi_{\theta_0}, p)} \left[ \sum_{t=0}^{T-1} \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_0}(a_t|s_t)} A^{\pi_{\theta_0}}(s_t, a_t) \right]$$

So,

Let

$$\mathcal{K}(\theta; \theta_0) = \mathbb{E}_{\tau \sim (p_0, \pi_{\theta_0}, p)} \left[ \sum_{t=0}^{T-1} \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_0}(a_t|s_t)} A^{\pi_{\theta_0}}(s_t, a_t) \right]$$

Then, for $\theta \approx \theta_0$.
$\mathcal{J}(\theta) \approx \mathcal{K}(\theta; \theta_0) + C$. Here, $C$ represents constants independent of $\theta$.
In fact, $\nabla_\theta \mathcal{J}(\theta_0) = \nabla_\theta \mathcal{K}(\theta_0; \theta_0)$, although $\mathcal{J}(\theta) \ne \mathcal{K}(\theta; \theta_0)$.

Surrogate is accurate up to 1st order
Our prior analysis for the policy gradient methods gave us

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim (p_0, \pi_\theta, p)} \left[ \sum_{t=0}^{T-1} A^{\pi_\theta}(s_t, a_t) \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

Differentiating under the expectation gives us

$$\nabla_\theta \mathcal{K}(\theta; \theta_0) = \mathbb{E}_{\tau \sim (p_0, \pi_{\theta_0}, p)} \left[ \sum_{t=0}^{T-1} \frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\pi_{\theta_0}(a_t|s_t)} A^{\pi_{\theta_0}}(s_t, a_t) \right]$$

Therefore, $\nabla_\theta \mathcal{J}(\theta_0) = \nabla_\theta \mathcal{K}(\theta_0; \theta_0)$.

For $\theta \approx \theta_0$, if we sample IID trajectories $\tau^{(1)}, \ldots, \tau^N \sim (p_0, \pi_{\theta_0}, p)$, then

$$\mathcal{J}(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T_i-1} \frac{\pi_\theta(a_t^{(i)}|s_t^{(i)})}{\pi_{\theta_0}(a_t^{(i)}|s_t^{(i)})} \hat{A}_t^{(i)}$$

where $\hat{A}_t \approx A^{\pi_{\theta_0}}(s_t, a_t)$ is an advantage estimate. The approximation $(*)$ is accurate when $N$ is large and when $\pi_\theta \approx \pi_{\theta_0}$. (In general, importance sampling estimators become inaccurate when the sampling distribution $\pi_{\theta_0}$ is too far from the nominal distribution $\pi_\theta$.)

We therefore maximize the surrogate objective subject to a certain trust-region constraint:

$$\max_\theta \mathcal{K}(\theta; \theta_0) \quad \text{s.t.} \quad D_{\text{KL}}(\pi_{\theta_0} || \pi_\theta) \le \delta$$

(The trust-region constraint is needed for 2 reasons: $\mathcal{K}(\theta; \theta_0) \approx \mathcal{J}(\theta)$ and $(*)$.)

6.23. Trust-Region Policy Optimization (TRPO)

Trust-region policy optimization (TRPO) solves a sequence of trust-region optimization problems to improve the policy.
The trust region is defined by the KL-divergence of the policies.
Note that $\hat{A}_t = \hat{Q}_t - V_\phi(s_t)$ depends on $\theta_{\text{curr}}$ through $\hat{Q}_t$ and $\phi$ through $V_\phi$. (We discuss the choice of $\hat{A}_t$ soon.)
The “solve” involves performing an approximate Newton method (which resembles a natural gradient method) with $H^{-1}g$ solved via a conjugate gradient (CG) solver. (We skip the details.)

Reference: J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, Trust region policy optimization, ICML, 2015.

6.24. Proximal Policy Optimization (PPO)

Implementing TRPO hard work due to the trust-region formulation. Also, it is unclear whether the 2nd-order optimization of TRPO is efficient, since most deep learning formulations are optimized via 1st-order optimization algorithms (SGD).
Proximal policy optimization (PPO) returns to a 1st-order formulation while keeping the trust-region idea.
The PPO paper presents “PPO-Penalty” and “PPO-Clip”. We will talk about the simpler PPO-Clip.
(PPO-Penalty uses a penalty, rather than a constrained, version of TRPO with the KL-divergence added to the objective as a regularizer.)

Reference: J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, Proximal policy optimization algorithms, arXiv, 2017.

Clipped surrogate objective
The clipped surrogate objective in PPO is

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_{\tau \sim (p_0, \pi_{\theta_k}, p)} \left[ \sum_{t=0}^{T-1} \min \left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)} \hat{A}_t, \text{clip}\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)}, 1-\epsilon, 1+\epsilon\right) \hat{A}_t \right) \right]$$

Interpretation: We increase/maximize only by a small factor. This removes the incentive to move $\theta$ far away from $\theta_k$.
The loss is equivalent to

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_{\tau \sim (p_0, \pi_{\theta_k}, p)} \left[ \sum_{t=0}^{T-1} \left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)} \hat{A}_t - \max\left(0, \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)} \hat{A}_t - (1+\epsilon)\hat{A}_t \right) - \max\left(0, (1-\epsilon)\hat{A}_t - \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)} \hat{A}_t \right) \right) \right]$$

Use the clipped loss
The optimization subproblem is

$$\max_\theta \mathcal{L}^{\text{CLIP}}(\theta)$$

solved with first-order methods like SGD or Adam with early stopping.
The trust-region constraint is implicitly enforced by the clipping (no motivation to improve too much) and by performing few SGD iterations.
(We discuss the choice of $\hat{A}_t$ soon.)

PPO: Discussion
In Schulman et al., $\epsilon = 0.2$ is used.
In the maximization objective, $\mathcal{L}^{\text{CLIP}}$ is viewed as loss formed with $NT$ data points, where $T$ is the average of $T_i$. The maximization performs $\approx 10$ epochs over the $NT$ data points.
Strictly speaking, TRPO and PPO are not policy gradient methods, but they can be viewed as variants/enhancements of deep policy gradient methods.

Reference: J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, Proximal policy optimization algorithms, arXiv, 2017.

6.25. Bias-Variance Tradeoff

Bias-variance tradeoff of $\hat{A}_t$
Assume, for the sake of argument, we have access to $V^{\pi_{\text{curr}}}$. You will show in hw that

$$G_t, \quad r_t + \gamma V^{\pi_{\text{curr}}}(s_{t+1}), \quad r_t + \gamma r_{t+1} + \gamma^2 V^{\pi_{\text{curr}}}(s_{t+2}), \quad \ldots$$

all have the as the same mean, but the variance reduces (Rao–Blackwell) with smaller $k$.

In practice, we replace $V^{\pi_{\text{curr}}}$ with $V_\phi$
The estimators with $k < \infty$ are no longer unbiased.

Although there is no precise analysis, we still expect the variance to reduce with smaller $k$.
We now have a bias-variance tradeoff with $k$:

Small $k$: Large bias but small variance.
Large $k$: Small bias but large variance.

What about the dependency on $\gamma$? Consider using an artificial discount factor $\tilde{\gamma}$ for an undiscounted MDP:

$$\hat{Q}_t^{\tilde{\gamma}} = r_t + \tilde{\gamma} r_{t+1} + \cdots + \tilde{\gamma}^{T-t-1} r_{T-1}$$

In terms of expectations, $\tilde{\gamma} = 1$ is the correct choice (if $k = \infty$ and $\gamma = 1$, then unbiased) and using $\tilde{\gamma} < 1$ introduces bias. With $\tilde{\gamma} < 1$, however, the later rewards contribute less and they contribute less towards the variance.
We now have a bias-variance tradeoff with $\tilde{\gamma}$:

Small $\tilde{\gamma}$: Large bias but small variance.
Large $\tilde{\gamma}$: Small bias but large variance.

6.26. Generalized Advantage Estimation (GAE)

One challenge with tuning $k$ is that it cannot be continuously tuned. (What if you want to use $k$ larger than 4 and smaller than 5?)
Generalized Advantage Estimation (GAE) uses an exponentially weighted average of all $\hat{A}_t^{\text{TD}(k)}$ estimators based on an approach analogous to a classical technique called TD($\lambda$).

Reference: J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel, High-dimensional continuous control using generalized advantage estimation, ICLR, 2016.

Generalized Advantage Estimation (GAE)
Given a $V(s)$ (meant to approximate $V^{\pi}(s)$), let

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

Then, by a telescoping-sum argument, we have

$$\sum_{l=0}^{k-1} \gamma^l \delta_{t+l} = r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}) - V(s_t)$$

Next, define the GAE estimator with $\lambda \in [0,1]$ as

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \delta_{t+l}$$

where $(*)$ is due to a geometric-sum argument. (HW)

Generalized Advantage Estimation (GAE) (cont.)
Note, if $\lambda = 0$, we recover the TD(1) estimator. If $\lambda = 1$, we recover the TD($\infty$) estimator.
If $V = V^{\pi}$, then GAE is an unbiased estimator. In homework, you will show that

$$\mathbb{E}[\hat{A}_t^{\text{GAE}(\gamma, \lambda)} | s_t, a_t] = Q^{\pi}(s_t, a_t) - V(s_t)$$

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim (p_0, \pi_\theta, p)} \left[ \sum_{t=0}^{T-1} \hat{A}_t^{\text{GAE}(\gamma, \lambda)} \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$

The $\lambda$ of GAE is a continuous parameter serving a similar role to the $k$ of TD.
When $V_\phi = V^{\pi}$, any value of $\lambda$ incurs no bias. (Unlike $\gamma$, which always causes a bias when $\gamma < 1$.) But when $V_\phi \ne V^{\pi}$, then large $\lambda$ allows the telescoping structure to reduce bias (when $\lambda = \gamma = 1$ there is no bias) while for small $\lambda$ the bias is larger.
We now have a bias-variance tradeoff with $\gamma$ and $\lambda$:

Small $\gamma$: Large bias but small variance.
Large $\gamma$: Small bias but large variance.
Small $\lambda$: Large bias but small variance.
Large $\lambda$: Small bias but large variance.

TRPO and PPO uses GAE for the advantage estimator $\hat{A}_t$.
A common choice of values: $\gamma = 0.995$ and $\lambda = 0.96$.

6.27. Policy Advantage and Policy Iteration

Define the policy advantage of $\pi$ over $\pi_k$ as

$$\mathcal{A}(\pi, \pi_k) = \mathbb{E}_{s \sim \rho^{\pi_k}} \left[ \sum_a \pi(a|s) A^{\pi_k}(s, a) \right]$$

Note that

$$\mathcal{A}(\pi, \pi_k) = \mathbb{E}_{s \sim \rho^{\pi_k}} \left[ \sum_a \pi(a|s) Q^{\pi_k}(s, a) - V^{\pi_k}(s) \right] = \mathbb{E}_{s \sim \rho^{\pi_k}} [V^{\pi}(s) - V^{\pi_k}(s)]$$

Interpretation: Follow a trajectory $\tau$ generated $\pi_k$, and we ask what actions $\pi$ would have chosen. $\mathcal{A}(\pi, \pi_k)$ quantifies how better, as measured by $A^{\pi_k}$, these actions are compared to the actions chosen by $\pi_k$.

Theorem) A policy $\pi_k$ is optimal if and only if $\mathcal{A}(\pi, \pi_k) = 0$ for all policies $\pi$.
Proof) XXX $\square$

PI as policy advantage maximization
We can equivalently express PI as policy advantage maximization.
Policy iteration (PI):

$$\pi_{k+1} = \arg\max_\pi \mathcal{A}(\pi, \pi_k)$$

(Strictly speaking, the equivalence requires that $\tau \sim (p_0, \pi_k, p)$ has positive probability to visit every state $s \in \mathcal{S}$.)
Proof) $\max_\pi \mathcal{A}(\pi, \pi_k)$ is maximized when $\pi(s) = \arg\max_{a \in \mathcal{A}} A^{\pi_k}(s, a)$. This is exactly PI. $\square$

6.28. Approximate Policy Iteration (TRPO and PPO)

Implementing PI with IS estimates
In practice, the expectation of $\mathcal{A}(\pi, \pi_k)$ has no closed form formula, so we approximate it with importance sampling (IS) estimates. We sample $\tau^{(1)}, \ldots, \tau^N \sim (p_0, \pi_k, p)$ IID trajectories and form the estimator:

$$\hat{\mathcal{A}}(\pi, \pi_k) = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^{T_i-1} \frac{\pi(a_t^{(i)}|s_t^{(i)})}{\pi_k(a_t^{(i)}|s_t^{(i)})} \hat{A}_t^{(i)}$$

using advantage estimators $\hat{A}_t^{(i)}$ satisfying $\mathbb{E}[\hat{A}_t^{(i)} | s_t^{(i)}, a_t^{(i)}] = A^{\pi_k}(s_t^{(i)}, a_t^{(i)})$.
When $N$ is large, we expect $\hat{\mathcal{A}}(\pi, \pi_k) \approx \mathcal{A}(\pi, \pi_k)$?

TRPO and PPO as approximate PI
We can interpret TRPO and PPO as an approximate policy iteration (PI). Consider the following approximate PI:

$$\pi_{k+1} = \arg\max_\pi \mathcal{K}(\pi; \pi_k)$$

where the argmax is computed approximately.
This fails because it requires $\pi_{\theta_{k+1}} \approx \pi_{\theta_k}$. In general, importance sampling estimators become inaccurate when the sampling distribution $\pi_{\theta_k}$ is not too far from the nominal distribution $\pi_{\theta_{k+1}}$. Therefore, a trust-region or proximal mechanism is needed, and this leads to TRPO as PPO.
(The first derivation covered earlier is the derivation from the TRPO and PPO papers. This is an alternate interpretation of the methods.)

6.29. Avoiding the Baseline Function

In TRPO and PPO, we must learn two neural networks: $\pi_\theta$ and $V_\phi$.
Learning the baseline function (also called the critic function) $V_\phi$ is an additional layer of complexity, and it would be better if we can avoid it.
But let us recall why $V_\phi$ is needed: We want to assign appropriate sign of the gradient signals based on whether the action is better than what the current policy prescribes.

PPO without baseline
Consider the undiscounted MDP with no $\gamma$-trick. Assume a non-zero reward is collected only at the end of the episode. In this case, $\hat{Q}_t^{\text{TD}(\infty)} = r$ for $t = 0,1, \ldots, T-1$, where $r = r_{T-1}$ is the reward obtained at the end, as the episode terminates.
Then, PPO without a baseline function would be:

$$\max_\theta \mathbb{E}_{\tau \sim (p_0, \pi_{\theta_k}, p)} \left[ \sum_{t=0}^{T-1} \min \left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)} r, \text{clip}\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)}, 1-\epsilon, 1+\epsilon\right) r \right) \right]$$

Without the baseline, this will not work! Variance of the objective will be too large.

6.30. Group Relative Policy Optimization (GRPO)

Again consider the undiscounted MDP with no $\gamma$-trick. Assume a non-zero reward is collected only at the end of the episode.
Group Relative Policy Optimization (GRPO) replaces the baseline function with a BatchNorm-style normalization of the rewards.

$$\mathcal{L}^{\text{GRPO}}(\theta) = \mathbb{E}_{\tau \sim (p_0, \pi_{\theta_k}, p)} \left[ \sum_{t=0}^{T-1} \min \left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)} \hat{A}_t^{\text{GRPO}}, \text{clip}\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)}, 1-\epsilon, 1+\epsilon\right) \hat{A}_t^{\text{GRPO}} \right) \right]$$

where $\hat{A}_t^{\text{GRPO}} = \frac{r_t - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r}) + \epsilon_{\text{std}}}$.

Reference: Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo, DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, arXiv, Feb. 2024.

Advantage estimate interpretation
The advantage $A^{\pi}(s_t, a_t)$ measures how good is action $a_t$ compared to the average action selected by $\pi$.
The GRPO advantage estimate is not an unbiased estimate of $A^{\pi}(s_t, a_t)$.
Qualitatively, however, $\hat{A}_i^{\text{GRPO}}$ measures similar information: The trajectory with action $a_t^{(i)}$ yielded reward $r^{(i)}$, and how good is this compared to the other actions selected by $\pi$, which yielded rewards $\mathbf{r}$?
DeepSeek’s GitHub uses $\epsilon = 10^{-4}$. If $\text{std}(\mathbf{r}) = 0$, then $\hat{A}_i^{\text{GRPO}} = 0$ and no update happens.

GRPO: Caveats
GRPO should not yet be considered a general deep RL method. Its effectiveness has not been tested outside of the LLM applications.
GRPO is not the first modern deep RL method that forgoes a baseline or critic function.
It is not obvious that GRPO is the best (or a good enough) algorithm to do RL without a baseline or critic model, even restricted to LLMs. GRPO got a lot of attention due to its use in training DeepSeek-R1, but better approaches may replace GRPO in future work. Improving upon GRPO will likely be an active area of research.

Reference: Z. Li, T. Xu, Y. Zhang, Z. Lin, Y. Yu, R. Sun, Z.-Q. Luo, ReMax: A simple, effective, and efficient reinforcement learning method for aligning large language models, ICML, 2024.

GRPO: More details
Consider this coverage a quick and incomplete introduction to GRPO.
We will revisit GRPO when we use it for LLMs. In that context we will incorporate a KL penalty and we will normalize trajectories with their sequence lengths.
GRPO also has a version for MDPs with rewards at intermediate steps.

7. MDP Basics II: Value Iteration

(Missing material added retroactively)

7.1. Value Iteration (VI)

V-value iteration:

$$V_{k+1}(s) = \max_a \sum_{r,s'} p(r,s'|s,a) [r + \gamma V_k(s')]$$

Q-value iteration:

$$Q_{k+1}(s, a) = \sum_{r,s'} p(r,s'|s,a) [r + \gamma \max_{a'} Q_k(s', a')]$$

Theorem) For $\gamma \in (0,1)$, both iterations converge with rate

$$||V_k - V^\star||_\infty \le \gamma^k ||V_0 - V^\star||_\infty$$

for $k = 0,1, \ldots$. (For $\gamma = 1$, convergence is not guaranteed.)
Proof) This follows from $\gamma$-contractiveness of $\mathcal{B}^\star$. $\square$
VI is usually not implemented in its exact form. But VI serves as a conceptual framework for designing and understanding many practical deep RL algorithms such as DQN.

7.2. Accelerated Value Iteration

Aside) VI is suboptimal, and it can be accelerated with anchored value iteration (AncVI):

$$U_{k+1} = \frac{1}{1+\beta_k} U_k + \frac{\beta_k}{1+\beta_k} \mathcal{B}^\star U_k$$

for $k = 0,1, \ldots$, where $\beta_k = \frac{1}{\sum_{i=0}^k \gamma^{-2i}}$ and $U_0$ is an initial point. $U_k = V_k$ or $U_k = Q_k$.

Theorem) If $U_0 \le \mathcal{B}^\star U_0$, then AncVI exhibits the rate

$$||U_k - U^\star||_\infty \le \frac{4}{k+1} \frac{1}{1-\gamma} ||U_0 - U^\star||_\infty$$

for $k = 0,1, \ldots$. This rate is faster than that of VI, and it matches a complexity lower bound up to a constant factor of 4 (and thus is optimal).
This result suggests that methods designed as approximations of vanilla VI may be theoretically suboptimal (in the sense of worst-case guarantees).

Reference: J. Lee and E. K. Ryu, Accelerating value iteration with anchoring, NeurIPS, 2023.

8. Policy Iteration

8.1. Policy Iteration Algorithm

Policy iteration (PI) is a classical algorithm that alternates policy evaluation and policy improvement steps.
Start with $\pi_0$.
For $k = 0,1, \ldots$:

Compute $V^{\pi_k}$ and/or $Q^{\pi_k}$ (Policy Evaluation)
Compute $\pi_{k+1}(s) = \arg\max_a Q^{\pi_k}(s, a)$ (Policy Improvement)

PI is usually not implemented in its exact form. But PI serves as a conceptual framework for designing and understanding many practical deep RL algorithms such as PPO.

8.2. Policy Improvement Theorem

Theorem) Consider the PI iteration. Assume $\gamma \in (0,1)$, $\mathcal{S} < \infty$, $\mathcal{A} < \infty$, and $r \le R < \infty$ almost surely. Then,

$$V^{\pi_{k+1}} \ge V^{\pi_k}$$

for $k = 0,1, \ldots$. Furthermore, there is a $K \in \mathbb{N}$ such that $V^{\pi_k} = V^\star$ and $\pi_k$ is an optimal policy for all $k \ge K$. (Policies become optimal after finitely many steps.)

Proof) To be covered XXX $\square$

9. Challenges with Undiscounted MDPs ($\gamma=1$)

It is instructive to understand when undiscounted ($\gamma = 1$) MDPs lead to pathologies.

Scenario 1: There is no problem. (There is an optimal policy $\pi^\star$ and $V^\star = \mathcal{B}^\star V^\star = V^{\pi^\star}$.)
This is the case when $\mathcal{S}$ and $\mathcal{A}$ are finite and $V^{\pi}$ is well defined and finite for all policy $\pi$.
E.g. If you only get a reward at the end of the episode (and never get any reward if the episode never terminates), then everything is well defined. The RL-LLM setting without KL-penalty falls under this case.

Reference: M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

Scenario 2: You can get $+\infty$ reward by exploiting a cycle with positive reward.
This it not a problem with discounted MDPs; when $\gamma < 1$, even if the optimal policy exploits a positive-reward cycle, the optimal cumulative discounted return will be finite.

Infinite Mario Loop

One resolution is to consider the average-reward MDP formulation, instead of the undiscounted total return MDP. An average-reward MDP maximizes $R^{\pi}$:

$$R^{\pi} = \lim_{T \to \infty} \mathbb{E}_{\tau \sim (p_0, \pi, p)} \left[ \frac{1}{T} \sum_{t=0}^{T-1} r_t \right]$$

Analyzing average-reward MDPs tends to be more technical than the total return MDPs, and it is an active area of research.

References:
M. Zurek and Y. Chen, Span-based optimal sample complexity for weakly communicating and general average reward MDPs, NeurIPS, 2024.
J. Lee and E. K. Ryu, Optimal non-asymptotic rates of value iteration for average-reward Markov decision processes, ICLR, 2025.

Scenario 3: Total return (infinite sum) may not be well defined.
For example, if $r_0, r_1, r_2, r_3, r_4, r_5, \ldots = +1, -1, +1, -1, +1, -1, \ldots$, then the total return is not summable.
A reasonable remedy would be to consider a Cesàro (Abel) sum or the liminf of the partial sums, but this is a complication we will not go into.

10. Why PPO for RL-LLM?

Most RL-LLM methods use PPO or variants of PPO. Why not other choices?

Key properties of the RL-LLM setup:

LLMs have finite action space (# of possible tokens). Action space is not continuous.
A strong pre-trained large language model $\pi_\theta$, which serves as a suboptimal but reasonably good initial policy for the RL, is absolutely crucial. Tabula rasa RL methods (randomly initialized $\pi_\theta$) do not work.
We expect an RL method for RL-LLM to work only if it can effectively utilize the pre-trained LLM $\pi_\theta$.

Why not other deep RL algorithms? (DQN, Rainbow DQN)
DQN and Rainbow DQN are deep RL methods parameterizing $Q_\phi$ as a neural network and learning $Q_\phi \approx Q^\star$ through the Bellman optimality equation.
The policy $\pi_\phi$ is implicitly derived from the greedy rule $\pi_\phi(s) = \arg\max_{a \in \mathcal{A}} Q_\phi(s, a)$. There is no explicitly parameterized policy. The strength of $\pi_\phi$ hinges on the accuracy of $Q_\phi \approx Q^\star$.
There does not seem to be a good way for DQN to utilize a pre-trained policy.

References:
V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, Playing Atari with deep reinforcement learning, arXiv, 2013.
M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, H. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, Rainbow: Combining improvements in deep reinforcement learning, AAAI, 2018.

Why not other deep RL algorithms? (DDPG, TD3, SAC)
DDPG, TD3, and SAC are methods with dual interpretations as deterministic policy gradient methods (qualitatively quite different from the stochastic policy gradient methods) and Q-learning methods. These methods are designed for continuous action spaces, although the variant ‘SAC discrete’ applies to discrete action spaces.
These methods train both $\pi_\theta$ and $Q_\phi$. Different from PPO, $\pi_\theta$ is not trained directly to maximize reward. Rather $\pi_\theta$ is trained to satisfy the relation $\pi_\theta(s) \approx \arg\max_{a \in \mathcal{A}} Q_\phi(s, a)$.
In PPO, rewards directly influence the updates of $\pi_\theta$. In DDPG, TD3, and SAC, rewards influence $Q_\phi$ which in turns influences $\pi_\theta$. So, the updates on $\pi_\theta$ depend on the rewards only through $Q_\phi$.
Without a good initialization for $Q_\phi \approx Q^\star$, a pre-trained $\pi_\theta$ alone seems difficult to utilize.

References:
S. Fujimoto, H. van Hoof, and D. Meger, Addressing function approximation error in actor-critic methods, ICML, 2018.
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, Continuous control with deep reinforcement learning, ICLR, 2016.
T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, ICML (and arXiv v2), 2018.
P. Christodoulou, Soft actor-critic for discrete action settings, arXiv, 2019.

11. 2-Player Zero-Sum Games

In a 2-player zero-sum game, two players compete against each other and one player’s reward is precisely the loss of the other player.

In a simultaneous move game, the 2 players make their moves simultaneously. E.g., rock paper scissors.

In a sequential (turn-based) move game, the 2 players make their moves in turns. E.g., chess and go. However, we will think of sequential move games as simultaneous move games where the two players each offer policies $\pi_1$ and $\pi_2$ at the start of the game. (In real life game competitions, a human offers their brain at the start of the game.) These policies $\pi_1$ and $\pi_2$ can strategize and adapt to the opponents’ moves. They can also make randomized decisions.

11.1. Minimax Optimization

In a minimax optimization problem, we minimize with respect to one variable and maximize with respect to another:

$$\min_{\theta_1} \max_{\theta_2} R(\theta_1, \theta_2)$$

We say $\theta_1^\star, \theta_2^\star$ is a solution to the minimax problem if it is a Nash equilibrium:

$$R(\theta_1^\star, \theta_2^\star) \le R(\theta_1, \theta_2^\star) \quad \text{and} \quad R(\theta_1^\star, \theta_2^\star) \ge R(\theta_1^\star, \theta_2)$$

In other words, unilaterally deviating from $\theta_1^\star$ decreases the value of $R$ while unilaterally deviating from $\theta_2^\star$ increases the value of $R$.
Standard RL is posed as a maximization problem. However, adversarial training and two-player zero-sum games are posed as minimax optimization problems.

Example: Rock paper scissors
Consider the game of rock paper scissors with randomized strategies $p_{\theta_1}$ and $p_{\theta_2}$ and expected payoff $R(\theta_1, \theta_2)$:

$$R(\theta_1, \theta_2) = \sum_{i,j \in \{\text{rock, paper, scissors}\}} M_{ij} \mu(\theta_1)_i \mu(\theta_2)_j$$

where $\mu$ is the softmax function. Of course, the Nash equilibrium occurs at

$$\mu(\theta_1) = \mu(\theta_2) = (1/3, 1/3, 1/3)$$

11.2. Minimax Optimization with Gradients

In deep learning, we solve minimax optimization algorithms with first-order methods using stochastic gradients.
However, convergence of minimax optimization should not be taken for granted, and much more delicate care is needed than min optimization. The training of GANs is famously tricky. We will also see that the minimax training of the rock paper scissors examples is a highly unstable process.
RL training is already more unstable than supervised learning. Multi-agent RL (2-agent in our case) is even more delicate than non-RL minimax training.

Simultaneous gradient ascent-descent
Simultaneous gradient ascent-descent (SimGAD) is one of the simplest minimax optimization algorithms, but it fails to converge on rock paper scissors. In fact, SimGAD is expected to diverge on any zero-sum game.

$$\theta_1^{k+1} = \theta_1^k - \alpha \nabla_{\theta_1} R(\theta_1^k, \theta_2^k)$$ $$\theta_2^{k+1} = \theta_2^k + \alpha \nabla_{\theta_2} R(\theta_1^k, \theta_2^k)$$

Cycling dynamics where players countering the counter: Player 1 plays rock $\to$ Player 2 plays paper $\to$ Player 1 plays scissors $\to$ Player 2 plays rock $\to$ Player 1 plays paper $\to \cdots$

11.3. Extragradient (EG) Method

For the extragradient (EG) method, imagine you and the opponent make regular 1-step updates, evaluate the gradient from there, and use that gradient to commit to the update. Closely related to learning with opponent-level awareness (LOLA).
EG does converge for 2-player zero-sum games.

$$\tilde{\theta}_1^k = \theta_1^k - \alpha \nabla_{\theta_1} R(\theta_1^k, \theta_2^k)$$ $$\tilde{\theta}_2^k = \theta_2^k + \alpha \nabla_{\theta_2} R(\theta_1^k, \theta_2^k)$$ $$\theta_1^{k+1} = \theta_1^k - \alpha \nabla_{\theta_1} R(\tilde{\theta}_1^k, \tilde{\theta}_2^k)$$ $$\theta_2^{k+1} = \theta_2^k + \alpha \nabla_{\theta_2} R(\tilde{\theta}_1^k, \tilde{\theta}_2^k)$$

References:
G. M. Korpelevich, The extragradient method for finding saddle points and other problems, Ekonomika i Matematicheskie Metody, 1976.
J. N. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch, Learning with opponent-learning awareness, AAMAS, 2018.

11.4. Anchoring and Weight Decay

Anchored simultaneous gradient ascent-descent adds the “anchor” mechanism that can also be understood as weight decay. (Anchor doesn’t have to be at 0, but weight decay shrinks toward 0.) This does converge on 2-player zero-sum games.

$$\theta_1^{k+1} = \theta_1^k - \alpha (\nabla_{\theta_1} R(\theta_1^k, \theta_2^k) + \lambda (\theta_1^k - \theta_1^0))$$ $$\theta_2^{k+1} = \theta_2^k + \alpha (\nabla_{\theta_2} R(\theta_1^k, \theta_2^k) + \lambda (\theta_2^k - \theta_2^0))$$

References:
E. K. Ryu, K. Yuan, and W. Yin, Ode analysis of stochastic gradient methods with optimism and anchoring for minimax problems, arXiv, 2019.
T. Yoon and E. K. Ryu, Accelerated algorithms for smooth convex-concave minimax problems with O(1/k^2) rate on squared gradient norm, ICML, 2021.

11.5. Antisymmetric Payoff Games

We say a 2-player zero-sum game has antisymmetric payoff if

$$R(\theta_1, \theta_2) = -R(\theta_2, \theta_1)$$

In this case, the Nash equilibrium at $\theta^\star, \theta^\star$ for some $\theta^\star$ and

$$\nabla_{\theta_1} R(\theta, \theta) = - \nabla_{\theta_2} R(\theta, \theta)$$

Also,

$$\nabla_{\theta_1} R(\theta, \theta) = \nabla_{\theta_2} R(\theta, \theta)$$

Therefore, SimGAD simplifies to

$$\theta^{k+1} = \theta^k - \alpha \nabla_{\theta_1} R(\theta^k, \theta^k)$$

Antisymmetric payoff games (cont.)
SimGAD with weight decay simplifies to

$$\theta^{k+1} = \theta^k - \alpha (\nabla_{\theta_1} R(\theta^k, \theta^k) + \lambda (\theta^k - \theta^0))$$

So a gradient ascent is done on player 1 with player 1 is playing against itself. (Player 2 is a copy of player 1.)

11.6. Chess and Go

Chess and go are 2-player zero-sum perfect information games with antisymmetric payoff.
Technically, the games are not perfectly (anti)symmetric, since one player moves first. (We can make game symmetric if first move is given to the players randomly.)
However, because the game is mostly symmetric, we will train one agent to play both players. So, there is only one policy $\pi_\theta$.
(In a highly asymmetric 2-player game, you would train 2 policies for each player.)

12. Recent Developments in LLMs

Shortly after the previous lecture, OpenAI announced its new memory features in ChatGPT. Other chatbot services will likely follow-up with similar features. LLMs now have long-term memory, and this will profoundly affect how we interact with LLMs.

Reference: https://x.com/polynoamial/status/1910379351759347860

13. Richard Sutton Wins 2025 Turing Award

In the first lecture of this course, I introduced Richard Sutton’s essay The Bitter Lesson. Shortly after this lecture, it was announced that Sutton was selected for the 2025 Turing Award, the highest academic honor in computer science.

Aside from his essay, Richard Sutton’s pioneering contributions to RL directly influenced much of the material covered in this course.

While our appreciation of Sutton’s work is not dependent on this award, I believe it is worthwhile to mention this recent development and retroactively acknowledge it here.

14. AlphaGo, Test-Time Compute, and Expert Iteration

14.1. Introduction to AlphaGo

Chess, shogi, and go
2-player perfect-information zero-sum turn-based games.

The top human chess player was defeated in 1997 by Deep Blue.
The top human shogi player was defeated in 2017 by Elmo.
The top human go player was defeated in 2016 by AlphaGo.

AlphaGo and AlphaGo Zero
AlphaGo and AlphaGo Zero combine search and learning to achieve super-human play.
These methods represent landmark scientific accomplishments in AI, and contain insights generalizable beyond the domains of games.
However, the exact techniques of MCTS has not yet been successfully adapted to the setup of LLM reasoning.

References:
D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, Mastering the game of Go with deep neural networks and tree search, Nature, 2016.
D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, Mastering the game of Go without human knowledge, Nature 2017.

14.2. AlphaGo Training Steps

AlphaGo training step 1: $\pi^{\text{IL}}_\theta$
Train policy $\pi^{\text{IL}}_\theta$ with imitation learning. From expert play record (기보), construct dataset
$\mathcal{D} = \{ (s, a) | \text{board position } s \text{ and next move } a \text{ from expert games} \}$
Then, train the policy to predict the expert players’ next move with loss

$$\mathcal{L}(\theta) = \sum_{(s,a) \in \mathcal{D}} \ell_{\text{CE}}(\pi^{\text{IL}}_\theta(a|s), a)$$

Neural network architecture is a 13-layer conv net. (Architecture is later improved in AlphaGo Zero.)
$\pi^{\text{IL}}_\theta$ cannot yet beat expert humans.

AlphaGo training step 2: $\pi^{\text{RL}}_\theta$
Train policy $\pi^{\text{RL}}_\theta$ through self-play. (i) Initialize $\pi^{\text{IL}}_\theta = \pi^{\text{RL}}_\theta$. (ii) Play $\pi^{\text{RL}}_\theta$ vs. $\pi^{\text{RL}}_{\theta^-}$ where $\theta^-$ is an earlier version of the parameter $\theta$. Choosing $\theta^-$ from a collection of past values stabilize training. Let $z = \pm 1$ indicate whether $\pi^{\text{RL}}_\theta$ won. (iii) Perform undiscounted policy gradient update on

$$\mathcal{J}(\theta) = \mathbb{E}_{\tau \sim (p_0, \pi^{\text{RL}}_\theta, p)} [z]$$

where first move (black or white stone) is randomized. Obtain an unbiased estimate of $\nabla_\theta \mathcal{J}(\theta)$ with the deep policy gradient method gradient

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}_{\tau \sim (p_0, \pi^{\text{RL}}_\theta, p)} \left[ z \sum_{t=0}^{T-1} \nabla_\theta \log \pi^{\text{RL}}_\theta(a_t|s_t) \right]$$

Here, the $(s_t, a_t)_{t=0}^{T-1}$ is the board states provided to and actions taken by $\pi^{\text{RL}}_\theta$.
(Of course, $\pi^{\text{RL}}_\theta$ should not be penalized or rewarded by actions taken by $\pi^{\text{RL}}_{\theta^-}$.)

Same neural network architecture as $\pi^{\text{IL}}_\theta$.
$\pi^{\text{RL}}_\theta$ vs $\pi^{\text{IL}}_\theta$ wins 80%, but still cannot defeat human experts.
In principle, if NN is very large and self-play is done for very long, then $\pi^{\text{RL}}_\theta$ should converge to perfection (and beat all humans.)
Under practical compute limitations, we need something more.

14.3. AlphaGo Training Step 3: Value Network $V_\phi$

Train value network $V_\phi \approx V^{\pi^{\text{RL}}_\theta} \approx V^\star$ use standard Monte Carlo policy evaluation method.
Self-play with the strongest policy $\pi^{\text{RL}}_\theta$, and perform stochastic gradient descent with

$$\mathcal{L}(\phi) = \mathbb{E}_{\tau \sim (p_0, \pi^{\text{RL}}_\theta, p)} \left[ \sum_{t=0}^{T-1} (z - V_\phi(s_t))^2 \right]$$

What about $Q$? Should we also learn $Q_\phi \approx Q^{\pi^{\text{RL}}_\theta} \approx Q^\star$?
Note, the dynamics is deterministic, with the transitioned state $s'$ given by a function $f : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$. So with $s' = f(s, a)$, we have

$$Q^\star(s, a) = -V^\star(f(s, a))$$

where the negative sign reflects the fact that the turn passes to the opponent, and both the Q- and V-value functions are defined with respect to the player whose turn it is to move.

14.4. AlphaGo Training Step 4: Fast Rollout Policy $\pi^{\text{fast}}_\psi$

Train a faster rollout policy $\pi^{\text{fast}}_\psi$ with imitation learning.
Because $\pi^{\text{fast}}_\psi$ uses a lightweight architecture, it has significantly faster inference time: $\sim 2$ microseconds for $\pi^{\text{fast}}_\psi$ for compared to $\sim 3$ milliseconds for $\pi^{\text{RL}}_\theta$.
So more than 1000x faster.
Not a strong policy, but it does understand the most basic principles.

14.5. Neural-Net-Only Play

We have two reasonable options for neural-net-only play.
Option 1. Play with $\pi^{\text{RL}}_\theta$.
Option 2. Play with the greedy policy $\arg\max_{a \in \mathcal{A}} Q_\phi(s, a)$.
In principle, raw neural nets could beat humans with an exorbitant amount of compute, and we can estimate how much this would be. (More on this later.) However, under practical compute limitations, pure neural-net-based play is not enough.

14.6. Pure Tree Search

Consider all possible future board states that can be played. This approach is called minimax tree search, since I try to maximize my reward and my opponent tries to minimize my reward.
At every board state, the optimal action is

$$\arg\max_{a \in \mathcal{A}} Q^\star(s, a) = \arg\min_{a \in \mathcal{A}} V^\star(f(s, a))$$

Without learning, we don’t know $V^\star(s)$ for most states. However, if $s$ is terminal, then we know $V^\star(s)$ based on who won the game. So, expand the tree until the game ends, and recursively backtrack (dynamic programming) to find moves that take you to a winning board state.
This pure search-based method does not require any learning. However, this strategy is infeasible for moderately sized games because the computation size exponentially blows up.

14.7. Human Thinking: Systems 1 and 2

It is instructive to reflect on how we humans play.
Human players have intuition on a set of reasonable moves and how advantageous a board state is. This intuition is corresponds to system 1 thinking and is analogous to what neural networks learn.
Humans do not immediately act on these instincts, and instead deliberate through the future ramifications of moves. This deliberation corresponds to system 2 thinking and it analogous to search.
Human deliberation is not exhaustive. We do not consider all possible actions (limited width), and we do not mentally simulate until the end of the game (limited depth).
Alphago performs neural guided search through MCTS; Use neural networks to guided and focus the search on the relevant region of the game space.

14.8. Monte Carlo Tree Search (MCTS)

Monte Carlo Tree Search (MCTS) performs a neural guided search, selectively exploring parts of the tree based on the guidance of neural networks.

Principle #1: Gradually build up the tree, making it wider and deeper we exhaust the given computational budget.

Principle #2: Control the width of the tree by only considering “good” actions, defined as actions $a \in \mathcal{A}$ such that
(a) $\pi^{\text{IL}}_\theta(a|s)$ or $\pi^{\text{RL}}_\theta(a|s)$ is high (actions that $\pi_\theta$ would play)
(b) $Q_\phi(s, a)$ is high (actions that $Q_\phi$ thinks is good)
(c) we have not yet deliberated on (consider a set of actions instead of focusing on the presumed top choice).
Solution: At every node $s_t$, choose the action $a_t$ based on

$$a_t = \arg\max_a \left( Q_\phi(s_t, a) + \rho \pi_\theta(a|s_t) \frac{\sqrt{\sum_a N(s_t, a)}}{1 + N(s_t, a)} \right)$$

where $\rho > 0$ is a hyperparameter, $\pi = \pi^{\text{IL}}_\theta$ or $\pi = \pi^{\text{SL}}_\theta$, and $N(s_t, a)$ is the number of times the action has already been considered in the tree search.

Principle #3: Starting at $s_t$ consider a sequence of states based on the actions selected as previously described:
$s_t \mapsto s_{t+1} \mapsto s_{t+2} \mapsto \cdots \mapsto s_L$
We will have multiple leaf nodes $s_L$, and $s_L$ will likely not reach the end of the game. (Lookahead is $\sim 30$ moves, i.e., $\sim 60$ plies, deep.) So, truncate the depth of the search by approximating $V^\star(s_L)$ in the following two ways:
(a) Evaluate value network $V^\star(s_L) \approx V_\phi(s_L)$
(b) Play $N$ rollouts (until the end of the game) starting from $s_L$ using the fast policy $\pi^{\text{fast}}_\psi$:

$$G_L^{\text{rollout}} = \sum_{k=L}^{T-1} \gamma^{k-L} r_k$$

14.9. Summary of MCTS

Consider moves based on intuition encoded by $\pi^{\text{IL}}_\theta$ or $\pi^{\text{RL}}_\theta$ and $V_\phi$ and construct a lookahead tree.
Assess the assess the strength of the position $V_\phi$ and by quickly playing until the end of the game using $\pi^{\text{fast}}_\psi$.
Commit to the best action after this deliberation.

14.10. MCTS Bibliography

The Monte Carlo Tree Search (MCTS) of AlphaGo is really the classical MCTS + additional enhancements.
The ideas of MCTS was first described in

B. Abramson. The Expected-Outcome Model of Two-Player Games, Columbia University Ph.D. Thesis, 1987.

and was first applied to the game of go in

B. Brügmann, Monte Carlo Go, 1993.

The name “MCTS” was first coined in

R. Coulom, Efficient selectivity and backup operators in Monte-Carlo tree search, Computers and Games, 2006.

The action selection rule is called Upper Confidence bounds applied to Trees (UCT) in analogy to the UCB selection rule for the multi-armed bandit setting. UCT was proposed by

K. Levente and C. Szepesvári, Bandit based Monte-Carlo planning, European Conference on Machine Learning, 2006.

14.11. Why MCTS?

Elo Score Graph

The MCTS algorithm is complicated. Why not do pure RL and just use $\pi^{\text{RL}}_\theta$?

Without, $\pi^{\text{IL}}_\theta$, the RL training of $\pi^{\text{RL}}_\theta$ makes no progress.
Ablation studies show that rollouts (using $\pi^{\text{fast}}_\psi$), value network ($V_\phi$), and policy network ($\pi^{\text{IL}}_\theta$ or $\pi^{\text{RL}}_\theta$) are all necessary.

Reference: D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, Mastering the game of Go with deep neural networks and tree search, Nature, 2016.

14.12. Improving AlphaGo to AlphaGo Zero

Problem: Can we eliminate the reliance on imitation learning and human expert play data?
Technical improvements:

The neural network (NN) architecture of AlphaGo relied on design principles that were outdated by the time the AlphaGo paper was published. Use better architecture?
AlphaGo training does not use search. MCTS is employed at inference time to produce a stronger agent. Can we use this stronger agent during training?
These improvements led to AlphaGo Zero.

Reference: D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, Mastering the game of Go without human knowledge, Nature 2017.

Improved architecture of AlphaGo Zero
AlphaGo used 13-layer conv nets for $\pi$ and $V$.
AlphaGo Zero used a 40-layer convnet with BatchNorm and residual connections. Also, the policy $\pi^{\text{EI}}_\theta$ and value function $V_\theta$ are two heads with a shared base:
$(\pi^{\text{EI}}_\theta, V_\theta) = f(s)$
Sharing the base makes sense because the two networks $\pi^{\text{EI}}_\theta$ and $V_\theta$ would use similar features.
Takeaway: Neural network architecture matters.

References:
S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, ICML, 2015.
Kaiming He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, CVPR, 2016.

14.13. AlphaGo Zero Training: Expert Iteration

Perform MCTS with $\pi^{\text{EI}}_\theta$ and $V_\theta$. Perform self-play and improve $(\pi^{\text{EI}}_\theta, V_\theta) = f(s)$.
(No rollouts are used, so $\pi^{\text{fast}}_\psi$ is not needed.)

Expert iteration:

Self-play with MCTS = $(\pi^{\text{EI}}_\theta, V_\theta)$ for many games.
- Form action dataset $\mathcal{D}_a = \{(s_t, a_t)\}$
- Form win/loss dataset $\mathcal{D}_w = \{(s_t, z)\}$
Train $\pi^{\text{EI}}_\theta$ on $\mathcal{D}_a$ and $V_\theta$ on $\mathcal{D}_w$.

Policy $\pi^{\text{EI}}_\theta$ is trained to imitate the MCTS policy, which is stronger than $\pi^{\text{EI}}_\theta$.
Key insight of expert iteration: Given a NN, improve the policy with NN+search, and train the NN to mimic the improved policy. Can be thought of as extensions of imitation learning and policy iteration. Much faster learning compared to pure RL.

Reference: The name expert iteration was coined in concurrent work by Anthony, Tian, and Barber: T. Anthony, Z. Tian, and D. Barber, Thinking fast and slow with deep learning and tree search, NeurIPS, 2017.

14.14. AlphaGo Zero Results

The expert iteration of AlphaGo Zero significantly simplifies the approach of AlphaGo.
Results: Much stronger policy trained with no human expert data or any handcrafted human knowledge.

14.15. Computer Poker and Test-Time Scaling

The tale of computer poker

Cepheus. M. Bowling, N. Burch, M. Johanson, O. and Tammelin, Heads-up limit hold'em poker is solved, Science, 2015.
Libratus. Spectacular victory in public match against professional players in 2017. N. Brown and T. Sandholm, Superhuman AI for heads-up no-limit poker: Libratus beats top professionals, Science, 2018.
Pluribus. Spectacular victory in public match against professional players in 2019. N. Brown and T. Sandholm, Superhuman AI for multiplayer poker, Science, 2019.
- Jason Les: "very hopeless. You don't feel like there’s anything you can do to win.”
- Chris Ferguson: "Pluribus is a very hard opponent to play against. It's really hard to pin him down on any kind of hand.”
- Jimmy Chou: "Whenever playing the bot, I feel like I pick up something new to incorporate into my game."

Test-time scaling with computer poker
While the poker bots utilize learning, the capability is primarily due to search.
The search methodology is quite different from MCTS. How to carry out search is often domain specific.
By leveraging search (test-time compute), one can achieve capabilities that would otherwise require an exorbitant amount of train-time compute.

References:
N. Brown and T. Sandholm, Safe and nested subgame solving for imperfect-information games, NeurIPS, 2017.
N. Brown, Learning to Reason with LLMs, Talk at Simons Institute, Sept. 26, 2024.

14.16. Train-time vs. Test-time Compute

In fact, no raw neural network has yet to defeat top human players in go.
About 1000x train-time compute would be needed based on a back-of-the-envelope computation.

Reference: N. Brown, Learning to Reason with LLMs, Talk at Simons Institute, Sept. 26, 2024.

Mathematically, why does test-time compute work?
A pre-trained policy must handle all possible game states. Finding a perfect policy, therefore, amounts to solving the entire game upfront.
Tree search considers only the game states that can occur given the current state. I.e., test-time compute solves for a much smaller subset of the overall game.

14.17. Takeaways from Games

By leveraging test-time compute, one can spend extra compute on the given problem instance and deliberate to a degree that would be impossible in train-time.
In domains with verifiable answers (win or lose in the case of games), expert iteration can significantly accelerate training compared to pure RL.

15. Obituary: Daniel Kahneman

In the first lecture of this course, I introduced Daniel Kahneman’s work describing the System 1 and System 2 thinking of humans, and the notion of System 1 and System 2 thinking will come up again several times in this course. Since then, I have become aware of the following news.

Daniel Kahneman died by assisted suicide on March 27, 2024, three weeks after his 90th birthday, in Switzerland, though the manner and location of his death was only revealed in March 2025 (after the first lecture of this course was given). He wrote in an email:

“I have believed since I was a teenager that the miseries and indignities of the last years of life are superfluous, and I am acting on that belief. I am still active, enjoying many things in life (except the daily news) and will die a happy man. But my kidneys are on their last legs, the frequency of mental lapses is increasing, and I am ninety years old. It is time to go.”

Kahneman is most known and recognized for his work on the psychology of judgment and decision-making. (Slightly different subject from System 1 and System 2 thinking.) In the previous lecture, I may have led you to think that Kahneman is an active researches, so I want to correct that. Further, I feel that there is something profound about the last decision made by the world’s leading thinker on decisions.

Reference: The last decision by the world’s leading thinker on decisions, Wall Street Journal, March 14, 2025.

Deep Reinforcement Learning: Overview

Table of Contents

1. MDP Basics

1.1. Markov Decision Process Definition

1.2. MDP Base Definitions

1.3. State vs. Observation

1.4. MDP Generalizations

1.5. Terminal Time Notation

2. Imitation Learning and Behavior Cloning

2.1. Behavior Cloning

2.2. Distribution Shift in Behavior Cloning

2.3. Off-policy vs. On-policy Learning

2.4. DAgger (Dataset Aggregation)

2.5. Causal Confusion

3. MDP Objective and Value Functions

3.1. Maximizing Expected Discounted Return

3.2. State Value Function $V^\pi$ and State-Action Value Function $Q^\pi$

3.3. Preliminaries: Banach Fixed Point Theorem

3.4. Bellman Equation for $V^\pi$

3.5. Bellman Equation for $Q^\pi$

4. Optimal Policies and Value Functions

4.1. Optimal Policy and Value Functions

4.2. Bellman Optimality Equation for $V^\star$

4.3. Bellman Optimality Equation for $Q^\star$

5. Policy Evaluation vs. Optimization

5.1. Policy Evaluation: Monte Carlo (MC)

5.2. Policy Evaluation: Temporal Difference (TD)

5.3. Stop-Gradient Operator

5.4. Semi-Gradient Method

5.5. k-step TD Policy Evaluation

5.6. MC vs. TD (Bootstrapping)

5.7. $Q_\phi \approx Q^\pi$ with MC

5.8. $Q_\phi \approx Q^\pi$ with TD + (Approx) SGD

5.9. $Q_\phi \approx Q^\pi$ with k-step TD + (Approx) SGD

6. Deep Policy Gradient Methods

6.1. Policy Optimization

6.2. Policy Gradient Fundamentals

6.3. Stochastic Gradient and Variance Reduction

6.4. Enhancement #2: State-Dependent Baseline

6.5. Enhancement #3: Q-Estimates

6.6. Minimum-Variance Conditional Estimator

6.7. Interpretation via Advantage Estimation

6.8. Interpretation as an Actor-Critic Method

6.9. Approximating Q-Values and k-step TD

6.10. Policy Gradient Algorithm #1

6.11. Stop-gradient Operator for $g_\theta$

6.12. The $\tilde{\gamma}$-trick

6.13. Policy Gradient Algorithm #2

6.14. SGD with Non-Uniform Selection Rules

6.15. Advantage Actor-Critic (A2C)

6.16. A2C in Atari 2600

6.17. A2C Architecture

6.18. A2C for Discrete Action Spaces

6.19. A2C for Continuous Action Spaces (MuJoCo)

6.20. MuJoCo Architecture

6.21. Sample Efficiency

6.22. Surrogate Objectives and Trust Regions

6.23. Trust-Region Policy Optimization (TRPO)

6.24. Proximal Policy Optimization (PPO)

6.25. Bias-Variance Tradeoff

6.26. Generalized Advantage Estimation (GAE)

6.27. Policy Advantage and Policy Iteration

6.28. Approximate Policy Iteration (TRPO and PPO)

6.29. Avoiding the Baseline Function

6.30. Group Relative Policy Optimization (GRPO)

7. MDP Basics II: Value Iteration

7.1. Value Iteration (VI)

7.2. Accelerated Value Iteration

8. Policy Iteration

8.1. Policy Iteration Algorithm

8.2. Policy Improvement Theorem

9. Challenges with Undiscounted MDPs ($\gamma=1$)

10. Why PPO for RL-LLM?

11. 2-Player Zero-Sum Games

11.1. Minimax Optimization

11.2. Minimax Optimization with Gradients

11.3. Extragradient (EG) Method

11.4. Anchoring and Weight Decay

11.5. Antisymmetric Payoff Games

11.6. Chess and Go