Part I · Foundations Week 8 Published ppo.py test_ppo.py

Actor-Critic, GAE, PPO, and TRPO

Turning REINFORCE into a stable, sample-reusing optimizer: actor-critic with a learned baseline, generalized advantage estimation as a bias–variance dial, and trust regions (TRPO/PPO) as step-size control in policy space. Why the clipped surrogate works, and why implementation details decide the score.

On this page

Actor-critic
Generalized advantage estimation
Trust regions: TRPO and PPO
Implementation matters
The dynamic-programming bridge
What’s next
Exercises
Companion code

Actor-Critic, GAE, PPO, and TRPO

Where we are. REINFORCE (Chapter 7) was unbiased but high-variance and strictly on-policy: one noisy gradient step per batch of fresh trajectories, then throw the data away. This chapter turns it into the workhorse of modern on-policy RL by adding three things: a learned critic as the baseline (actor-critic), generalized advantage estimation to tune the advantage’s bias–variance, and a trust region (TRPO/PPO) that bounds how far each update moves the policy — which is what finally lets a batch be reused for several epochs. The roadmap’s framing is the through-line: trust regions are step-size control in policy space.

Chapter 8 — at a glance

Goal. Build actor-critic from Chapter 7’s advantage; derive the GAE recursion and its bias–variance dial; understand TRPO’s constrained step and PPO’s clipped surrogate as two ways to bound the policy update; and see why implementation details dominate reported performance.

Reading time. ~35 minutes; ~55 with the proof and exercises.

Key insight — the DP bridge. Actor-critic is generalized policy iteration (Chapter 1) with approximation: the critic is approximate policy evaluation, the gradient actor is approximate improvement. Policy iteration took the full greedy jump; here improvement is a small, trust-region-controlled step, because the advantage estimate is only trustworthy near the policy that produced the data. GAE’s $\lambda$ is the same bias–variance dial as $n$ -step TD (Chapter 4).

Actor-critic

Chapter 7 ended with the advantage form of the policy gradient, $\nabla_\theta J = \E[\sum_t \nabla_\theta\log\policy_\theta(A_t\mid S_t)\,\advantage(S_t,A_t)]$ . Actor-critic makes the baseline a learned function: an actor $\policy_\theta$ and a critic $\valuefn_\phi$ trained to predict returns, with the advantage estimated from the critic.

Advantage actor-critic (A2C) updates both together — the critic by regressing

\valuefn_\phi

toward the returns, the actor by ascending the score weighted by the critic’s advantage. Konda and Tsitsiklis Konda & Tsitsiklis (2000) established the two-timescale convergence of actor-critic with a linear critic; the critic plays the role of approximate policy evaluation in a sampled, function-approximated generalized policy iteration.

Generalized advantage estimation

How should the advantage be estimated? The one-step TD residual $\delta_t = R_{t+1} + \discount\valuefn_\phi(S_{t+1}) - \valuefn_\phi(S_t)$ is itself a low-variance, high-bias advantage estimate; the full Monte Carlo advantage $\return_t - \valuefn_\phi(S_t)$ is the reverse. Generalized advantage estimation Schulman et al. (2016) interpolates with an exponential weighting.

Proposition 8.1 (The GAE recursion).

The generalized advantage estimator

\advantage^{\text{GAE}(\discount,\lambda)}_t \defeq \sum_{l\ge 0}(\discount\lambda)^l\,\delta_{t+l}

satisfies the backward recursion

\advantage^{\text{GAE}}_t = \delta_t + \discount\lambda\,\advantage^{\text{GAE}}_{t+1}.

Its endpoints are $\lambda=0$ , giving $\advantage_t=\delta_t$ (low variance, high bias), and $\lambda=1$ , giving $\advantage_t=\sum_{l\ge0}\discount^l\delta_{t+l}=\return_t-\valuefn_\phi(S_t)$ (the Monte Carlo advantage: high variance, low bias).

Proof.

Split the defining sum at its first term:

\sum_{l\ge 0}(\discount\lambda)^l\delta_{t+l} = \delta_t + \sum_{l\ge 1}(\discount\lambda)^l\delta_{t+l} = \delta_t + \discount\lambda\sum_{l\ge 0}(\discount\lambda)^l\delta_{t+1+l} = \delta_t + \discount\lambda\,\advantage^{\text{GAE}}_{t+1}.

For $\lambda=1$ , $\sum_{l\ge0}\discount^l\delta_{t+l}$ telescopes: writing $\delta_{t+l} = R_{t+l+1}+\discount\valuefn_\phi(S_{t+l+1})-\valuefn_\phi(S_{t+l})$ , the value terms cancel in pairs, leaving $\sum_{l\ge0}\discount^l R_{t+l+1} - \valuefn_\phi(S_t) = \return_t-\valuefn_\phi(S_t)$ . $\qquad\blacksquare$

The recursion is what the companion computes in one backward pass.

Intermediate

\lambda

(commonly

\approx 0.95

) usually beats both endpoints — the same lesson as

n

-step TD, now applied to the advantage that weights the policy gradient.

Trust regions: TRPO and PPO

REINFORCE and A2C take one gradient step per batch because the advantage estimate is only valid near the policy that generated the data; a large step can collapse performance. The fix is to bound the update — to control the step size in policy space rather than parameter space.

TRPO Schulman et al. (2015) makes this literal: maximize the importance-weighted surrogate $\E[\rho_t\,\advantage_t]$ , with ratio $\rho_t = \policy_\theta(A_t\mid S_t)/\policy_{\text{old}}(A_t\mid S_t)$ , subject to a trust-region constraint $\E[\mathrm{KL}(\policy_{\text{old}}\,\|\,\policy_\theta)] \le \delta$ . The KL ball is the trust region; within it the linearized objective is reliable.

PPO Schulman et al. (2017) replaces the hard constraint with a cheaper clipped surrogate,

L^{\text{CLIP}}(\theta) = \E\big[\min\big(\rho_t\,\advantage_t,\ \mathrm{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\,\advantage_t\big)\big].

The $\min$ makes $L^{\text{CLIP}}$ a pessimistic lower bound on the unclipped surrogate: when an advantage is positive, the objective stops rewarding increases in $\rho_t$ once $\rho_t > 1+\epsilon$ (its gradient there is zero); when negative, it stops once $\rho_t < 1-\epsilon$ .

Either way there is no incentive to push the policy far from

\policy_{\text{old}}

in a single update, so PPO can safely take several epochs of minibatch updates per batch — the sample reuse REINFORCE lacked.

Implementation matters

A sobering empirical fact closes the family. Engstrom et al. Engstrom et al. (2020) showed that much of PPO’s advantage over TRPO comes not from the clipped objective but from code-level details — advantage normalization, value-function clipping, reward scaling, learning-rate annealing, orthogonal initialization — applied alongside it. Together with the broader reproducibility findings of Henderson et al. Henderson et al. (2018) , the lesson is that on-policy deep RL results must be read with the implementation, not just the algorithm, in view. The companion therefore tests the components (the GAE recursion, advantage normalization, the clip) and checks learning on a simple task, rather than chasing a benchmark number.

The dynamic-programming bridge

Actor-critic completes the generalized-policy-iteration picture under approximation. The critic is approximate policy evaluation (Chapter 1’s $\valuefn_\policy$ , learned from samples); the gradient actor is approximate improvement. Where policy iteration took the full greedy jump to $\argmax_a \qfn$ , the actor takes a small step, and the trust region (TRPO’s KL ball, PPO’s clip) is exactly the step-size control that keeps improvement reliable when evaluation is noisy and local. Two bridges:

To continuous control (Week 9). Everything here is on-policy and stochastic; Week 9 turns to off-policy continuous control (DDPG, TD3, SAC), trading sample reuse-by-replay for the on-policy stability bought here by trust regions.
To optimal control (Part II). A trust region on the update is a regularized step — the policy-space analogue of a line search or a Levenberg–Marquardt damping in optimization, and a cousin of MPC’s receding horizon as a bound on how far ahead a single decision commits.

What’s next

Week 9 leaves on-policy methods for off-policy continuous control: the deterministic policy gradient (DDPG), its twin-critic fix (TD3), and maximum-entropy RL (SAC) — where the entropy bonus of Chapter 7 becomes the objective.

Exercises

(Derive) Derive the GAE recursion $\advantage^{\text{GAE}}_t = \delta_t + \discount\lambda\,\advantage^{\text{GAE}}_{t+1}$ from the exponential sum, and show $\lambda=1$ gives the Monte Carlo advantage (Prop. 8.1).

Solution
Split the sum at $l=0$ : $\sum_{l\ge0}(\discount\lambda)^l\delta_{t+l} = \delta_t + \discount\lambda\sum_{l\ge0}(\discount\lambda)^l\delta_{t+1+l} = \delta_t + \discount\lambda\advantage^{\text{GAE}}_{t+1}$ . At $\lambda=1$ the value terms in $\sum_l\discount^l\delta_{t+l}$ telescope to $\return_t-\valuefn_\phi(S_t)$ .
(Prove) Show $L^{\text{CLIP}}$ is a lower bound on the unclipped surrogate $\E[\rho_t\advantage_t]$ , and identify where its gradient with respect to $\rho_t$ is zero.

Solution
$\min(\rho\advantage, \mathrm{clip}(\rho)\advantage) \le \rho\advantage$ pointwise, so the expectation is a lower bound. For $\advantage>0$ the clip caps the term at $(1+\epsilon)\advantage$ once $\rho>1+\epsilon$ , where $\partial/\partial\rho = 0$ ; for $\advantage<0$ it floors at $(1-\epsilon)\advantage$ once $\rho<1-\epsilon$ , again with zero gradient. Inside $[1-\epsilon,1+\epsilon]$ the gradient is $\advantage$ .
(Compute) With $\discount=0.99$ , $\lambda=0.95$ , and TD residuals $\delta = (1.0, 0.5, -0.2)$ at the end of an episode (bootstrap zero), compute $\advantage^{\text{GAE}}$ .

Solution
Backward: $\advantage_2 = -0.2$ ; $\advantage_1 = 0.5 + 0.99\cdot0.95\cdot(-0.2) = 0.3119$ ; $\advantage_0 = 1.0 + 0.99\cdot0.95\cdot0.3119 = 1.2933$ . (The companion’s compute_gae reproduces these.)
(Implement) In the companion, verify the GAE recursion matches the exponential sum and that $\lambda=1$ equals returns minus values; that advantage normalization produces mean-0/unit-variance advantages; that the PPO clip flattens the objective outside $[1-\epsilon,1+\epsilon]$ ; and that PPO learns CartPole above the random baseline.

Solution
See experiments/python/week08/test_ppo.py: compute_gae vs the brute-force $\sum(\discount\lambda)^l\delta$ ; the $\lambda=1$ telescoping identity; the normalization statistics; the clip’s value/gradient outside the range; and a seeded PPO CartPole run clearing the ~22 random baseline.
(Extend) Sweep the PPO clip range $\epsilon$ and add KL early stopping; compare stability across seeds.

Solution
Smaller $\epsilon$ tightens the trust region (more stable, slower); larger loosens it (faster, riskier). KL early stopping halts the epoch loop once the policy has moved a target KL from $\policy_{\text{old}}$ , a direct enforcement of the trust region the clip only approximates — the companion’s --clip/--target-kl flags expose both.

Companion code

The Week-8 companion lives at experiments/python/week08/ and is PyTorch A2C and PPO on CartPole-v1, sharing one actor-critic network.

ppo.py — compute_gae (the backward recursion), an ActorCritic network, the ppo_clip_objective, and a PPO/A2C training loop with advantage normalization and a configurable clip range and epoch count.
test_ppo.py — component-correctness tests (compute_gae vs the closed-form exponential sum; the $\lambda=1$ = returns − values identity; advantage normalization; the PPO clip’s value and zero gradient outside $[1-\epsilon,1+\epsilon]$ ) plus a seeded PPO CartPole run learning above the random baseline.

# component tests + a seeded CartPole learning check (PyTorch; ~1-2 min on CPU)
PYTHONPATH=. pytest experiments/python/week08/test_ppo.py -q

# worked PPO training run on CartPole
PYTHONPATH=. python experiments/python/week08/ppo.py --updates 150