Actor-Critic, GAE, PPO, and TRPO
Turning REINFORCE into a stable, sample-reusing optimizer: actor-critic with a learned baseline, generalized advantage estimation as a bias–variance dial, and trust regions (TRPO/PPO) as step-size control in policy space. Why the clipped surrogate works, and why implementation details decide the score.
Where we are. REINFORCE (Chapter 7) was unbiased but high-variance and strictly on-policy: one noisy gradient step per batch of fresh trajectories, then throw the data away. This chapter turns it into the workhorse of modern on-policy RL by adding three things: a learned critic as the baseline (actor-critic), generalized advantage estimation to tune the advantage’s bias–variance, and a trust region (TRPO/PPO) that bounds how far each update moves the policy — which is what finally lets a batch be reused for several epochs. The roadmap’s framing is the through-line: trust regions are step-size control in policy space.
Actor-critic
Chapter 7 ended with the advantage form of the policy gradient, $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t)\big]$. Actor-critic makes the baseline a learned function: an actor $\pi_\theta(a \mid s)$ and a critic $V_\phi(s)$ trained to predict returns, with the advantage estimated from the critic. Advantage actor-critic (A2C) updates both together: the critic by regressing toward the returns, the actor by ascending the score $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ weighted by the critic’s advantage. Konda & Tsitsiklis (2000) established the two-timescale convergence of actor-critic with a linear critic; the critic plays the role of approximate policy evaluation in a sampled, function-approximated generalized policy iteration.
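As a rough illustration of the A2C update described above, the two losses for one batch can be written out in plain Python. This is a minimal sketch, not the companion’s PyTorch code: the function name `a2c_losses`, the per-state logits, and the use of the empirical return as the critic target are assumptions made for the example, and the advantage is treated as a constant weight in the actor term.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def a2c_losses(logits_batch, actions, values, returns):
    """Return (actor_loss, critic_loss) averaged over a batch.

    logits_batch: per-state action logits from the actor (hypothetical input)
    actions:      index of the action taken in each state
    values:       critic predictions V(s_t)
    returns:      empirical returns G_t, used as the critic's regression target
    """
    n = len(actions)
    actor_loss = 0.0
    critic_loss = 0.0
    for logits, action, value, ret in zip(logits_batch, actions, values, returns):
        advantage = ret - value  # constant weight for the actor term
        log_prob = math.log(softmax(logits)[action])
        actor_loss += -log_prob * advantage  # minimizing this ascends log pi * A
        critic_loss += (value - ret) ** 2    # squared-error regression to returns
    return actor_loss / n, critic_loss / n


if __name__ == "__main__":
    actor, critic = a2c_losses(
        logits_batch=[[0.0, 0.0], [1.0, -1.0]],
        actions=[0, 1],
        values=[0.5, 0.2],
        returns=[1.5, -0.3],
    )
    print(f"actor loss = {actor:.4f}, critic loss = {critic:.4f}")
```

In an autodiff framework the same idea usually means detaching the advantage before it multiplies the log-probability, so that the actor loss does not push gradients into the critic.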
Generalized advantage estimation
How should the advantage be estimated? The one-step TD residual $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is itself a low-variance, high-bias advantage estimate; the full Monte Carlo advantage $G_t - V(s_t)$ is the reverse. Generalized advantage estimation (Schulman et al., 2016) interpolates between the two with an exponential weighting.
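Proposition 8.1. The generalized advantage estimator
$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$
satisfies the backward recursion
$$\hat{A}_t = \delta_t + \gamma\lambda\, \hat{A}_{t+1}.$$
Its endpoints are $\lambda = 0$, giving $\hat{A}_t = \delta_t$ (low variance, high bias), and $\lambda = 1$, giving $\hat{A}_t = G_t - V(s_t)$ (the Monte Carlo advantage: high variance, low bias).
Proof. Split the defining sum at its first term:
$$\hat{A}_t = \delta_t + \sum_{l=1}^{\infty} (\gamma\lambda)^l\, \delta_{t+l} = \delta_t + \gamma\lambda \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+1+l} = \delta_t + \gamma\lambda\, \hat{A}_{t+1}.$$
For $\lambda = 1$, $\sum_{l \ge 0} \gamma^l \delta_{t+l}$ telescopes: writing $\delta_{t+l} = r_{t+l} + \gamma V(s_{t+l+1}) - V(s_{t+l})$, the value terms cancel in pairs (the tail vanishes because the value after the end of the episode is zero), leaving $\sum_{l \ge 0} \gamma^l r_{t+l} - V(s_t) = G_t - V(s_t)$.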
The recursion is what the companion computes in one backward pass. Intermediate $\lambda$ (commonly around $0.95$) usually beats both endpoints: the same lesson as $n$-step TD, now applied to the advantage that weights the policy gradient.
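The backward pass and its two endpoints can be checked numerically. The sketch below is a standard-library illustration under stated assumptions, not the companion’s `compute_gae`: the helper names (`td_residuals`, `gae_backward`, `gae_exponential_sum`, `discounted_returns`) are hypothetical, the episode is assumed to terminate (bootstrap value zero after the last step), and the rewards and values are made-up inputs.

```python
def td_residuals(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with V = 0 after the last step."""
    next_values = list(values[1:]) + [0.0]
    return [r + gamma * nv - v for r, nv, v in zip(rewards, next_values, values)]


def gae_backward(deltas, gamma, lam):
    """A_t = delta_t + gamma * lam * A_{t+1}, computed in one backward pass."""
    advantages = [0.0] * len(deltas)
    running = 0.0  # the advantage beyond the end of the episode is zero
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


def gae_exponential_sum(deltas, gamma, lam):
    """Brute-force sum_l (gamma * lam)^l * delta_{t+l}, truncated at episode end."""
    horizon = len(deltas)
    return [
        sum((gamma * lam) ** l * deltas[t + l] for l in range(horizon - t))
        for t in range(horizon)
    ]


def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, with G = 0 after the last step."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0, 1.0]   # illustrative inputs only
    values = [0.5, -0.2, 1.0, 0.3]
    gamma = 0.9
    deltas = td_residuals(rewards, values, gamma)
    print("lambda=0.95:", gae_backward(deltas, gamma, 0.95))
    print("lambda=1   :", gae_backward(deltas, gamma, 1.0))
    print("G - V      :", [g - v for g, v in zip(discounted_returns(rewards, gamma), values)])
```

The backward pass costs one multiply-add per step, whereas the brute-force sum is quadratic in the episode length; the two should agree up to floating-point rounding.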
Trust regions: TRPO and PPO
REINFORCE and A2C take one gradient step per batch because the advantage estimate is only valid near the policy that generated the data; a large step can collapse performance. The fix is to bound the update — to control the step size in policy space rather than parameter space.
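TRPO (Schulman et al., 2015) makes this literal: maximize the importance-weighted surrogate $L(\theta) = \mathbb{E}_t\big[\rho_t(\theta)\, \hat{A}_t\big]$, with ratio $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$, subject to a trust-region constraint $\mathbb{E}_t\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big)\big] \le \delta_{\mathrm{KL}}$. The KL ball is the trust region; within it the linearized objective is reliable.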
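The two quantities TRPO balances, the importance-weighted surrogate and the mean KL to the data-collecting policy, are easy to evaluate for small categorical policies. The following is a sketch of that evaluation only, not of TRPO’s constrained optimizer; the function names and the threshold argument `max_kl` are hypothetical, and the probabilities in the usage lines are made up.

```python
import math


def kl_categorical(p, q):
    """KL(p || q) for two categorical distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def surrogate_and_mean_kl(old_probs, new_probs, actions, advantages):
    """Return (importance-weighted surrogate, mean KL(old || new)) over a batch.

    old_probs / new_probs: per-state action probabilities under the old and
    candidate policies; actions: indices of the actions actually taken.
    """
    n = len(actions)
    surrogate = 0.0
    mean_kl = 0.0
    for old, new, action, adv in zip(old_probs, new_probs, actions, advantages):
        ratio = new[action] / old[action]  # pi_new(a|s) / pi_old(a|s)
        surrogate += ratio * adv
        mean_kl += kl_categorical(old, new)
    return surrogate / n, mean_kl / n


def inside_trust_region(mean_kl, max_kl):
    """True if the candidate policy stays within the KL ball of radius max_kl."""
    return mean_kl <= max_kl


if __name__ == "__main__":
    old = [[0.5, 0.5], [0.7, 0.3]]
    new = [[0.6, 0.4], [0.65, 0.35]]
    surrogate, mean_kl = surrogate_and_mean_kl(old, new, actions=[0, 1], advantages=[2.0, -1.0])
    print(f"surrogate = {surrogate:.4f}, mean KL = {mean_kl:.5f}")
    print("accept step:", inside_trust_region(mean_kl, max_kl=0.01))
```

A candidate update that raises the surrogate but leaves the KL ball would be rejected or shrunk; how that search is carried out is the part this sketch leaves out.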
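PPO (Schulman et al., 2017) replaces the hard constraint with a cheaper clipped surrogate,
$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\, \hat{A}_t,\ \operatorname{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t\big)\Big],$$
where $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio and $\epsilon$ is the clip range.
The $\min$ makes $L^{\mathrm{CLIP}}$ a pessimistic lower bound on the unclipped surrogate: when an advantage is positive, the objective stops rewarding increases in $\rho_t$ once $\rho_t > 1 + \epsilon$ (its gradient there is zero); when negative, it stops rewarding decreases once $\rho_t < 1 - \epsilon$. Either way there is no incentive to push the policy far from $\pi_{\theta_{\text{old}}}$ in a single update, so PPO can safely take several epochs of minibatch updates per batch: the sample reuse REINFORCE lacked.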
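The flat regions of the clipped surrogate can be seen on a single sample. The sketch below evaluates the per-sample term and its derivative with respect to the ratio; `ppo_clip_term` and `ppo_clip_grad` are hypothetical names for this illustration (not the companion’s `ppo_clip_objective`), and the clip range of 0.2 in the usage lines is only an example value.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(x, hi))


def ppo_clip_term(ratio, advantage, eps):
    """Per-sample clipped surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    return min(ratio * advantage, clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)


def ppo_clip_grad(ratio, advantage, eps):
    """Derivative of the per-sample term with respect to the ratio.

    Zero where the clip is active on the side the advantage favors,
    equal to the advantage everywhere else (kinks at the boundary ignored).
    """
    if advantage > 0 and ratio > 1.0 + eps:
        return 0.0
    if advantage < 0 and ratio < 1.0 - eps:
        return 0.0
    return advantage


if __name__ == "__main__":
    eps = 0.2
    for ratio in (0.5, 0.9, 1.0, 1.1, 1.5):
        for adv in (2.0, -1.0):
            term = ppo_clip_term(ratio, adv, eps)
            grad = ppo_clip_grad(ratio, adv, eps)
            print(f"ratio={ratio:.1f} A={adv:+.1f} term={term:+.3f} d/dratio={grad:+.1f}")
```

Note the asymmetry: with a positive advantage and a ratio below $1 - \epsilon$ the term is not clipped, so the gradient can still pull a policy back toward the old one after it has drifted the wrong way.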
Implementation matters
A sobering empirical fact closes the family. Engstrom et al. (2020) showed that much of PPO’s advantage over TRPO comes not from the clipped objective but from code-level details applied alongside it: advantage normalization, value-function clipping, reward scaling, learning-rate annealing, and orthogonal initialization. Together with the broader reproducibility findings of Henderson et al. (2018), the lesson is that on-policy deep RL results must be read with the implementation, not just the algorithm, in view. The companion therefore tests the components (the GAE recursion, advantage normalization, the clip) and checks learning on a simple task, rather than chasing a benchmark number.
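Advantage normalization, one of the code-level details named above, fits in a few lines. This is a minimal sketch under assumptions: the name `normalize_advantages`, the population (not sample) standard deviation, and the stabilizer `eps=1e-8` are choices made for the example and may differ from what the companion does.

```python
import math


def normalize_advantages(advantages, eps=1e-8):
    """Shift and scale a batch of advantages to mean 0 and (near) unit variance.

    eps guards against division by zero when every advantage is identical.
    """
    n = len(advantages)
    mean = sum(advantages) / n
    variance = sum((a - mean) ** 2 for a in advantages) / n  # population variance
    std = math.sqrt(variance)
    return [(a - mean) / (std + eps) for a in advantages]


if __name__ == "__main__":
    batch = [1.0, 2.0, 3.0, 4.0]  # illustrative values
    normalized = normalize_advantages(batch)
    mean = sum(normalized) / len(normalized)
    var = sum((a - mean) ** 2 for a in normalized) / len(normalized)
    print(f"mean = {mean:.2e}, variance = {var:.6f}")
```

Normalizing per batch keeps the scale of the policy-gradient step roughly independent of the reward scale, which is one reason such details can shift results as much as the objective itself.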
The dynamic-programming bridge
Actor-critic completes the generalized-policy-iteration picture under approximation. The critic is approximate policy evaluation (Chapter 1’s $V^\pi$, learned from samples); the gradient actor is approximate improvement. Where policy iteration took the full greedy jump to $\arg\max_a Q^\pi(s, a)$, the actor takes a small step, and the trust region (TRPO’s KL ball, PPO’s clip) is exactly the step-size control that keeps improvement reliable when evaluation is noisy and local. Two bridges:
- To continuous control (Week 9). Everything here is on-policy and stochastic; Week 9 turns to off-policy continuous control (DDPG, TD3, SAC), trading the on-policy stability bought here by trust regions for sample reuse through replay.
- To optimal control (Part II). A trust region on the update is a regularized step — the policy-space analogue of a line search or a Levenberg–Marquardt damping in optimization, and a cousin of MPC’s receding horizon as a bound on how far ahead a single decision commits.
What’s next
- Week 9 leaves on-policy methods for off-policy continuous control: the deterministic policy gradient (DDPG), its twin-critic fix (TD3), and maximum-entropy RL (SAC) — where the entropy bonus of Chapter 7 becomes the objective.
Exercises
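- (Derive) Derive the GAE recursion from the exponential sum, and show that $\lambda = 1$ gives the Monte Carlo advantage (Prop. 8.1).
  Solution: Split the sum at $l = 0$: $\hat{A}_t = \delta_t + \gamma\lambda \sum_{l \ge 0} (\gamma\lambda)^l \delta_{t+1+l} = \delta_t + \gamma\lambda\, \hat{A}_{t+1}$. At $\lambda = 1$ the value terms in $\sum_{l \ge 0} \gamma^l \delta_{t+l}$ telescope to $G_t - V(s_t)$.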
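- (Prove) Show that $L^{\mathrm{CLIP}}$ is a lower bound on the unclipped surrogate $\mathbb{E}_t[\rho_t \hat{A}_t]$, where $\rho_t$ is the probability ratio, and identify where its gradient with respect to $\rho_t$ is zero.
  Solution: $\min(a, b) \le a$ pointwise, so the expectation is a lower bound. For $\hat{A}_t > 0$ the clip caps the term at $(1+\epsilon)\hat{A}_t$ once $\rho_t > 1+\epsilon$, where the gradient with respect to $\rho_t$ is zero; for $\hat{A}_t < 0$ it floors at $(1-\epsilon)\hat{A}_t$ once $\rho_t < 1-\epsilon$, again with zero gradient. Inside $[1-\epsilon, 1+\epsilon]$ the gradient is $\hat{A}_t$.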
- (Compute) With discount $\gamma$, GAE parameter $\lambda$, and TD residuals $\delta_0, \delta_1, \delta_2$ at the end of an episode (bootstrap zero), compute $\hat{A}_0, \hat{A}_1, \hat{A}_2$.
  Solution: Backward: $\hat{A}_2 = \delta_2$; $\hat{A}_1 = \delta_1 + \gamma\lambda\, \hat{A}_2$; $\hat{A}_0 = \delta_0 + \gamma\lambda\, \hat{A}_1$. (The companion’s `compute_gae` reproduces these.)
- (Implement) In the companion, verify that the GAE recursion matches the exponential sum and that $\lambda = 1$ equals returns minus values; that advantage normalization produces mean-0/unit-variance advantages; that the PPO clip flattens the objective outside $[1-\epsilon, 1+\epsilon]$; and that PPO learns CartPole above the random baseline.
  Solution: See `experiments/python/week08/test_ppo.py`: `compute_gae` vs the brute-force $\sum_l (\gamma\lambda)^l \delta_{t+l}$; the $\lambda = 1$ telescoping identity; the normalization statistics; the clip’s value/gradient outside the range; and a seeded PPO CartPole run clearing the ~22 random baseline.
- (Extend) Sweep the PPO clip range $\epsilon$ and add KL early stopping; compare stability across seeds.
  Solution: Smaller $\epsilon$ tightens the trust region (more stable, slower); larger $\epsilon$ loosens it (faster, riskier). KL early stopping halts the epoch loop once the policy has moved a target KL from $\pi_{\theta_{\text{old}}}$, a direct enforcement of the trust region the clip only approximates. The companion’s `--clip`/`--target-kl` flags expose both.
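For the (Extend) exercise, the interaction between the clip and KL early stopping can be explored on a toy problem before touching CartPole. The sketch below is an assumption-laden illustration, not the companion’s training loop: it uses a two-action policy with a single logit `theta`, the exact Bernoulli KL (practical implementations usually estimate the KL from sampled log-probabilities), plain gradient ascent, and hypothetical names and default values throughout.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bernoulli_kl(theta_old, theta_new):
    """Exact KL(pi_old || pi_new) for a two-action policy with P(a=1) = sigmoid(theta)."""
    p, q = sigmoid(theta_old), sigmoid(theta_new)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def clipped_surrogate_grad(theta, theta_old, actions, advantages, eps):
    """Gradient of the mean clipped surrogate with respect to theta."""
    p_new, p_old = sigmoid(theta), sigmoid(theta_old)
    grad = 0.0
    for action, adv in zip(actions, advantages):
        ratio = p_new / p_old if action == 1 else (1.0 - p_new) / (1.0 - p_old)
        clipped = (adv > 0 and ratio > 1.0 + eps) or (adv < 0 and ratio < 1.0 - eps)
        if not clipped:
            dlogp = (1.0 - p_new) if action == 1 else -p_new  # d log pi(a) / d theta
            grad += adv * ratio * dlogp  # d(ratio * A)/d theta = A * ratio * dlogp
    return grad / len(actions)


def ppo_epochs_with_kl_stop(theta_old, actions, advantages,
                            eps=0.2, lr=0.5, max_epochs=20, target_kl=0.01):
    """Reuse one batch for several epochs; stop once KL(old || new) exceeds target_kl."""
    theta = theta_old
    epochs_run = 0
    for _ in range(max_epochs):
        theta += lr * clipped_surrogate_grad(theta, theta_old, actions, advantages, eps)
        epochs_run += 1
        if bernoulli_kl(theta_old, theta) > target_kl:
            break  # the policy has left the target KL ball
    return theta, epochs_run


if __name__ == "__main__":
    actions = [1, 1, 0, 1]              # illustrative batch
    advantages = [1.0, 1.0, -1.0, 1.0]
    for target in (0.005, 0.01, float("inf")):
        theta, epochs = ppo_epochs_with_kl_stop(0.0, actions, advantages, target_kl=target)
        kl = bernoulli_kl(0.0, theta)
        print(f"target_kl={target}: epochs={epochs}, theta={theta:.4f}, KL={kl:.5f}")
```

With the KL check disabled, the clip alone eventually zeroes the gradient for every sample in this toy batch, but only after the policy has moved further than a tight KL target would allow; sweeping `eps` and `target_kl` here gives a feel for which of the two binds first.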
Companion code
The Week-8 companion lives at `experiments/python/week08/` and implements A2C and PPO in PyTorch on CartPole-v1, sharing one actor-critic network.
- `ppo.py`: `compute_gae` (the backward recursion), an `ActorCritic` network, the `ppo_clip_objective`, and a PPO/A2C training loop with advantage normalization and a configurable clip range and epoch count.
- `test_ppo.py`: component-correctness tests (`compute_gae` vs the closed-form exponential sum; the $\lambda = 1$ = returns − values identity; advantage normalization; the PPO clip’s value and zero gradient outside $[1-\epsilon, 1+\epsilon]$) plus a seeded PPO CartPole run learning above the random baseline.
# component tests + a seeded CartPole learning check (PyTorch; ~1-2 min on CPU)
PYTHONPATH=. pytest experiments/python/week08/test_ppo.py -q
# worked PPO training run on CartPole
PYTHONPATH=. python experiments/python/week08/ppo.py --updates 150