Part I · Foundations Week 8 Published ppo.py test_ppo.py

Actor-Critic, GAE, PPO, and TRPO

Turning REINFORCE into a stable, sample-reusing optimizer: actor-critic with a learned baseline, generalized advantage estimation as a bias–variance dial, and trust regions (TRPO/PPO) as step-size control in policy space. Why the clipped surrogate works, and why implementation details decide the score.

On this page
  1. Actor-critic
  2. Generalized advantage estimation
  3. Trust regions: TRPO and PPO
  4. Implementation matters
  5. The dynamic-programming bridge
  6. What’s next
  7. Exercises
  8. Companion code

Actor-Critic, GAE, PPO, and TRPO

Where we are. REINFORCE (Chapter 7) was unbiased but high-variance and strictly on-policy: one noisy gradient step per batch of fresh trajectories, then throw the data away. This chapter turns it into the workhorse of modern on-policy RL by adding three things: a learned critic as the baseline (actor-critic), generalized advantage estimation to tune the advantage’s bias–variance, and a trust region (TRPO/PPO) that bounds how far each update moves the policy — which is what finally lets a batch be reused for several epochs. The roadmap’s framing is the through-line: trust regions are step-size control in policy space.

Actor-critic

Chapter 7 ended with the advantage form of the policy gradient, θJ=E[tθlogπθ(AtSt)A(St,At)]\nabla_\theta J = \E[\sum_t \nabla_\theta\log\policy_\theta(A_t\mid S_t)\,\advantage(S_t,A_t)]. Actor-critic makes the baseline a learned function: an actor πθ\policy_\theta and a critic vϕ\valuefn_\phi trained to predict returns, with the advantage estimated from the critic.

Advantage actor-critic (A2C) updates both together — the critic by regressing vϕ\valuefn_\phi toward the returns, the actor by ascending the score weighted by the critic’s advantage. Konda and Tsitsiklis Konda & Tsitsiklis (2000) established the two-timescale convergence of actor-critic with a linear critic; the critic plays the role of approximate policy evaluation in a sampled, function-approximated generalized policy iteration.

Generalized advantage estimation

How should the advantage be estimated? The one-step TD residual δt=Rt+1+γvϕ(St+1)vϕ(St)\delta_t = R_{t+1} + \discount\valuefn_\phi(S_{t+1}) - \valuefn_\phi(S_t) is itself a low-variance, high-bias advantage estimate; the full Monte Carlo advantage Gtvϕ(St)\return_t - \valuefn_\phi(S_t) is the reverse. Generalized advantage estimation Schulman et al. (2016) interpolates with an exponential weighting.

Proposition 8.1 (The GAE recursion).

The generalized advantage estimator

AtGAE(γ,λ)l0(γλ)lδt+l\advantage^{\text{GAE}(\discount,\lambda)}_t \defeq \sum_{l\ge 0}(\discount\lambda)^l\,\delta_{t+l}

satisfies the backward recursion

AtGAE=δt+γλAt+1GAE.\advantage^{\text{GAE}}_t = \delta_t + \discount\lambda\,\advantage^{\text{GAE}}_{t+1}.

Its endpoints are λ=0\lambda=0, giving At=δt\advantage_t=\delta_t (low variance, high bias), and λ=1\lambda=1, giving At=l0γlδt+l=Gtvϕ(St)\advantage_t=\sum_{l\ge0}\discount^l\delta_{t+l}=\return_t-\valuefn_\phi(S_t) (the Monte Carlo advantage: high variance, low bias).

Proof.

Split the defining sum at its first term:

l0(γλ)lδt+l=δt+l1(γλ)lδt+l=δt+γλl0(γλ)lδt+1+l=δt+γλAt+1GAE.\sum_{l\ge 0}(\discount\lambda)^l\delta_{t+l} = \delta_t + \sum_{l\ge 1}(\discount\lambda)^l\delta_{t+l} = \delta_t + \discount\lambda\sum_{l\ge 0}(\discount\lambda)^l\delta_{t+1+l} = \delta_t + \discount\lambda\,\advantage^{\text{GAE}}_{t+1}.

For λ=1\lambda=1, l0γlδt+l\sum_{l\ge0}\discount^l\delta_{t+l} telescopes: writing δt+l=Rt+l+1+γvϕ(St+l+1)vϕ(St+l)\delta_{t+l} = R_{t+l+1}+\discount\valuefn_\phi(S_{t+l+1})-\valuefn_\phi(S_{t+l}), the value terms cancel in pairs, leaving l0γlRt+l+1vϕ(St)=Gtvϕ(St)\sum_{l\ge0}\discount^l R_{t+l+1} - \valuefn_\phi(S_t) = \return_t-\valuefn_\phi(S_t). \qquad\blacksquare

The recursion is what the companion computes in one backward pass.

Intermediate λ\lambda (commonly 0.95\approx 0.95) usually beats both endpoints — the same lesson as nn-step TD, now applied to the advantage that weights the policy gradient.

Trust regions: TRPO and PPO

REINFORCE and A2C take one gradient step per batch because the advantage estimate is only valid near the policy that generated the data; a large step can collapse performance. The fix is to bound the update — to control the step size in policy space rather than parameter space.

TRPO Schulman et al. (2015) makes this literal: maximize the importance-weighted surrogate E[ρtAt]\E[\rho_t\,\advantage_t], with ratio ρt=πθ(AtSt)/πold(AtSt)\rho_t = \policy_\theta(A_t\mid S_t)/\policy_{\text{old}}(A_t\mid S_t), subject to a trust-region constraint E[KL(πoldπθ)]δ\E[\mathrm{KL}(\policy_{\text{old}}\,\|\,\policy_\theta)] \le \delta. The KL ball is the trust region; within it the linearized objective is reliable.

PPO Schulman et al. (2017) replaces the hard constraint with a cheaper clipped surrogate,

LCLIP(θ)=E[min(ρtAt, clip(ρt,1ϵ,1+ϵ)At)].L^{\text{CLIP}}(\theta) = \E\big[\min\big(\rho_t\,\advantage_t,\ \mathrm{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\,\advantage_t\big)\big].

The min\min makes LCLIPL^{\text{CLIP}} a pessimistic lower bound on the unclipped surrogate: when an advantage is positive, the objective stops rewarding increases in ρt\rho_t once ρt>1+ϵ\rho_t > 1+\epsilon (its gradient there is zero); when negative, it stops once ρt<1ϵ\rho_t < 1-\epsilon.

Either way there is no incentive to push the policy far from πold\policy_{\text{old}} in a single update, so PPO can safely take several epochs of minibatch updates per batch — the sample reuse REINFORCE lacked.

Implementation matters

A sobering empirical fact closes the family. Engstrom et al. Engstrom et al. (2020) showed that much of PPO’s advantage over TRPO comes not from the clipped objective but from code-level details — advantage normalization, value-function clipping, reward scaling, learning-rate annealing, orthogonal initialization — applied alongside it. Together with the broader reproducibility findings of Henderson et al. Henderson et al. (2018) , the lesson is that on-policy deep RL results must be read with the implementation, not just the algorithm, in view. The companion therefore tests the components (the GAE recursion, advantage normalization, the clip) and checks learning on a simple task, rather than chasing a benchmark number.

The dynamic-programming bridge

Actor-critic completes the generalized-policy-iteration picture under approximation. The critic is approximate policy evaluation (Chapter 1’s vπ\valuefn_\policy, learned from samples); the gradient actor is approximate improvement. Where policy iteration took the full greedy jump to arg maxaq\argmax_a \qfn, the actor takes a small step, and the trust region (TRPO’s KL ball, PPO’s clip) is exactly the step-size control that keeps improvement reliable when evaluation is noisy and local. Two bridges:

  • To continuous control (Week 9). Everything here is on-policy and stochastic; Week 9 turns to off-policy continuous control (DDPG, TD3, SAC), trading sample reuse-by-replay for the on-policy stability bought here by trust regions.
  • To optimal control (Part II). A trust region on the update is a regularized step — the policy-space analogue of a line search or a Levenberg–Marquardt damping in optimization, and a cousin of MPC’s receding horizon as a bound on how far ahead a single decision commits.

What’s next

  • Week 9 leaves on-policy methods for off-policy continuous control: the deterministic policy gradient (DDPG), its twin-critic fix (TD3), and maximum-entropy RL (SAC) — where the entropy bonus of Chapter 7 becomes the objective.

Exercises

  1. (Derive) Derive the GAE recursion AtGAE=δt+γλAt+1GAE\advantage^{\text{GAE}}_t = \delta_t + \discount\lambda\,\advantage^{\text{GAE}}_{t+1} from the exponential sum, and show λ=1\lambda=1 gives the Monte Carlo advantage (Prop. 8.1).

    Solution

    Split the sum at l=0l=0: l0(γλ)lδt+l=δt+γλl0(γλ)lδt+1+l=δt+γλAt+1GAE\sum_{l\ge0}(\discount\lambda)^l\delta_{t+l} = \delta_t + \discount\lambda\sum_{l\ge0}(\discount\lambda)^l\delta_{t+1+l} = \delta_t + \discount\lambda\advantage^{\text{GAE}}_{t+1}. At λ=1\lambda=1 the value terms in lγlδt+l\sum_l\discount^l\delta_{t+l} telescope to Gtvϕ(St)\return_t-\valuefn_\phi(S_t).

  2. (Prove) Show LCLIPL^{\text{CLIP}} is a lower bound on the unclipped surrogate E[ρtAt]\E[\rho_t\advantage_t], and identify where its gradient with respect to ρt\rho_t is zero.

    Solution

    min(ρA,clip(ρ)A)ρA\min(\rho\advantage, \mathrm{clip}(\rho)\advantage) \le \rho\advantage pointwise, so the expectation is a lower bound. For A>0\advantage>0 the clip caps the term at (1+ϵ)A(1+\epsilon)\advantage once ρ>1+ϵ\rho>1+\epsilon, where /ρ=0\partial/\partial\rho = 0; for A<0\advantage<0 it floors at (1ϵ)A(1-\epsilon)\advantage once ρ<1ϵ\rho<1-\epsilon, again with zero gradient. Inside [1ϵ,1+ϵ][1-\epsilon,1+\epsilon] the gradient is A\advantage.

  3. (Compute) With γ=0.99\discount=0.99, λ=0.95\lambda=0.95, and TD residuals δ=(1.0,0.5,0.2)\delta = (1.0, 0.5, -0.2) at the end of an episode (bootstrap zero), compute AGAE\advantage^{\text{GAE}}.

    Solution

    Backward: A2=0.2\advantage_2 = -0.2; A1=0.5+0.990.95(0.2)=0.3119\advantage_1 = 0.5 + 0.99\cdot0.95\cdot(-0.2) = 0.3119; A0=1.0+0.990.950.3119=1.2933\advantage_0 = 1.0 + 0.99\cdot0.95\cdot0.3119 = 1.2933. (The companion’s compute_gae reproduces these.)

  4. (Implement) In the companion, verify the GAE recursion matches the exponential sum and that λ=1\lambda=1 equals returns minus values; that advantage normalization produces mean-0/unit-variance advantages; that the PPO clip flattens the objective outside [1ϵ,1+ϵ][1-\epsilon,1+\epsilon]; and that PPO learns CartPole above the random baseline.

    Solution

    See experiments/python/week08/test_ppo.py: compute_gae vs the brute-force (γλ)lδ\sum(\discount\lambda)^l\delta; the λ=1\lambda=1 telescoping identity; the normalization statistics; the clip’s value/gradient outside the range; and a seeded PPO CartPole run clearing the ~22 random baseline.

  5. (Extend) Sweep the PPO clip range ϵ\epsilon and add KL early stopping; compare stability across seeds.

    Solution

    Smaller ϵ\epsilon tightens the trust region (more stable, slower); larger loosens it (faster, riskier). KL early stopping halts the epoch loop once the policy has moved a target KL from πold\policy_{\text{old}}, a direct enforcement of the trust region the clip only approximates — the companion’s --clip/--target-kl flags expose both.

Companion code

The Week-8 companion lives at experiments/python/week08/ and is PyTorch A2C and PPO on CartPole-v1, sharing one actor-critic network.

  • ppo.pycompute_gae (the backward recursion), an ActorCritic network, the ppo_clip_objective, and a PPO/A2C training loop with advantage normalization and a configurable clip range and epoch count.
  • test_ppo.py — component-correctness tests (compute_gae vs the closed-form exponential sum; the λ=1\lambda=1 = returns − values identity; advantage normalization; the PPO clip’s value and zero gradient outside [1ϵ,1+ϵ][1-\epsilon,1+\epsilon]) plus a seeded PPO CartPole run learning above the random baseline.
# component tests + a seeded CartPole learning check (PyTorch; ~1-2 min on CPU)
PYTHONPATH=. pytest experiments/python/week08/test_ppo.py -q

# worked PPO training run on CartPole
PYTHONPATH=. python experiments/python/week08/ppo.py --updates 150