Off-Policy Continuous Control: DDPG, TD3, and SAC

Off-policy actor-critic for continuous actions: the deterministic policy gradient (DDPG), the twin-critic overestimation fix (TD3), and maximum-entropy RL (SAC). Why the soft-optimal policy is Boltzmann in the action-value, and how maximum-entropy RL is KL-regularized optimal control — the bridge from learning to Part II.

Where we are. Weeks 7–8 optimized stochastic policies on-policy. This chapter, the last of the RL foundations, turns to off-policy continuous control, where two ideas dominate the model-free baselines. First, the deterministic policy gradient (DDPG) makes continuous actions tractable by pushing the gradient through a differentiable critic instead of averaging over sampled actions. Second, maximum-entropy RL (SAC) augments the reward with policy entropy, which both stabilizes learning and reveals a deep identity: maximum-entropy RL is KL-regularized optimal control, the bridge out of learning and into Part II. Between them, TD3 fixes the overestimation that Chapter 6 first diagnosed, now arising from the critic’s own bootstrap.

The deterministic policy gradient

With a continuous action space the stochastic policy gradient (Chapter 7) must average the score over actions, which is expensive and high-variance. A deterministic policy $\mu_\theta:\mathcal{S}\to\mathcal{A}$ avoids the action integral, and its gradient flows through the critic by the chain rule.

Theorem 9.1 (Deterministic policy gradient).

For a deterministic policy $\mu_\theta$ and its action-value $q_{\mu_\theta}$, under mild regularity the gradient of the off-policy objective is

$$\nabla_\theta J(\theta) = \mathbb{E}_{s\sim\rho}\Big[\nabla_\theta\mu_\theta(s)\,\nabla_a q_{\mu_\theta}(s,a)\big|_{a=\mu_\theta(s)}\Big],$$

where $\rho$ is the (off-policy) state distribution. The policy is improved by pushing its output up the critic’s action-gradient.

This is the continuous-action analogue of greedy improvement: where a discrete agent takes $\arg\max_a q$, the deterministic actor takes a gradient step in $a$ toward larger $q$.
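To make the chain rule in Theorem 9.1 concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption chosen for illustration: a scalar state, a hypothetical linear actor $\mu_\theta(s)=\theta s$, and a fixed, known critic $q(s,a)=-(a-2s)^2$ that does not depend on the policy. Under those assumptions the best actor is $\theta=2$, and repeated steps along $\nabla_\theta\mu_\theta(s)\,\nabla_a q(s,a)$ should move $\theta$ there.

```python
import random

def mu(theta, s):
    """Hypothetical deterministic linear actor: mu_theta(s) = theta * s."""
    return theta * s

def grad_a_q(s, a):
    """Action-gradient of the toy critic q(s, a) = -(a - 2 s)^2 (an assumption)."""
    return -2.0 * (a - 2.0 * s)

def dpg_step(theta, states, lr):
    """One ascent step along the sampled deterministic policy gradient."""
    grad = 0.0
    for s in states:
        dmu_dtheta = s                       # grad_theta mu_theta(s)
        dq_da = grad_a_q(s, mu(theta, s))    # grad_a q(s, a) at a = mu_theta(s)
        grad += dmu_dtheta * dq_da           # chain rule, one state
    return theta + lr * grad / len(states)   # Monte Carlo average over s ~ rho

rng = random.Random(0)
theta = 0.0
for _ in range(500):
    # States drawn from an arbitrary (off-policy) distribution rho.
    batch = [rng.uniform(-1.0, 1.0) for _ in range(32)]
    theta = dpg_step(theta, batch, lr=0.1)

print(round(theta, 3))
```

No action is ever sampled from the policy: the only randomness is in the states, which is the point of the deterministic gradient. In DDPG the critic is itself learned, so the action-gradient is only as good as the critic.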

Silver et al. (2014) proved the theorem (as the zero-variance limit of the stochastic gradient); Lillicrap et al. (2016) turned it into DDPG: the deterministic actor and critic trained off-policy with the replay buffer and target networks of Chapter 6, and exploration noise added to the actor’s output.
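The two data-side ingredients of DDPG named above, a replay buffer and exploration noise on the actor’s output, can be sketched in a few lines. This is an illustrative version, not the companion’s `ReplayBuffer`: the class layout, the `noisy_action` helper, and the choice of independent Gaussian noise are assumptions (Lillicrap et al. used a temporally correlated Ornstein-Uhlenbeck process instead).

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity, seed=0):
        self.data = deque(maxlen=capacity)   # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        """Uniform sample without replacement (capped at the buffer size)."""
        k = min(batch_size, len(self.data))
        return self.rng.sample(list(self.data), k)

    def __len__(self):
        return len(self.data)

def noisy_action(mu_s, sigma, low, high, rng):
    """Behavior action: actor output plus Gaussian noise, clipped to the action range."""
    a = mu_s + rng.gauss(0.0, sigma)
    return max(low, min(high, a))

rng = random.Random(0)
buffer = ReplayBuffer(capacity=1000)
s = 0.0
for t in range(5):
    a = noisy_action(mu_s=0.5 * s, sigma=0.1, low=-2.0, high=2.0, rng=rng)
    s_next, r = s + a, -abs(s + a)           # made-up dynamics and reward
    buffer.add(s, a, r, s_next, False)
    s = s_next
print(len(buffer), len(buffer.sample(3)))
```

Because the transitions come from a noisy behavior policy and are replayed long after they were collected, the state distribution they induce is exactly the off-policy $\rho$ that the theorem allows.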

Overestimation returns: TD3

DDPG inherits the deadly-triad fragility and the overestimation bias of Chapter 6, now produced by the critic bootstrapping on its own optimistic estimates. TD3 (Fujimoto et al., 2018) applies three fixes, the first echoing Double DQN:

  1. Twin critics, take the minimum. Train two critics $q_{\phi_1}, q_{\phi_2}$ and form the target with $\min(q_{\phi_1}, q_{\phi_2})$. This is clipped double Q-learning, a continuous cousin of Double DQN that caps the upward bias of Proposition 6.1.
  2. Delayed policy updates. Update the actor (and targets) less often than the critics, so the policy chases a more settled value.
  3. Target policy smoothing. Add clipped noise to the target action, regularizing the critic against sharp peaks it could otherwise exploit.
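The three fixes above can be sketched for a single transition with scalar actions. The function names, argument order, and callables below are hypothetical and are not the companion’s `td3_target`; the default noise scale 0.2, noise clip 0.5, and delay 2 are the values usually quoted from Fujimoto et al. (2018), used here only as placeholders.

```python
import random

def clip(x, low, high):
    return max(low, min(high, x))

def td3_target_sketch(r, s_next, done, actor_targ, q1_targ, q2_targ, rng,
                      gamma=0.99, noise_std=0.2, noise_clip=0.5,
                      act_low=-1.0, act_high=1.0):
    """Bootstrap target for one transition (all callables are target-network copies)."""
    # Fix 3, target policy smoothing: clipped Gaussian noise on the target action,
    # then clip the perturbed action back into the valid range.
    eps = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    a_next = clip(actor_targ(s_next) + eps, act_low, act_high)
    # Fix 1, clipped double-Q: bootstrap from the smaller of the two critics.
    q_min = min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
    # Terminal transitions do not bootstrap.
    return r + gamma * (1.0 - float(done)) * q_min

def should_update_actor(critic_step, policy_delay=2):
    """Fix 2, delayed updates: the actor and targets move once per policy_delay critic steps."""
    return critic_step % policy_delay == 0

# Constant critics returning 2.0 and 1.4: the target bootstraps from 1.4.
y = td3_target_sketch(
    r=1.0, s_next=0.0, done=False,
    actor_targ=lambda s: 0.0,
    q1_targ=lambda s, a: 2.0,
    q2_targ=lambda s, a: 1.4,
    rng=random.Random(0),
)
print(y)
```

In a batched implementation the same three lines run elementwise over the minibatch, and both critics regress toward the one shared target `y`.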

The first is the load-bearing one: taking the minimum of two separately trained critics biases the target downward, which is far safer for a bootstrapped target than the overestimate that builds up when a single critic is both maximized by the actor and reused in its own target.
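The direction of the two biases can be checked with a small Monte Carlo sketch. It is an idealization, not a model of TD3 training: every action has the same true value, each estimate carries independent zero-mean Gaussian noise, and all parameter values are arbitrary. Real twin critics share data and an actor, so their errors are correlated and the downward bias is smaller than here.

```python
import random
import statistics

def simulate(true_q=1.0, noise=0.5, n_actions=5, trials=5000, seed=0):
    """Compare a greedy max over one noisy critic with the min over two noisy critics."""
    rng = random.Random(seed)
    max_est, min_est = [], []
    for _ in range(trials):
        # One critic, several equally good actions: the greedy pick is the largest estimate.
        single = [true_q + rng.gauss(0.0, noise) for _ in range(n_actions)]
        max_est.append(max(single))
        # Two critics scoring the same action: the target keeps the smaller estimate.
        twin = [true_q + rng.gauss(0.0, noise) for _ in range(2)]
        min_est.append(min(twin))
    return statistics.fmean(max_est), statistics.fmean(min_est)

over, under = simulate()
print(f"true value 1.0 | mean of max: {over:.3f} | mean of twin min: {under:.3f}")
```

The mean of the max lands above the true value and the mean of the twin minimum lands below it, even though every individual estimate is unbiased.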

Maximum-entropy RL and SAC

A different idea reshapes the objective itself. Maximum-entropy RL adds the policy’s entropy $\mathcal{H}(\pi(\cdot\mid s))$ to the reward, scaled by a temperature $\alpha$:

$$J(\pi) = \mathbb{E}_\pi\Big[\sum_t \Big( r(S_t,A_t) + \alpha\,\mathcal{H}\big(\pi(\cdot\mid S_t)\big) \Big)\Big].$$

The agent is rewarded for acting well and for staying stochastic, which sustains exploration and robustness. Soft actor-critic (Haarnoja et al., 2018) is the off-policy actor-critic for this objective, with an automatically tuned temperature (Haarnoja et al., 2019). The one-step soft problem has a clean optimum.

Proposition 9.1 (The soft-optimal policy is Boltzmann).

Maximizing $\sum_a \pi(a\mid s)\,q(s,a) + \alpha\,\mathcal{H}(\pi(\cdot\mid s))$ over distributions $\pi(\cdot\mid s)$ gives the Boltzmann policy

$$\pi^*(a\mid s) = \frac{\exp\big(q(s,a)/\alpha\big)}{\sum_{a'}\exp\big(q(s,a')/\alpha\big)},$$

with optimal soft value $v^*_{\text{soft}}(s) = \alpha\log\sum_a\exp(q(s,a)/\alpha)$. As $\alpha\to0$ this recovers the greedy $\arg\max$ and the ordinary value.

Proof.

Write $\mathcal{H}(\pi)=-\sum_a\pi(a\mid s)\log\pi(a\mid s)$ and maximize $\sum_a\pi(a\mid s)\big[q(s,a)-\alpha\log\pi(a\mid s)\big]$ subject to $\sum_a\pi(a\mid s)=1$. With multiplier $\lambda$ on the constraint, the Lagrangian’s stationarity in $\pi(a\mid s)$ gives $q(s,a)-\alpha\log\pi(a\mid s)-\alpha-\lambda=0$, so $\log\pi(a\mid s) = q(s,a)/\alpha + \text{const}$, i.e. $\pi(a\mid s)\propto\exp(q(s,a)/\alpha)$. Normalizing by $Z(s)=\sum_a\exp(q(s,a)/\alpha)$ gives the Boltzmann form. Substituting back, $q(s,a)-\alpha\log\pi^*(a\mid s)=\alpha\log Z(s)$ for every action, so the objective equals $v^*_{\text{soft}}(s)=\alpha\log\sum_a\exp(q(s,a)/\alpha)$, the log-sum-exp “soft maximum.” The objective is concave in $\pi$ for $\alpha>0$, so this stationary point is the maximum. As $\alpha\to0$ the soft max $\to\max_a q$ and $\pi^*\to$ greedy. $\blacksquare$

The entropy bonus of Chapter 7 has been promoted from a heuristic to the objective, and its optimum is a temperature-controlled softmax over the action-value.
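Proposition 9.1 is easy to check numerically for a single state with a handful of actions. The sketch below uses made-up action-values and hypothetical helper names. It computes the Boltzmann policy and the log-sum-exp soft value in a numerically stable way (subtracting the maximum before exponentiating), then sweeps the temperature to show the two limits.

```python
import math

def soft_value(q, alpha):
    """alpha * log sum_a exp(q_a / alpha), computed stably by factoring out max(q)."""
    m = max(q)
    return m + alpha * math.log(sum(math.exp((x - m) / alpha) for x in q))

def boltzmann(q, alpha):
    """Softmax of q / alpha: the soft-optimal policy of Proposition 9.1."""
    m = max(q)
    w = [math.exp((x - m) / alpha) for x in q]
    z = sum(w)
    return [x / z for x in w]

def soft_objective(pi, q, alpha):
    """sum_a pi_a * (q_a - alpha * log pi_a), i.e. expected value plus alpha * entropy."""
    return sum(p * (x - alpha * math.log(p)) for p, x in zip(pi, q) if p > 0.0)

q = [1.0, 0.5, -0.2]   # made-up action-values for one state
for alpha in (10.0, 1.0, 0.1, 0.01):
    pi = boltzmann(q, alpha)
    print(alpha, [round(p, 3) for p in pi], round(soft_value(q, alpha), 4))
```

At every temperature, `soft_objective(boltzmann(q, alpha), q, alpha)` agrees with `soft_value(q, alpha)` up to rounding, which is the “substituting back” step of the proof. Large $\alpha$ flattens the policy toward uniform, and small $\alpha$ drives the soft value toward $\max_a q$.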

The dynamic-programming bridge

Maximum-entropy RL is where learning rejoins control. The soft value $\alpha\log\sum_a\exp(q/\alpha)$ is the optimal cost-to-go of a KL-regularized control problem: maximizing reward minus $\alpha\,\mathrm{KL}(\pi\,\|\,\pi_0)$ against a reference $\pi_0$ yields $\pi^*\propto\pi_0\exp(q/\alpha)$, which for a uniform $\pi_0$ is exactly the Boltzmann policy of Proposition 9.1 (the two objectives then differ only by the constant $\alpha\log|\mathcal{A}|$). Todorov’s linearly-solvable MDPs exploit precisely this: under the exponential transform $z = \exp(v_{\text{soft}}/\alpha)$ the soft Bellman equation becomes linear. Three threads close Part I:

  • Continuous-action improvement is the deterministic actor (DPG) or the soft Boltzmann policy (SAC), replacing the discrete $\arg\max$: approximate policy iteration (Chapter 1) in a continuum.
  • Overestimation control (TD3’s twin-min) is the same bias management as Double DQN (Chapter 6), now load-bearing because the bootstrap runs through a critic.
  • To Part II. KL-regularized control, the maximum-entropy LQR with its closed form, and the deterministic limit ($\alpha\to0$) that is classical optimal control are the entry points to LQR (Week 13) and MPC (Week 15), the model-based, deterministic side of the same fixed point.
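The KL-regularized reading of the soft value can be verified on one state. The sketch assumes a finite action set, made-up action-values, and hypothetical function names. It maximizes $\sum_a\pi(a)\,q(a)-\alpha\,\mathrm{KL}(\pi\,\|\,\pi_0)$ in closed form, where the optimizer is $\pi\propto\pi_0\exp(q/\alpha)$ with value $\alpha\log\sum_a\pi_0(a)\exp(q(a)/\alpha)$, and compares the uniform-reference case with the entropy-regularized one.

```python
import math

def kl_regularized_policy(q, prior, alpha):
    """Closed-form maximizer and optimal value of sum(pi*q) - alpha * KL(pi || prior)."""
    m = max(q)                                   # factor out max(q) for stability
    w = [p0 * math.exp((x - m) / alpha) for x, p0 in zip(q, prior)]
    z = sum(w)
    pi = [x / z for x in w]                      # pi proportional to prior * exp(q / alpha)
    value = m + alpha * math.log(z)              # alpha * log sum_a prior_a * exp(q_a / alpha)
    return pi, value

def kl_objective(pi, q, prior, alpha):
    """sum_a pi_a * (q_a - alpha * log(pi_a / prior_a))."""
    return sum(p * (x - alpha * math.log(p / p0))
               for p, x, p0 in zip(pi, q, prior) if p > 0.0)

q, alpha = [1.0, 0.5, -0.2], 0.7                 # made-up values
uniform = [1.0 / len(q)] * len(q)

pi_u, v_u = kl_regularized_policy(q, uniform, alpha)
soft = alpha * math.log(sum(math.exp(x / alpha) for x in q))
print([round(p, 3) for p in pi_u])
print(round(soft - v_u, 6), round(alpha * math.log(len(q)), 6))
```

With a uniform reference the policy is the Boltzmann policy, and the two optimal values differ by the constant $\alpha\log|\mathcal{A}|$, which is what the two printed numbers on the last line compare. A non-uniform `prior` tilts the policy toward the reference without changing the structure of the solution.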

What’s next

  • Week 10 steps back to ask where RL has actually worked in the real world — a survey of robotics successes (locomotion, manipulation, flight, plasma control) and the conditions that made them possible.
  • Part II (Week 11+) then changes register entirely, to control theory: stability, LQR, and model predictive control — met now from the RL side, and rejoined with it in Part III.

Exercises

  1. (Derive) Starting from $J(\theta)=\mathbb{E}_{s\sim\rho}[q_{\mu_\theta}(s,\mu_\theta(s))]$, derive the deterministic policy gradient by the chain rule (Theorem 9.1).

    Solution

    $\nabla_\theta q(s,\mu_\theta(s)) = \nabla_\theta\mu_\theta(s)\,\nabla_a q(s,a)|_{a=\mu_\theta(s)}$ by the chain rule, with the state $s$ held fixed and the dependence of $q_{\mu_\theta}$ on $\theta$ through the policy it evaluates dropped, as in the off-policy result of Silver et al. (2014); taking the expectation over $s\sim\rho$ gives Theorem 9.1. The state distribution $\rho$ may be off-policy, which is why DDPG can learn from a replay buffer.

  2. (Prove) Show the maximum-entropy one-step optimal policy is $\pi^*(a\mid s)\propto\exp(q(s,a)/\alpha)$ with soft value $\alpha\log\sum_a\exp(q(s,a)/\alpha)$ (Prop. 9.1).

    Solution

    Maximize $\sum_a\pi(q-\alpha\log\pi)$ under $\sum_a\pi=1$; stationarity gives $q-\alpha\log\pi-\alpha-\lambda=0$, so $\pi\propto\exp(q/\alpha)$. Substituting the normalized policy back yields the log-sum-exp soft value. The $\alpha\to0$ limit recovers the hard $\max$.

  3. (Compute) A target state has twin-critic estimates $q_{\phi_1}=2.0$, $q_{\phi_2}=1.4$ for the target action. What value does TD3 use, and why is the minimum the safer choice for a bootstrap target?

    Solution

    TD3 uses $\min(2.0,1.4)=1.4$. The single-critic max/overestimate (Prop. 6.1) compounds through bootstrapping; taking the minimum of two independent estimates biases downward, and a slight underestimate does not amplify across the backup the way an overestimate does.

  4. (Implement) In the companion, verify the twin-critic minimum lowers the target versus a single critic; that target-policy-smoothing noise is clipped to range; the Polyak soft target update; and that minimal TD3 learns Pendulum above the random return.

    Solution

    See experiments/python/week09/test_td3.py: the clipped-double-Q target equals the per-sample minimum of the twin critics (≤ either); the smoothing noise and target action respect their clip bounds; the Polyak update matches its closed form; and a seeded TD3 run on Pendulum-v1 clears the random-return baseline by a wide margin.

  5. (Extend) Sweep the SAC temperature $\alpha$ and relate the limit $\alpha\to0$ (greedy) and large $\alpha$ (uniform). (The roadmap’s JAX/Brax SAC baseline is deferred to the dedicated JAX track.)

    Solution

    Small $\alpha$ concentrates the Boltzmann policy on the $\arg\max$ (exploitation, recovering ordinary RL); large $\alpha$ flattens it toward uniform (maximal exploration). Automatic temperature tuning (Haarnoja et al., 2019) adjusts $\alpha$ to hold a target entropy rather than fixing it by hand.
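The idea behind automatic temperature tuning mentioned in the last solution can be sketched on a single state. This is a strong simplification and not SAC itself: the policy is assumed to be the exact Boltzmann policy for the current $\alpha$ (in SAC it is a network trained by gradient steps), the action-values are made up, and the temperature is parameterized as $\log\alpha$ to keep it positive, which is an implementation choice assumed here. The update follows the temperature loss $J(\alpha)=\mathbb{E}_{a\sim\pi}[-\alpha\log\pi(a\mid s)-\alpha\bar{\mathcal{H}}]$, whose gradient in $\alpha$ is $\mathcal{H}(\pi)-\bar{\mathcal{H}}$.

```python
import math

def boltzmann(q, alpha):
    m = max(q)
    w = [math.exp((x - m) / alpha) for x in q]
    z = sum(w)
    return [x / z for x in w]

def entropy(pi):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in pi if p > 0.0)

def tune_temperature(q, target_entropy, lr=0.5, steps=500):
    """Adjust log(alpha) until the Boltzmann policy's entropy meets the target."""
    log_alpha = 0.0                              # start at alpha = 1
    for _ in range(steps):
        alpha = math.exp(log_alpha)
        h = entropy(boltzmann(q, alpha))
        # Entropy above target: lower the temperature. Below target: raise it.
        log_alpha -= lr * (h - target_entropy)
    return math.exp(log_alpha)

q = [1.0, 0.5, 0.0]                              # made-up action-values
target = 0.8                                     # must lie below log(3), the maximum entropy
alpha = tune_temperature(q, target)
print(round(alpha, 4), round(entropy(boltzmann(q, alpha)), 4))
```

The loop settles at the temperature whose policy entropy matches the target, so the quantity chosen by hand becomes the target entropy rather than $\alpha$ itself.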

Companion code

The Week-9 companion lives at experiments/python/week09/ and is a minimal TD3 on Pendulum-v1 (the chapter’s testable centerpiece), with Stable-Baselines3 named as the reference baseline.

  • td3.py: a continuous-action ReplayBuffer, a deterministic Actor and twin Critics, the exposed td3_target (clipped double-Q with target-policy smoothing), Polyak soft_update, and the training loop. Pure PyTorch.
  • test_td3.py: component-correctness tests (the twin-critic minimum lowers the target; smoothing-noise and target-action clipping; the Polyak update’s closed form) plus a seeded Pendulum-v1 run learning well above the random return.

# component tests + a seeded Pendulum learning check (PyTorch; ~1-2 min on CPU)
PYTHONPATH=. pytest experiments/python/week09/test_td3.py -q

# worked minimal-TD3 training run on Pendulum
PYTHONPATH=. python experiments/python/week09/td3.py --steps 40000
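
For readers who want to see the Polyak soft target update in isolation before opening the companion, here is a minimal sketch on plain lists of floats. It is not the companion’s `soft_update` (which operates on PyTorch modules); the function name is hypothetical, and the multi-step closed form checked below, which assumes the online parameters stay fixed, is one natural reading of “closed form” rather than a description of what test_td3.py asserts.

```python
def polyak_update_sketch(target, online, tau):
    """One soft update: target <- (1 - tau) * target + tau * online, elementwise."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]

target0 = [0.0, 1.0]
online = [1.0, -1.0]      # held fixed here; in training it changes every step
tau, k = 0.1, 10

target = target0
for _ in range(k):
    target = polyak_update_sketch(target, online, tau)

# With fixed online parameters, k updates shrink the gap by a factor (1 - tau)^k.
closed_form = [o + (1.0 - tau) ** k * (t0 - o) for t0, o in zip(target0, online)]
print(target, closed_form)
```

Small values of `tau` make the target networks a slowly moving average of the online networks, which is what keeps the bootstrap target from chasing itself.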