Off-Policy Continuous Control: DDPG, TD3, and SAC
Off-policy actor-critic for continuous actions: the deterministic policy gradient (DDPG), the twin-critic overestimation fix (TD3), and maximum-entropy RL (SAC). Why the soft-optimal policy is Boltzmann in the action-value, and how maximum-entropy RL is KL-regularized optimal control — the bridge from learning to Part II.
Where we are. Weeks 7–8 optimized stochastic policies on-policy. This chapter — the last of the RL foundations — turns to off-policy continuous control, where two ideas dominate the model-free baselines. First, the deterministic policy gradient (DDPG) makes continuous actions tractable by pushing the gradient through a differentiable critic instead of sampling them. Second, maximum-entropy RL (SAC) augments the reward with policy entropy, which both stabilizes learning and reveals a deep identity: maximum-entropy RL is KL-regularized optimal control — the bridge out of learning and into Part II. Between them, TD3 fixes the overestimation that Chapter 6 first diagnosed, now arising from the critic’s own bootstrap.
The deterministic policy gradient
With a continuous action space the stochastic policy gradient (Chapter 7) must average the score over actions — expensive and high-variance. A deterministic policy avoids the action integral, and its gradient flows through the critic by the chain rule.
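Theorem 9.1 (deterministic policy gradient). For a deterministic policy $\mu_\theta$ and its action-value $Q^{\mu}$, under mild regularity the gradient of the off-policy objective $J(\theta) = \mathbb{E}_{s\sim\rho}\left[Q^{\mu}(s,\mu_\theta(s))\right]$ is
$$\nabla_\theta J(\theta) = \mathbb{E}_{s\sim\rho}\left[\nabla_\theta \mu_\theta(s)\,\nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}\right],$$
where $\rho$ is the (off-policy) state distribution. The policy is improved by pushing its output up the critic’s action-gradient.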
This is the continuous-action analogue of greedy improvement: where a discrete agent takes $\arg\max_a Q(s,a)$, the deterministic actor takes a gradient step in $\theta$ toward larger $Q$. Silver et al. (2014) proved the theorem (as the zero-variance limit of the stochastic gradient); Lillicrap et al. (2016) turned it into DDPG: the deterministic actor and critic trained off-policy with the replay buffer and target networks of Chapter 6, with exploration noise added to the actor’s output.
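The chain rule behind the deterministic policy gradient can be seen on a one-dimensional toy problem. The sketch below is illustrative only: the linear actor `actor`, the fixed quadratic critic behind `dq_da`, and the helper `dpg_step` are hypothetical names, and the critic is known in closed form rather than learned as it would be in DDPG.

```python
import random

def actor(theta, s):
    """Deterministic linear policy mu_theta(s) = theta * s (illustrative choice)."""
    return theta * s

def dq_da(s, a):
    """Action-gradient of the toy critic Q(s, a) = -(a - 2*s)**2 (assumed, not learned)."""
    return -2.0 * (a - 2.0 * s)

def dpg_step(theta, states, lr=0.1):
    """One ascent step on the sample-average deterministic policy gradient."""
    grad = 0.0
    for s in states:
        a = actor(theta, s)
        dmu_dtheta = s                      # d mu_theta(s) / d theta for the linear actor
        grad += dmu_dtheta * dq_da(s, a)    # chain rule: grad_theta mu * grad_a Q
    return theta + lr * grad / len(states)

rng = random.Random(0)
# Stand-in for states drawn from a replay buffer (any state distribution works here).
states = [rng.uniform(-1.0, 1.0) for _ in range(256)]

theta = 0.0
for _ in range(200):
    theta = dpg_step(theta, states)

print(f"learned theta = {theta:.4f}")  # the toy critic is maximized by a = 2*s, i.e. theta = 2
```

No action is ever sampled: the actor is improved purely by following the critic's action-gradient at the action it currently outputs.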
Overestimation returns: TD3
DDPG inherits the deadly-triad fragility and the overestimation bias of Chapter 6, now produced by the critic bootstrapping on its own optimistic estimates. TD3 (Fujimoto et al., 2018) applies three fixes, the first echoing Double DQN:
- Twin critics, take the minimum. Train two critics $Q_{\phi_1}, Q_{\phi_2}$ and form the target with $\min_{i=1,2} Q_{\phi_i'}(s', a')$, where $\phi_i'$ are the target-critic parameters: clipped double Q-learning, a continuous cousin of Double DQN that caps the upward bias of Proposition 6.1.
- Delayed policy updates. Update the actor (and targets) less often than the critics, so the policy chases a more settled value.
- Target policy smoothing. Add clipped noise to the target action, regularizing the critic against sharp peaks it could otherwise exploit.
The first is the load-bearing one: taking the minimum of two independent critics systematically underestimates, which is far safer for a bootstrapped target than the overestimate the single-critic max produces.
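The first and third fixes both act on the bootstrap target, so they fit in one small function. The following is a scalar, pure-Python sketch and not the companion's implementation: the name `td3_target_sketch` and its arguments are hypothetical, and the defaults `noise_std=0.2` and `noise_clip=0.5` are the values commonly quoted for TD3, assumed here for illustration.

```python
import random

def clip(x, lo, hi):
    """Clamp a scalar to the interval [lo, hi]."""
    return max(lo, min(hi, x))

def td3_target_sketch(r, s_next, done, target_actor, target_q1, target_q2,
                      gamma=0.99, noise_std=0.2, noise_clip=0.5,
                      act_limit=1.0, rng=random):
    """Scalar TD3 target: smoothed target action, then the minimum of twin critics."""
    # Target policy smoothing: clipped Gaussian noise added to the target action.
    eps = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    a_next = clip(target_actor(s_next) + eps, -act_limit, act_limit)
    # Clipped double Q-learning: bootstrap from the smaller of the two target critics.
    q_min = min(target_q1(s_next, a_next), target_q2(s_next, a_next))
    # No bootstrap past a terminal transition (done = 1.0).
    return r + gamma * (1.0 - done) * q_min

# Toy target networks (assumptions for the example only).
toy_actor = lambda s: 0.5 * s
toy_q1 = lambda s, a: 1.0 + a
toy_q2 = lambda s, a: 0.8 + a

y = td3_target_sketch(1.0, 1.0, 0.0, toy_actor, toy_q1, toy_q2, rng=random.Random(0))
print(f"TD3 target: {y:.4f}")
```

The second fix, delayed policy updates, lives in the training loop rather than in the target, so it does not appear here.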
Maximum-entropy RL and SAC
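A different idea reshapes the objective itself. Maximum-entropy RL adds the policy’s entropy to the reward, scaled by a temperature $\alpha$:
$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_t \gamma^t \big(r(s_t,a_t) + \alpha\,\mathcal{H}(\pi(\cdot\mid s_t))\big)\right].$$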
The agent is rewarded for acting well and for staying stochastic, which sustains exploration and robustness. Soft actor-critic (Haarnoja et al., 2018) is the off-policy actor-critic for this objective, with an automatically tuned temperature (Haarnoja et al., 2019). The one-step soft problem has a clean optimum.
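Proposition 9.1 (soft-optimal policy). Maximizing $\mathbb{E}_{a\sim\pi}[Q(s,a)] + \alpha\,\mathcal{H}(\pi)$ over distributions $\pi(\cdot\mid s)$ gives the Boltzmann policy
$$\pi^*(a\mid s) = \frac{\exp(Q(s,a)/\alpha)}{\sum_{a'}\exp(Q(s,a')/\alpha)},$$
with optimal soft value $V(s) = \alpha\log\sum_a \exp(Q(s,a)/\alpha)$. As $\alpha\to 0$ this recovers the greedy $\arg\max_a Q(s,a)$ and the ordinary value $\max_a Q(s,a)$.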
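Proof. Write $\pi(a)$ for $\pi(a\mid s)$ and maximize $\sum_a \pi(a)\,Q(s,a) - \alpha\sum_a \pi(a)\log\pi(a)$ subject to $\sum_a \pi(a) = 1$. The Lagrangian’s stationarity in $\pi(a)$ gives $Q(s,a) - \alpha(\log\pi(a) + 1) - \lambda = 0$, so $\log\pi(a) = Q(s,a)/\alpha + \text{const}$, i.e. $\pi(a)\propto\exp(Q(s,a)/\alpha)$. Normalizing gives the Boltzmann form; substituting back gives $V(s) = \alpha\log\sum_a\exp(Q(s,a)/\alpha)$, the log-sum-exp “soft maximum.” As $\alpha\to 0$ the soft max tends to $\max_a Q(s,a)$ and $\pi^*$ tends to greedy.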
The entropy bonus of Chapter 7 has been promoted from a heuristic to the objective, and its optimum is a temperature-controlled softmax over the action-value.
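For a finite action set, the Boltzmann optimum and its log-sum-exp value are easy to check numerically. The sketch below assumes a small hand-picked vector of action-values and uses hypothetical helper names (`boltzmann_policy`, `soft_value`, `entropy_regularized_objective`). It evaluates the entropy-regularized objective at the Boltzmann policy and sweeps the temperature toward the greedy limit.

```python
import math

def boltzmann_policy(q_values, alpha):
    """Softmax of Q / alpha, with the max subtracted for numerical stability."""
    m = max(q_values)
    weights = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]

def soft_value(q_values, alpha):
    """alpha * log sum_a exp(Q(a) / alpha), computed stably."""
    m = max(q_values)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

def entropy_regularized_objective(pi, q_values, alpha):
    """E_pi[Q] + alpha * H(pi) for a discrete distribution pi."""
    expected_q = sum(p * q for p, q in zip(pi, q_values))
    entropy = -sum(p * math.log(p) for p in pi if p > 0.0)
    return expected_q + alpha * entropy

q = [1.0, 2.0, 0.5]  # illustrative action-values for one state
for alpha in (5.0, 1.0, 0.1, 0.01):
    pi = boltzmann_policy(q, alpha)
    print(f"alpha={alpha:<5} pi={[round(p, 3) for p in pi]} "
          f"soft_value={soft_value(q, alpha):.4f} "
          f"objective={entropy_regularized_objective(pi, q, alpha):.4f}")
```

At every temperature the objective evaluated at the Boltzmann policy coincides with the soft value, and as the temperature shrinks the policy mass moves onto the highest-valued action.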
The dynamic-programming bridge
Maximum-entropy RL is where learning rejoins control. The soft value is the optimal cost-to-go of a KL-regularized control problem: maximizing reward minus $\alpha\,\mathrm{KL}(\pi\,\|\,\pi_0)$ against a reference $\pi_0$ yields, for a uniform $\pi_0$, exactly the Boltzmann policy of Proposition 9.1, since entropy is the negative KL divergence to the uniform distribution up to a constant. Todorov’s linearly-solvable MDPs exploit precisely this: under the exponential transform $z = \exp(V/\alpha)$ the soft Bellman equation becomes linear. Three threads close Part I:
- Continuous-action improvement is the deterministic actor (DPG) or the soft Boltzmann policy (SAC), replacing the discrete $\arg\max$: approximate policy iteration (Chapter 1) in a continuum.
- Overestimation control (TD3’s twin-min) is the same bias management as Double DQN (Chapter 6), now load-bearing because the bootstrap runs through a critic.
- To Part II. KL-regularized control, the maximum-entropy LQR with its closed form, and the deterministic limit ($\alpha\to 0$) that is classical optimal control are the entry points to LQR (Week 13) and MPC (Week 15), the model-based, deterministic side of the same fixed point.
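The KL-regularized reading of the soft optimum can be checked in the one-step, finite-action case. This is a sketch under stated assumptions: `kl_regularized_optimum` and `kl_objective` are hypothetical helpers, and the action-values and temperature are arbitrary. With a reference distribution `prior`, the maximizer of $\mathbb{E}_\pi[Q] - \alpha\,\mathrm{KL}(\pi\,\|\,\pi_0)$ is $\pi \propto \pi_0 \exp(Q/\alpha)$ with value $\alpha\log\sum_a \pi_0(a)\exp(Q(a)/\alpha)$. A uniform prior gives back the Boltzmann policy.

```python
import math

def kl_regularized_optimum(q_values, prior, alpha):
    """Return (policy, value) maximizing E_pi[Q] - alpha * KL(pi || prior)."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [p0 * math.exp((q - m) / alpha) for q, p0 in zip(q_values, prior)]
    z = sum(weights)
    policy = [w / z for w in weights]
    value = m + alpha * math.log(z)  # alpha * log sum_a prior(a) * exp(Q(a) / alpha)
    return policy, value

def kl_objective(pi, q_values, prior, alpha):
    """E_pi[Q] - alpha * KL(pi || prior) for discrete distributions."""
    expected_q = sum(p * q for p, q in zip(pi, q_values))
    kl = sum(p * math.log(p / p0) for p, p0 in zip(pi, prior) if p > 0.0)
    return expected_q - alpha * kl

q = [1.0, 2.0, 0.5]          # illustrative action-values
alpha = 0.7                  # illustrative temperature
uniform = [1.0 / 3.0] * 3    # uniform reference policy

pi, value = kl_regularized_optimum(q, uniform, alpha)
print("policy under uniform reference:", [round(p, 4) for p in pi])
print("KL-regularized value:", round(value, 4))
print("objective at the optimum:", round(kl_objective(pi, q, uniform, alpha), 4))
```

Note that `exp(value / alpha)` is a prior-weighted sum of `exp(Q / alpha)`, which is the one-step version of the linearity that the exponential transform exposes.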
What’s next
- Week 10 steps back to ask where RL has actually worked in the real world — a survey of robotics successes (locomotion, manipulation, flight, plasma control) and the conditions that made them possible.
- Part II (Week 11+) then changes register entirely, to control theory: stability, LQR, and model predictive control — met now from the RL side, and rejoined with it in Part III.
Exercises
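- (Derive) Starting from $J(\theta) = \mathbb{E}_{s\sim\rho}\left[Q^{\mu}(s,\mu_\theta(s))\right]$, derive the deterministic policy gradient by the chain rule (Theorem 9.1).
  Solution: $\nabla_\theta Q^{\mu}(s,\mu_\theta(s)) = \nabla_\theta\mu_\theta(s)\,\nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}$ by the chain rule (the explicit $\theta$-dependence of $Q^{\mu}$ is held fixed); taking the expectation over $s\sim\rho$ gives Theorem 9.1. The state distribution $\rho$ may be off-policy, which is why DDPG can learn from a replay buffer.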
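- (Prove) Show the maximum-entropy one-step optimal policy is $\pi^*(a\mid s)\propto\exp(Q(s,a)/\alpha)$ with soft value $V(s) = \alpha\log\sum_a\exp(Q(s,a)/\alpha)$ (Prop. 9.1).
  Solution: Maximize $\sum_a \pi(a)\,Q(s,a) - \alpha\sum_a\pi(a)\log\pi(a)$ under $\sum_a\pi(a) = 1$; stationarity gives $Q(s,a) - \alpha(\log\pi(a) + 1) - \lambda = 0$, so $\pi(a)\propto\exp(Q(s,a)/\alpha)$. Substituting the normalized policy back yields the log-sum-exp soft value. The limit $\alpha\to 0$ recovers the hard $\max_a Q(s,a)$.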
- (Compute) A target state has twin-critic estimates $Q_1(s',a')$, $Q_2(s',a')$ for the target action. What value does TD3 use, and why is the minimum the safer choice for a bootstrap target?
  Solution: TD3 uses $\min\big(Q_1(s',a'),\,Q_2(s',a')\big)$. The single-critic max/overestimate (Prop. 6.1) compounds through bootstrapping; taking the minimum of two independent estimates biases downward, and a slight underestimate does not amplify across the backup the way an overestimate does.
- (Implement) In the companion, verify that the twin-critic minimum lowers the target versus a single critic; that target-policy-smoothing noise is clipped to its range; that the Polyak soft target update behaves as expected; and that minimal TD3 learns Pendulum above the random return.
  Solution: See experiments/python/week09/test_td3.py: the clipped-double-Q target equals the per-sample minimum of the twin critics (≤ either); the smoothing noise and target action respect their clip bounds; the Polyak update matches its closed form; and a seeded TD3 run on Pendulum-v1 clears the random-return baseline by a wide margin.
- (Extend) Sweep the SAC temperature $\alpha$ and relate the limit $\alpha\to 0$ (greedy) and large $\alpha$ (uniform). (The roadmap’s JAX/Brax SAC baseline is deferred to the dedicated JAX track.)
  Solution: Small $\alpha$ concentrates the Boltzmann policy on the $\arg\max$ (exploitation, recovering ordinary RL); large $\alpha$ flattens it toward uniform (maximal exploration). Automatic temperature tuning (Haarnoja et al., 2019) adjusts $\alpha$ to hold a target entropy rather than fixing it by hand.
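The idea of holding a target entropy, mentioned in the last solution, can be illustrated on a toy problem. This is a simplified sketch and not SAC itself: the policy is the exact Boltzmann policy for a fixed, hand-picked vector of action-values rather than a learned actor, and the names `tune_temperature`, `boltzmann`, and `entropy` are hypothetical. The update descends the temperature loss $\alpha\,(\mathcal{H}(\pi) - \bar{\mathcal{H}})$ in $\log\alpha$, following the common practice of parameterizing the temperature by its logarithm.

```python
import math

def boltzmann(q_values, alpha):
    """Softmax of Q / alpha with the max subtracted for stability."""
    m = max(q_values)
    w = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(w)
    return [x / z for x in w]

def entropy(pi):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in pi if p > 0.0)

def tune_temperature(q_values, target_entropy, alpha=1.0, lr=0.05, steps=2000):
    """Adjust log(alpha) until the Boltzmann policy's entropy meets the target."""
    log_alpha = math.log(alpha)
    for _ in range(steps):
        h = entropy(boltzmann(q_values, math.exp(log_alpha)))
        # Entropy above target -> lower the temperature; below target -> raise it.
        log_alpha -= lr * (h - target_entropy)
    return math.exp(log_alpha)

q = [1.0, 2.0, 0.5]      # illustrative action-values
target = 0.8             # illustrative target entropy, below log(3)
alpha_star = tune_temperature(q, target)
print(f"tuned alpha = {alpha_star:.4f}, "
      f"entropy = {entropy(boltzmann(q, alpha_star)):.4f}")
```

Because the Boltzmann policy's entropy increases with the temperature, this feedback rule settles where the entropy equals the target.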
Companion code
The Week-9 companion lives at experiments/python/week09/ and is a minimal TD3 on
Pendulum-v1 (the chapter’s testable centerpiece), with Stable-Baselines3 named as the
reference baseline.
- td3.py: a continuous-action ReplayBuffer, a deterministic Actor and twin Critics, the exposed td3_target (clipped double-Q with target-policy smoothing), Polyak soft_update, and the training loop. Pure PyTorch.
- test_td3.py: component-correctness tests (the twin-critic minimum lowers the target; smoothing-noise and target-action clipping; the Polyak update’s closed form) plus a seeded Pendulum-v1 run learning well above the random return.
# component tests + a seeded Pendulum learning check (PyTorch; ~1-2 min on CPU)
PYTHONPATH=. pytest experiments/python/week09/test_td3.py -q
# worked minimal-TD3 training run on Pendulum
PYTHONPATH=. python experiments/python/week09/td3.py --steps 40000
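For readers who want to see what a Polyak closed-form check can look like without running the companion, here is a list-based sketch. It is not the companion's PyTorch `soft_update`: the names `soft_update_sketch` and `closed_form` are hypothetical, parameters are plain Python floats, and the closed form shown assumes the online parameters stay fixed across updates.

```python
def soft_update_sketch(target, online, tau):
    """Polyak averaging, elementwise: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]

def closed_form(target0, online, tau, k):
    """Target after k soft updates when the online parameters do not change."""
    decay = (1.0 - tau) ** k
    return [decay * t + (1.0 - decay) * o for t, o in zip(target0, online)]

online = [1.0, -2.0, 0.5]    # illustrative online-network parameters
target0 = [0.0, 0.0, 0.0]    # illustrative initial target parameters
tau, k = 0.005, 50           # illustrative step size and number of updates

target = list(target0)
for _ in range(k):
    target = soft_update_sketch(target, online, tau)

expected = closed_form(target0, online, tau, k)
print("iterated:   ", [round(x, 6) for x in target])
print("closed form:", [round(x, 6) for x in expected])
```

The geometric decay factor makes the lag explicit: with a small step size the target tracks the online parameters slowly, which is what keeps the bootstrap target stable.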