Off-Policy Continuous Control: DDPG, TD3, and SAC

Off-policy actor-critic for continuous actions: the deterministic policy gradient (DDPG), the twin-critic overestimation fix (TD3), and maximum-entropy RL (SAC). Why the soft-optimal policy is Boltzmann in the action-value, and how maximum-entropy RL is KL-regularized optimal control — the bridge from learning to Part II.

Where we are. Weeks 7–8 optimized stochastic policies on-policy. This chapter, the last of the RL foundations, turns to off-policy continuous control, where two ideas dominate the model-free baselines. First, the deterministic policy gradient (DDPG) makes continuous actions tractable by pushing the gradient through a differentiable critic instead of averaging over sampled actions. Second, maximum-entropy RL (SAC) augments the reward with policy entropy, which both stabilizes learning and reveals a deep identity: maximum-entropy RL is KL-regularized optimal control, the bridge out of learning and into Part II. Between them, TD3 fixes the overestimation that Chapter 6 first diagnosed, now arising from the critic’s own bootstrap.

Chapter 9 — at a glance

Goal. State the deterministic policy gradient and read DDPG off it; see how TD3’s twin critics fix overestimation; derive the soft-optimal Boltzmann policy of maximum-entropy RL; and identify maximum-entropy RL with KL-regularized control.

Reading time. ~35 minutes; ~55 with the proofs and exercises.

Key insight: the DP bridge. Continuous actions turn the Bellman backup’s $\max_a$ into an optimization over a continuum: DDPG/TD3 solve it with a gradient actor through the critic; SAC softens it into a $\log\!\sum\exp$. That soft maximum is exactly the value function of KL-regularized control (Todorov’s linearly-solvable MDPs), and the soft Bellman operator remains a contraction. This is where reinforcement learning and optimal control finally speak the same language: the LQR and MPC of Part II are the model-based, deterministic limit.

The deterministic policy gradient

With a continuous action space the stochastic policy gradient (Chapter 7) must average the score over actions — expensive and high-variance. A deterministic policy $\mu_\theta:\statespace\to\actionspace$ avoids the action integral, and its gradient flows through the critic by the chain rule.

Theorem 9.1 (Deterministic policy gradient).

For a deterministic policy $\mu_\theta$ and its action-value $\qfn_{\mu_\theta}$, under mild regularity the gradient of the off-policy objective is

$$
\nabla_\theta J(\theta) = \E_{s\sim\rho}\!\Big[\,\nabla_\theta\mu_\theta(s)\,\nabla_a \qfn_{\mu_\theta}(s,a)\big|_{a=\mu_\theta(s)}\Big],
$$

where $\rho$ is the (off-policy) state distribution. The policy is improved by pushing its output up the critic’s action-gradient.

This is the continuous-action analogue of greedy improvement: where a discrete agent takes $\argmax_a \qfn$, the deterministic actor takes a gradient step in $a$ toward larger $\qfn$.
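To make the chain rule concrete, here is a minimal sketch of the update in Theorem 9.1 on a toy problem. Everything in it is an illustrative assumption rather than the companion code: a scalar linear actor $\mu_\theta(s)=\theta s$, a fixed and known critic $Q(s,a)=-(a-2s)^2$ standing in for a learned one, and states drawn uniformly from $[-1,1]$.

```python
import random

def critic_action_grad(s, a):
    """dQ/da for the assumed toy critic Q(s, a) = -(a - 2*s)**2."""
    return -2.0 * (a - 2.0 * s)

def dpg_step(theta, states, lr):
    """One gradient-ascent step on theta for the actor mu_theta(s) = theta * s."""
    grad = 0.0
    for s in states:
        a = theta * s                      # action chosen by the deterministic actor
        dmu_dtheta = s                     # d mu_theta(s) / d theta
        grad += dmu_dtheta * critic_action_grad(s, a)   # chain rule through the critic
    grad /= len(states)                    # sample average over the state batch
    return theta + lr * grad

rng = random.Random(0)
theta = 0.0
for _ in range(500):
    batch = [rng.uniform(-1.0, 1.0) for _ in range(64)]  # stand-in for replay states
    theta = dpg_step(theta, batch, lr=0.1)

print(theta)  # approaches 2.0, since a = 2*s maximizes the toy critic
```

No action is ever sampled: the actor moves only because the critic reports which direction in $a$ increases the value.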

Silver et al. (2014) proved the theorem (as the zero-variance limit of the stochastic gradient); Lillicrap et al. (2016) turned it into DDPG: the deterministic actor and critic trained off-policy with the replay buffer and target networks of Chapter 6, with exploration noise added to the actor’s output.

Overestimation returns: TD3

DDPG inherits the deadly-triad fragility and the overestimation bias of Chapter 6, now produced by the critic bootstrapping on its own optimistic estimates. TD3 (Fujimoto et al., 2018) applies three fixes, the first echoing Double DQN:

Twin critics, take the minimum. Train two critics $\qfn_{\phi_1},\qfn_{\phi_2}$ and form the target with $\min(\qfn_{\phi_1},\qfn_{\phi_2})$ — clipped double Q-learning, a continuous cousin of Double DQN that caps the upward bias of Proposition 6.1.
Delayed policy updates. Update the actor (and targets) less often than the critics, so the policy chases a more settled value.
Target policy smoothing. Add clipped noise to the target action, regularizing the critic against sharp peaks it could otherwise exploit.

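The first is the load-bearing one. The minimum of two noisy estimates is biased downward, so the clipped target tends to underestimate, which is far safer for a bootstrapped target than the overestimate a single critic produces when the actor climbs its errors. The twin critics share a replay buffer, so they are only approximately independent.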
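The target-side fixes (smoothing and the twin minimum) can be sketched for a single scalar transition as below. This is a hedged illustration, not the companion's `td3_target`: the name `td3_target_sketch`, its signature, the scalar action, and the default hyperparameters are all assumptions made for the example.

```python
import random

def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(hi, x))

def td3_target_sketch(reward, done, next_state, actor_targ, q1_targ, q2_targ,
                      gamma=0.99, noise_std=0.2, noise_clip=0.5,
                      act_limit=1.0, rng=random):
    """Clipped double-Q target with target-policy smoothing (hypothetical helper)."""
    # Target policy smoothing: clipped Gaussian noise on the target action.
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    next_action = clip(actor_targ(next_state) + noise, -act_limit, act_limit)
    # Clipped double-Q: bootstrap from the smaller of the two target critics.
    q_min = min(q1_targ(next_state, next_action),
                q2_targ(next_state, next_action))
    # No bootstrap on terminal transitions (done = 1.0).
    return reward + gamma * (1.0 - done) * q_min

# Constant stand-ins for the target networks, using the values 2.0 and 1.4.
actor = lambda s: 0.9
q1 = lambda s, a: 2.0
q2 = lambda s, a: 1.4

y = td3_target_sketch(1.0, 0.0, 0.0, actor, q1, q2, rng=random.Random(0))
y_terminal = td3_target_sketch(1.0, 1.0, 0.0, actor, q1, q2, rng=random.Random(0))
print(y, y_terminal)  # y = 1.0 + 0.99 * 1.4; y_terminal is just the reward
```

The delayed actor update is a scheduling choice in the training loop rather than part of the target, so it does not appear here.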

Maximum-entropy RL and SAC

A different idea reshapes the objective itself. Maximum-entropy RL adds the policy’s entropy $\mathcal{H}(\policy(\cdot\mid s))$ to the reward at every step, scaled by a temperature $\alpha$:

$$
J(\policy) = \E_\policy\!\Big[\sum_t \Big(\reward(S_t,A_t) + \alpha\,\mathcal{H}\big(\policy(\cdot\mid S_t)\big)\Big)\Big].
$$

The agent is rewarded for acting well and for staying stochastic, which sustains exploration and robustness. Soft actor-critic (Haarnoja et al., 2018) is the off-policy actor-critic for this objective, with an automatically tuned temperature (Haarnoja et al., 2019). The one-step soft problem has a clean optimum.

Proposition 9.1 (The soft-optimal policy is Boltzmann).

Maximizing $\sum_a \policy(a\mid s)\,\qfn(s,a) + \alpha\,\mathcal{H}(\policy(\cdot\mid s))$ over distributions $\policy(\cdot\mid s)$ gives the Boltzmann policy

$$
\policy^*(a\mid s) = \frac{\exp\!\big(\qfn(s,a)/\alpha\big)}{\sum_{a'}\exp\!\big(\qfn(s,a')/\alpha\big)},
$$

with optimal soft value $\valuefn^*_{\text{soft}}(s) = \alpha\log\sum_a\exp(\qfn(s,a)/\alpha)$. As $\alpha\to0$ this recovers the greedy $\argmax$ and the ordinary value.

Proof.

Write $\mathcal{H}(\policy)=-\sum_a\policy(a\mid s)\log\policy(a\mid s)$ and maximize $\sum_a\policy(a\mid s)\big[\qfn(s,a)-\alpha\log\policy(a\mid s)\big]$ subject to $\sum_a\policy(a\mid s)=1$. The Lagrangian’s stationarity in $\policy(a\mid s)$ gives $\qfn(s,a)-\alpha\log\policy(a\mid s)-\alpha-\lambda=0$, so $\log\policy(a\mid s) = \qfn(s,a)/\alpha + \text{const}$, i.e. $\policy(a\mid s)\propto\exp(\qfn(s,a)/\alpha)$. Normalizing gives the Boltzmann form; substituting back gives $\valuefn^*_{\text{soft}}(s)=\alpha\log\sum_a\exp(\qfn(s,a)/\alpha)$, the log-sum-exp “soft maximum.” As $\alpha\to0$ the soft max $\to\max_a\qfn$ and $\policy^*\to$ greedy. $\qquad\blacksquare$

The entropy bonus of Chapter 7 has been promoted from a heuristic to the objective, and its optimum is a temperature-controlled softmax over the action-value.
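Proposition 9.1 is easy to check numerically for a single state. The sketch below assumes three discrete actions with made-up action-values and a temperature of 0.5; it confirms that the entropy-regularized objective evaluated at the Boltzmann policy equals the log-sum-exp soft value, that another policy scores lower, and that a small temperature approaches the hard maximum.

```python
import math

def boltzmann(q, alpha):
    """Softmax of q / alpha, shifted by max(q) for numerical stability."""
    m = max(q)
    w = [math.exp((qi - m) / alpha) for qi in q]
    z = sum(w)
    return [wi / z for wi in w]

def soft_value(q, alpha):
    """alpha * log sum_a exp(q_a / alpha), computed stably."""
    m = max(q)
    return m + alpha * math.log(sum(math.exp((qi - m) / alpha) for qi in q))

def soft_objective(pi, q, alpha):
    """Expected action-value plus alpha times the entropy of pi."""
    expected_q = sum(p * qi for p, qi in zip(pi, q))
    entropy = -sum(p * math.log(p) for p in pi if p > 0.0)
    return expected_q + alpha * entropy

q = [1.0, 0.5, -0.2]       # assumed action-values for one state
alpha = 0.5                # assumed temperature

pi_star = boltzmann(q, alpha)
gap = soft_objective(pi_star, q, alpha) - soft_value(q, alpha)   # zero up to rounding
uniform_score = soft_objective([1 / 3, 1 / 3, 1 / 3], q, alpha)  # a suboptimal policy
near_greedy_value = soft_value(q, 1e-3)                          # close to max(q)

print(pi_star, gap, uniform_score, near_greedy_value)
```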

The dynamic-programming bridge

Maximum-entropy RL is where learning rejoins control. The soft value $\alpha\log\sum_a\exp(\qfn/\alpha)$ is, up to an additive constant, the optimal value (the negated optimal cost-to-go) of a KL-regularized control problem: maximizing reward minus $\alpha\,\mathrm{KL}(\policy\,\|\,\policy_0)$ against a reference $\policy_0$ yields $\policy^*\propto\policy_0\exp(\qfn/\alpha)$, which for a uniform $\policy_0$ is exactly the Boltzmann policy of Proposition 9.1. Todorov’s linearly-solvable MDPs exploit precisely this structure: under the exponential transform $z = \exp(\valuefn_{\text{soft}}/\alpha)$ the soft Bellman equation becomes linear. Three threads close Part I:
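The one-step version of this identity follows from Proposition 9.1 by absorbing the reference into the action-value. The derivation below is a sketch for finitely many actions; the symbol $\valuefn^*_{\mathrm{KL}}$ is notation introduced here for the KL-regularized optimum, and $\policy_0$ is assumed to have full support.

```latex
\begin{aligned}
\sum_a \policy(a\mid s)\,\qfn(s,a) - \alpha\,\mathrm{KL}\big(\policy\,\|\,\policy_0\big)
  &= \sum_a \policy(a\mid s)\Big[\underbrace{\qfn(s,a) + \alpha\log\policy_0(a\mid s)}_{\tilde{\qfn}(s,a)}
     - \alpha\log\policy(a\mid s)\Big], \\
\policy^*(a\mid s) &\propto \exp\!\big(\tilde{\qfn}(s,a)/\alpha\big)
   = \policy_0(a\mid s)\,\exp\!\big(\qfn(s,a)/\alpha\big), \\
\valuefn^*_{\mathrm{KL}}(s) &= \alpha\log\sum_a \policy_0(a\mid s)\,\exp\!\big(\qfn(s,a)/\alpha\big), \\
\policy_0 \text{ uniform over } |\actionspace| \text{ actions:}\quad
\valuefn^*_{\mathrm{KL}}(s) &= \valuefn^*_{\text{soft}}(s) - \alpha\log|\actionspace| .
\end{aligned}
```

With a uniform reference, $\mathrm{KL}(\policy\,\|\,\policy_0)=\log|\actionspace|-\mathcal{H}(\policy)$, so the KL penalty and the entropy bonus differ only by a constant that does not affect the optimal policy.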

Continuous-action improvement is the deterministic actor (DPG) or the soft Boltzmann policy (SAC), replacing the discrete $\argmax$ — approximate policy iteration (Chapter 1) in a continuum.
Overestimation control (TD3’s twin-min) is the same bias management as Double DQN (Chapter 6), now load-bearing because the bootstrap runs through a critic.
To Part II. KL-regularized control, the maximum-entropy LQR with its closed form, and the deterministic limit ($\alpha\to0$) that is classical optimal control are the entry points to LQR (Week 13) and MPC (Week 15), the model-based, deterministic side of the same fixed point.

What’s next

Week 10 steps back to ask where RL has actually worked in the real world — a survey of robotics successes (locomotion, manipulation, flight, plasma control) and the conditions that made them possible.
Part II (Week 11+) then changes register entirely, to control theory: stability, LQR, and model predictive control — met now from the RL side, and rejoined with it in Part III.

Exercises

(Derive) Starting from $J(\theta)=\E_{s\sim\rho}[\qfn_{\mu_\theta}(s,\mu_\theta(s))]$, derive the deterministic policy gradient by the chain rule (Theorem 9.1).

Solution
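$\nabla_\theta\qfn(s,\mu_\theta(s)) = \nabla_\theta\mu_\theta(s)\,\nabla_a\qfn(s,a)|_{a=\mu_\theta(s)}$ by the chain rule, treating the critic as a fixed function of $(s,a)$: the dependence of $\qfn_{\mu_\theta}$ on $\theta$ through the policy it evaluates is dropped, which is the approximation made in the off-policy deterministic policy gradient of Silver et al. (2014). Taking the expectation over $s\sim\rho$ gives Theorem 9.1. The state distribution $\rho$ may be off-policy, which is why DDPG can learn from a replay buffer.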
(Prove) Show the maximum-entropy one-step optimal policy is $\policy^*(a\mid s)\propto\exp(\qfn(s,a)/\alpha)$ with soft value $\alpha\log\sum_a\exp(\qfn(s,a)/\alpha)$ (Prop. 9.1).

Solution
Maximize $\sum_a\policy(\qfn-\alpha\log\policy)$ under $\sum_a\policy=1$; stationarity gives $\qfn-\alpha\log\policy-\alpha-\lambda=0$, so $\policy\propto\exp(\qfn/\alpha)$. Substituting the normalized policy back yields the log-sum-exp soft value. The $\alpha\to0$ limit recovers the hard $\max$.
(Compute) A target state has twin-critic estimates $\qfn_{\phi_1}=2.0$, $\qfn_{\phi_2}=1.4$ for the target action. What value does TD3 use, and why is the minimum the safer choice for a bootstrap target?

Solution
TD3 uses $\min(2.0,1.4)=1.4$. A single critic’s overestimate (Prop. 6.1) compounds through bootstrapping; taking the minimum of two estimates biases the target downward, and a slight underestimate does not amplify across the backup the way an overestimate does.
(Implement) In the companion, verify the twin-critic minimum lowers the target versus a single critic; that target-policy-smoothing noise is clipped to range; the Polyak soft target update; and that minimal TD3 learns Pendulum above the random return.

Solution
See experiments/python/week09/test_td3.py: the clipped-double-Q target equals the per-sample minimum of the twin critics (≤ either); the smoothing noise and target action respect their clip bounds; the Polyak update matches its closed form; and a seeded TD3 run on Pendulum-v1 clears the random-return baseline by a wide margin.
(Extend) Sweep the SAC temperature $\alpha$ and relate the limit $\alpha\to0$ (greedy) and large $\alpha$ (uniform). (The roadmap’s JAX/Brax SAC baseline is deferred to the dedicated JAX track.)

Solution
Small $\alpha$ concentrates the Boltzmann policy on the $\argmax$ (exploitation, recovering ordinary RL); large $\alpha$ flattens it toward uniform (maximal exploration). Automatic temperature tuning (Haarnoja et al., 2019) adjusts $\alpha$ to hold a target entropy rather than fixing it by hand.
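The temperature sweep and the idea behind automatic tuning can both be tried on a single state. The sketch below is an assumption-laden toy, not SAC itself: the action-values are fixed and made up, the policy is the exact Boltzmann policy rather than a learned actor, and the temperature is updated directly by gradient descent on $J(\alpha)=\E[-\alpha\log\policy-\alpha\bar{\mathcal{H}}]$, whose derivative is the current entropy minus the target $\bar{\mathcal{H}}$ (practical implementations often optimize $\log\alpha$ instead).

```python
import math

def boltzmann(q, alpha):
    """Softmax of q / alpha, shifted by max(q) for numerical stability."""
    m = max(q)
    w = [math.exp((qi - m) / alpha) for qi in q]
    z = sum(w)
    return [wi / z for wi in w]

def entropy(pi):
    """Shannon entropy in nats, skipping zero-probability actions."""
    return -sum(p * math.log(p) for p in pi if p > 0.0)

q = [1.0, 0.5, -0.2]                      # assumed action-values for one state

# Sweep: entropy rises from about 0 (greedy) toward log(3) (uniform).
sweep = {a: entropy(boltzmann(q, a)) for a in (0.01, 0.1, 1.0, 10.0, 100.0)}
for a, h in sweep.items():
    print(f"alpha={a:>6}: entropy={h:.4f}")

# Tuning: lower alpha when entropy exceeds the target, raise it otherwise.
target_entropy = 0.8                      # assumed target, below log(3)
alpha, lr = 1.0, 0.1
for _ in range(2000):
    h = entropy(boltzmann(q, alpha))
    alpha = max(1e-3, alpha - lr * (h - target_entropy))   # keep alpha positive

tuned_entropy = entropy(boltzmann(q, alpha))
print(alpha, tuned_entropy)               # tuned_entropy settles at the target
```

Because the Boltzmann policy's entropy increases with the temperature, this update has a single fixed point where the entropy matches the target.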

Companion code

The Week-9 companion lives at experiments/python/week09/ and is a minimal TD3 on Pendulum-v1 (the chapter’s testable centerpiece), with Stable-Baselines3 named as the reference baseline.

td3.py — a continuous-action ReplayBuffer, a deterministic Actor and twin Critics, the exposed td3_target (clipped double-Q with target-policy smoothing), Polyak soft_update, and the training loop. Pure PyTorch.
test_td3.py — component-correctness tests (the twin-critic minimum lowers the target; smoothing-noise and target-action clipping; the Polyak update’s closed form) plus a seeded Pendulum-v1 run learning well above the random return.
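As a rough illustration of the Polyak component, the sketch below runs the soft update on plain Python lists and compares it with a closed form. It is a hypothetical stand-in: the companion's `soft_update` works on PyTorch networks and its signature is not shown in this chapter, and the closed form used here (the target after $k$ updates against fixed online weights) is one assumption about what such a test could check.

```python
def soft_update_sketch(target, online, tau):
    """In-place Polyak averaging: target <- (1 - tau) * target + tau * online."""
    for i, w in enumerate(online):
        target[i] = (1.0 - tau) * target[i] + tau * w

online = [1.0, -2.0, 0.5]        # assumed online-network weights, held fixed
target = [0.0, 0.0, 0.0]         # assumed initial target-network weights
start = list(target)
tau, k = 0.005, 200              # assumed update rate and number of updates

for _ in range(k):
    soft_update_sketch(target, online, tau)

# After k updates the initial target weights have decayed by (1 - tau)**k.
decay = (1.0 - tau) ** k
closed_form = [decay * t0 + (1.0 - decay) * w for t0, w in zip(start, online)]
print(target, closed_form)
```

A small `tau` makes the target network a slowly moving average of the online network, which is what keeps the bootstrap target from chasing every critic update.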

# component tests + a seeded Pendulum learning check (PyTorch; ~1-2 min on CPU)
PYTHONPATH=. pytest experiments/python/week09/test_td3.py -q

# worked minimal-TD3 training run on Pendulum
PYTHONPATH=. python experiments/python/week09/td3.py --steps 40000