Part I · Foundations Week 5 Published fa.py test_fa.py

Function Approximation and the Deadly Triad

Replacing the value table with a parametric approximator: linear value functions, semi-gradient TD, and the projected Bellman operator. Why on-policy semi-gradient TD converges to a bounded-error fixed point, and why function approximation, bootstrapping, and off-policy training together can diverge — the deadly triad, witnessed by Baird's counterexample.

On this page
  1. Linear approximation and semi-gradient TD
  2. The projected Bellman operator
  3. When it works: on-policy convergence
  4. The deadly triad
  5. Beyond semi-gradient: LSTD and true-gradient methods
  6. The dynamic-programming bridge
  7. What’s next
  8. Exercises
  9. Companion code

Function Approximation and the Deadly Triad

Where we are. Every method so far stored one number per state in a table. That ends the moment the state space is large or continuous: we must approximate the value function with a parametric model VθV_\theta and let it generalize across states. This chapter is the first where the clean contraction story of Chapters 1–4 breaks. Two results frame it. On-policy, semi-gradient TD still converges — but to a projected fixed point whose error is the representation’s projection error, amplified by 1/(1γ)1/(1-\discount): approximation changes the fixed point itself. Off-policy, the same update can diverge. The combination that breaks is named the deadly triad — function approximation, bootstrapping, and off-policy training — and Baird’s counterexample is its sharpest witness.

Linear approximation and semi-gradient TD

Replace the table VRSV \in \R^{\statespace} with a parametric VθV_\theta. We focus on the linear case, where a feature map ϕ:SRd\phi:\statespace\to\R^d (dSd \ll \lvert\statespace\rvert) gives

Vθ(s)=θϕ(s),θRd.V_\theta(s) = \theta^\top \phi(s), \qquad \theta \in \R^d .

The representable value functions form a dd-dimensional subspace {Φθ:θRd}\{\Phi\theta : \theta\in\R^d\} of RS\R^{\statespace}, where Φ\Phi stacks the feature rows. Learning now adjusts θ\theta, and a value learned at one state moves the values of all states sharing features — generalization, the whole point.

Definition 5.1 (Semi-gradient TD(0)).

Given a transition (St,Rt+1,St+1)(S_t, R_{t+1}, S_{t+1}), semi-gradient TD(0) updates

θ    θ+αδtϕ(St),δt=Rt+1+γθϕ(St+1)θϕ(St).\theta \;\leftarrow\; \theta + \alpha\,\delta_t\,\phi(S_t), \qquad \delta_t = R_{t+1} + \discount\,\theta^\top\phi(S_{t+1}) - \theta^\top\phi(S_t).

It is called semi-gradient because it treats the bootstrap target γθϕ(St+1)\discount\,\theta^\top\phi(S_{t+1}) as fixed — it does not differentiate the target with respect to θ\theta, so the update is not the true gradient of any fixed objective.

That last point is the crux of the chapter.

A true-gradient method on a fixed loss cannot diverge to infinity; semi-gradient TD can, because the “target” it descends toward moves with θ\theta.

The projected Bellman operator

Where does semi-gradient TD converge, when it does? Not to TπVθ\bellman^\policy V_\theta — that generally leaves the representable subspace and is not expressible as any θ\theta. The update implicitly projects it back.

Definition 5.2 (Projection and the TD fixed point).

Let μ\mu be a distribution over states and x,yμ=sμ(s)x(s)y(s)\langle x,y\rangle_\mu = \sum_s \mu(s)x(s)y(s) the weighted inner product, with norm μ\norm{\cdot}_\mu. The projection Πμ\Pi_\mu maps a value function to the nearest representable one, Πμx=argminθΦθxμ\Pi_\mu x = \arg\min_{\theta}\norm{\Phi\theta - x}_\mu. The TD fixed point is the VθV_\theta satisfying

Vθ=ΠμTπVθ,V_\theta = \Pi_\mu\,\bellman^\policy V_\theta ,

the fixed point of the composed projected Bellman operator ΠμTπ\Pi_\mu\bellman^\policy.

Semi-gradient TD is a stochastic-approximation scheme (Chapter 4) for this fixed point: in expectation, under the state-visitation distribution μ\mu its update drives θ\theta toward ΠμTπVθ\Pi_\mu\bellman^\policy V_\theta. Whether that iteration converges hinges entirely on whether ΠμTπ\Pi_\mu\bellman^\policy is a contraction — which depends on μ\mu.

When it works: on-policy convergence

Proposition 5.1 (On-policy convergence and the projection-error bound).

Let μ\mu be the on-policy stationary distribution of π\policy (so μPπ=μ\mu^\top P_\policy = \mu^\top). Then Tπ\bellman^\policy is a γ\discount-contraction in μ\norm{\cdot}_\mu, and Πμ\Pi_\mu is nonexpansive in μ\norm{\cdot}_\mu, so ΠμTπ\Pi_\mu\bellman^\policy is a γ\discount-contraction with a unique fixed point VθV_{\theta^\star}. Its error is the projection error, amplified by the horizon:

Vθvπμ    11γΠμvπvπμ.\norm{V_{\theta^\star} - \valuefn_\policy}_\mu \;\le\; \frac{1}{1-\discount}\, \norm{\Pi_\mu \valuefn_\policy - \valuefn_\policy}_\mu .
Proof.

Two facts compose. (i) Under the stationary μ\mu, the transition operator is nonexpansive in μ\norm{\cdot}_\mu:

Pπxμ2=sμ(s)(sPπ(ss)x(s))2(definition)sμ(s)sPπ(ss)x(s)2(Jensen; Pπ(s) a distribution)=s(sμ(s)Pπ(ss))x(s)2=sμ(s)x(s)2=xμ2(μPπ=μ).\begin{aligned} \norm{P_\policy x}_\mu^2 &= \sum_s \mu(s)\Big(\sum_{s'}P_\policy(s'\mid s)\,x(s')\Big)^2 && \text{(definition)} \\ &\le \sum_s \mu(s)\sum_{s'}P_\policy(s'\mid s)\,x(s')^2 && \text{(Jensen; }P_\policy(\cdot\mid s)\text{ a distribution)} \\ &= \sum_{s'}\Big(\sum_s \mu(s)P_\policy(s'\mid s)\Big)x(s')^2 = \sum_{s'}\mu(s')\,x(s')^2 = \norm{x}_\mu^2 && \text{(}\mu^\top P_\policy = \mu^\top\text{).} \end{aligned}

Since TπxTπy=γPπ(xy)\bellman^\policy x - \bellman^\policy y = \discount P_\policy(x-y), this gives TπxTπyμγxyμ\norm{\bellman^\policy x - \bellman^\policy y}_\mu \le \discount\norm{x-y}_\mu. (ii) Πμ\Pi_\mu is an orthogonal projection in ,μ\langle\cdot,\cdot\rangle_\mu, hence nonexpansive. So ΠμTπ\Pi_\mu\bellman^\policy is a γ\discount-contraction and has a unique fixed point VθV_{\theta^\star} (Banach, Ch. 1). For the bound, since Tπvπ=vπ\bellman^\policy\valuefn_\policy = \valuefn_\policy we have Πμvπ=ΠμTπvπ\Pi_\mu\valuefn_\policy = \Pi_\mu\bellman^\policy\valuefn_\policy, so

VθΠμvπμ=ΠμTπVθΠμTπvπμγVθvπμ.\norm{V_{\theta^\star} - \Pi_\mu\valuefn_\policy}_\mu = \norm{\Pi_\mu\bellman^\policy V_{\theta^\star} - \Pi_\mu\bellman^\policy\valuefn_\policy}_\mu \le \discount\norm{V_{\theta^\star} - \valuefn_\policy}_\mu .

With the triangle inequality VθvπμVθΠμvπμ+Πμvπvπμ\norm{V_{\theta^\star}-\valuefn_\policy}_\mu \le \norm{V_{\theta^\star}-\Pi_\mu\valuefn_\policy}_\mu + \norm{\Pi_\mu\valuefn_\policy - \valuefn_\policy}_\mu, substitute and solve to get the stated bound. \qquad\blacksquare

Two readings. First, convergence is conditional: it rests on μ\mu being the on-policy distribution, the one fact that makes PπP_\policy nonexpansive.

Second, the fixed point moved: even at convergence the error is the projection error Πμvπvπμ\norm{\Pi_\mu\valuefn_\policy - \valuefn_\policy}_\mu — what the feature space cannot represent — blown up by 1/(1γ)1/(1-\discount). With a table the projection error is zero and we recover vπ\valuefn_\policy exactly; with a poor feature space the fixed point can be far from vπ\valuefn_\policy. Tsitsiklis and Van Roy Tsitsiklis & Van Roy (1997) established exactly this convergence and bound for on-policy linear TD.

The deadly triad

Proposition 5.1 needed all three of its ingredients to be benign. Drop the on-policy assumption and the guarantee collapses. The instability requires the conjunction of three elements, each harmless alone — the deadly triad Sutton & Barto (2018) :

  1. Function approximation — a value model with fewer parameters than states.
  2. Bootstrapping — TD/DP targets that reuse current estimates (vs. Monte Carlo’s full returns).
  3. Off-policy training — updating from a distribution other than the policy’s own.

Any two are safe: tabular off-policy bootstrapping is Q-learning (Chapter 4, convergent); on-policy bootstrapping with approximation is Proposition 5.1; Monte Carlo with approximation and off-policy data is a genuine gradient method and is stable. All three together can make ΠμTπ\Pi_\mu\bellman^\policy — now with μ\mu the wrong (off-policy) distribution — an expansion, and the parameters diverge to infinity. Baird Baird (1995) built the canonical witness.

Example 5.1 (Baird's counterexample).

Seven states, a linear feature map of dimension eight, and zero reward everywhere — so vπ=0\valuefn_\policy = 0 is trivially representable (θ=0\theta = 0). The target policy always takes the “solid” action (to the seventh state); the behaviour policy explores all states (off-policy). Off-policy semi-gradient TD with uniform state weighting applies a linear update θk+1=(I+αA)θk\theta_{k+1} = (I + \alpha A)\theta_k whose matrix AA has an eigenvalue with positive real part. The weights diverge geometrically even though a perfect, zero-error solution sits at the origin.

Eight weight trajectories all growing in magnitude without bound over sweeps, on a log-scaled vertical axis.
Off-policy semi-gradient TD on Baird's counterexample (gamma = 0.99): every component of the linear weight vector diverges, and the value-error norm grows without bound, although the exact solution theta = 0 has zero error. Produced by experiments/python/week05/baird.py.

The lesson is not that approximation is hopeless but that the interaction is what bites: the projected operator’s contraction was an on-policy privilege, and off-policy data revokes it. The next chapter’s engineering — target networks and replay — exists precisely to tame this instability well enough to train deep off-policy value functions in practice.

Beyond semi-gradient: LSTD and true-gradient methods

Two routes around the difficulty are worth naming. Least-squares TD (LSTD) solves the linear TD fixed-point equations directly: accumulate A=tϕ(St)(ϕ(St)γϕ(St+1))A = \sum_t \phi(S_t)\big( \phi(S_t) - \discount\phi(S_{t+1})\big)^\top and b=tϕ(St)Rt+1b = \sum_t \phi(S_t)R_{t+1}, then set θ=A1b\theta^\star = A^{-1}b — no step size, far more sample-efficient per datum, converging to the same on-policy fixed point as Proposition 5.1. Gradient-TD methods (GTD, TDC) instead descend a true objective (the projected Bellman error), restoring convergence even off-policy at the cost of a second set of weights. Both are the subject of the roadmap’s extension; LSTD appears in the companion.

The dynamic-programming bridge

Function approximation is the first crack in the contraction spine. Chapters 1–4 chased the unique fixed point of a γ\discount-contraction; here the object becomes ΠμTπ\Pi_\mu\bellman^\policy, and its fixed point both moves (by the projection error) and, off-policy, may not exist as a stable limit at all. Two bridges forward:

  • To control (LQR, Ch. 13). The linear-quadratic regulator is the lucky case where the optimal value is exactly quadratic — representable with zero projection error — so the approximation never bites and the Riccati recursion converges cleanly. Most of control’s tractable cases are exactly this: a function class the true value lives inside.
  • To deep RL (Week 6). DQN keeps bootstrapping, off-policy data, and a (nonlinear) approximator — the full triad — and survives by engineering the instability down: a slowly-updated target network freezes the bootstrap target, and experience replay decorrelates and re-weights the update distribution toward something tamer.

What’s next

  • Week 6 builds DQN: the replay buffer and target network as direct countermeasures to the deadly triad, scaling approximate value learning to pixels.

Exercises

  1. (Derive) Write the semi-gradient TD(0) update for linear VθV_\theta and show it is not the gradient of 12E[δt2]\tfrac12\E[\delta_t^2]. Identify the missing term.

    Solution

    θ12δt2=δtθδt=δt(γϕ(St+1)ϕ(St))\nabla_\theta\tfrac12\delta_t^2 = \delta_t\nabla_\theta\delta_t = \delta_t\big( \discount\phi(S_{t+1}) - \phi(S_t)\big). Semi-gradient TD uses δtϕ(St)-\delta_t\phi(S_t) (ascending δtϕ(St)\delta_t\phi(S_t)), dropping the γδtϕ(St+1)\discount\delta_t\phi(S_{t+1}) term from differentiating the target. The dropped term is exactly what would make it a true gradient — and what, when present (residual-gradient methods), changes the fixed point and the stability.

  2. (Prove) Show ΠμTπ\Pi_\mu\bellman^\policy is a γ\discount-contraction in μ\norm{\cdot}_\mu when μ\mu is the on-policy distribution, and derive the projection-error bound (Prop. 5.1).

    Solution

    Tπ\bellman^\policy is a γ\discount-contraction in μ\norm{\cdot}_\mu because PπP_\policy is nonexpansive under the stationary μ\mu (Jensen + μPπ=μ\mu^\top P_\policy = \mu^\top); Πμ\Pi_\mu is nonexpansive as an orthogonal projection; the composition is a γ\discount-contraction. The bound follows from Πμvπ=ΠμTπvπ\Pi_\mu\valuefn_\policy = \Pi_\mu\bellman^\policy\valuefn_\policy and the triangle inequality, as in the proof above.

  3. (Compute) In Baird’s setup the update is θk+1=(I+αA)θk\theta_{k+1} = (I + \alpha A)\theta_k. What property of AA (or of I+αAI + \alpha A) causes divergence, and why does a small α\alpha not prevent it?

    Solution

    Divergence occurs iff I+αAI + \alpha A has spectral radius >1> 1, i.e. AA has an eigenvalue with positive real part (for small α>0\alpha>0, ρ(I+αA)>1\rho(I+\alpha A)>1 exactly then). Shrinking α\alpha only slows the geometric blow-up; it cannot flip the sign of the unstable eigenvalue. The companion computes AA‘s spectrum.

  4. (Implement) Verify on the companion that on-policy semi-gradient TD converges to (near) vπ\valuefn_\policy on the random walk, while off-policy semi-gradient TD on Baird’s counterexample diverges (the committed figure).

    Solution

    See experiments/python/week05/test_fa.py: on-policy semi-gradient TD with tabular-complete features matches dp.policy_evaluation within a sampling tolerance and keeps bounded weights; Baird’s off-policy update grows the weight norm without bound (the divergence the figure plots).

  5. (Extend) Implement LSTD and confirm it recovers the same on-policy fixed point as semi-gradient TD (and vπ\valuefn_\policy exactly when the features are tabular-complete).

    Solution

    LSTD solves Aθ=bA\theta = b with A=tϕ(St)(ϕ(St)γϕ(St+1))A = \sum_t\phi(S_t)(\phi(S_t)-\discount\phi(S_{t+1}))^\top, b=tϕ(St)Rt+1b=\sum_t\phi(S_t)R_{t+1}; its solution is the TD fixed point. With tabular-complete features the projection error is zero and θ\theta^\star reproduces vπ\valuefn_\policy, which the companion’s lstd asserts against the Week-1 linear solve.

Companion code

The Week-5 companion lives at experiments/python/week05/ and reuses the Week-3 random walk and the Week-1 dp.policy_evaluation oracle. It is the chapter’s two poles side by side: a convergent on-policy method and a divergent off-policy one. (The roadmap suggests tile coding on MountainCar-v0; the random walk is used here because it has a closed-form vπ\valuefn_\policy to test against — MountainCar tile-coding is a natural deferred showcase that lacks an exact oracle.)

  • fa.py — linear semi_gradient_td (constant step size with Polyak–Ruppert averaging) and lstd for a generic (P, R, policy) with an arbitrary feature matrix Φ\Phi. Pure NumPy.
  • baird.py — Baird’s seven-state counterexample: builds the feature matrix and the off-policy expected-update matrix AA, iterates to divergence, and (with --plot) renders the committed divergence figure.
  • test_fa.py — mathematical-correctness tests: on-policy semi-gradient TD converges to the analytic vπ\valuefn_\policy (and stays bounded); LSTD recovers vπ\valuefn_\policy with tabular-complete features; and Baird’s off-policy update diverges (the weight norm grows without bound while AA has an unstable eigenvalue).
# core algorithms + correctness tests (pure NumPy, no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week05/test_fa.py -q

# regenerate the committed Baird divergence figure
PYTHONPATH=. python experiments/python/week05/baird.py --plot