Part I · Foundations Week 5 Published fa.py test_fa.py

Function Approximation and the Deadly Triad

Replacing the value table with a parametric approximator: linear value functions, semi-gradient TD, and the projected Bellman operator. Why on-policy semi-gradient TD converges to a bounded-error fixed point, and why function approximation, bootstrapping, and off-policy training together can diverge — the deadly triad, witnessed by Baird's counterexample.

On this page

Linear approximation and semi-gradient TD
The projected Bellman operator
When it works: on-policy convergence
The deadly triad
Beyond semi-gradient: LSTD and true-gradient methods
The dynamic-programming bridge
What’s next
Exercises
Companion code

Function Approximation and the Deadly Triad

Where we are. Every method so far stored one number per state in a table. That ends the moment the state space is large or continuous: we must approximate the value function with a parametric model $V_\theta$ and let it generalize across states. This chapter is the first where the clean contraction story of Chapters 1–4 breaks. Two results frame it. On-policy, semi-gradient TD still converges — but to a projected fixed point whose error is the representation’s projection error, amplified by $1/(1-\discount)$ : approximation changes the fixed point itself. Off-policy, the same update can diverge. The combination that breaks is named the deadly triad — function approximation, bootstrapping, and off-policy training — and Baird’s counterexample is its sharpest witness.

Chapter 5 — at a glance

Goal. Define linear value-function approximation and the semi-gradient TD update; read the TD fixed point as the fixed point of a projected Bellman operator; prove on-policy convergence with a projection-error bound; and see why the deadly triad diverges, on Baird’s counterexample.

Reading time. ~40 minutes; ~65 with the proofs and exercises.

Key insight — the DP bridge. With a table, $\valuefn_\policy$ was the fixed point of a $\discount$ -contraction. With approximation it is replaced by the fixed point of $\Pi\bellman^\policy$ — the Bellman operator followed by a projection back onto the representable subspace. On-policy, that projection is nonexpansive in the right norm and the composition stays a contraction; off-policy, the projection is taken in the wrong norm and the composition can expand. The discount that guaranteed convergence in Chapter 1 is no longer enough on its own.

Linear approximation and semi-gradient TD

Replace the table $V \in \R^{\statespace}$ with a parametric $V_\theta$ . We focus on the linear case, where a feature map $\phi:\statespace\to\R^d$ ( $d \ll \lvert\statespace\rvert$ ) gives

V_\theta(s) = \theta^\top \phi(s), \qquad \theta \in \R^d .

The representable value functions form a $d$ -dimensional subspace $\{\Phi\theta : \theta\in\R^d\}$ of $\R^{\statespace}$ , where $\Phi$ stacks the feature rows. Learning now adjusts $\theta$ , and a value learned at one state moves the values of all states sharing features — generalization, the whole point.

Definition 5.1 (Semi-gradient TD(0)).

Given a transition $(S_t, R_{t+1}, S_{t+1})$ , semi-gradient TD(0) updates

\theta \;\leftarrow\; \theta + \alpha\,\delta_t\,\phi(S_t), \qquad \delta_t = R_{t+1} + \discount\,\theta^\top\phi(S_{t+1}) - \theta^\top\phi(S_t).

It is called semi-gradient because it treats the bootstrap target $\discount\,\theta^\top\phi(S_{t+1})$ as fixed — it does not differentiate the target with respect to $\theta$ , so the update is not the true gradient of any fixed objective.

That last point is the crux of the chapter.

A true-gradient method on a fixed loss cannot diverge to infinity; semi-gradient TD can, because the “target” it descends toward moves with

\theta

The projected Bellman operator

Where does semi-gradient TD converge, when it does? Not to $\bellman^\policy V_\theta$ — that generally leaves the representable subspace and is not expressible as any $\theta$ . The update implicitly projects it back.

Definition 5.2 (Projection and the TD fixed point).

Let $\mu$ be a distribution over states and $\langle x,y\rangle_\mu = \sum_s \mu(s)x(s)y(s)$ the weighted inner product, with norm $\norm{\cdot}_\mu$ . The projection $\Pi_\mu$ maps a value function to the nearest representable one, $\Pi_\mu x = \arg\min_{\theta}\norm{\Phi\theta - x}_\mu$ . The TD fixed point is the $V_\theta$ satisfying

V_\theta = \Pi_\mu\,\bellman^\policy V_\theta ,

the fixed point of the composed projected Bellman operator $\Pi_\mu\bellman^\policy$ .

Semi-gradient TD is a stochastic-approximation scheme (Chapter 4) for this fixed point: in expectation, under the state-visitation distribution $\mu$ its update drives $\theta$ toward $\Pi_\mu\bellman^\policy V_\theta$ . Whether that iteration converges hinges entirely on whether $\Pi_\mu\bellman^\policy$ is a contraction — which depends on $\mu$ .

When it works: on-policy convergence

Proposition 5.1 (On-policy convergence and the projection-error bound).

Let $\mu$ be the on-policy stationary distribution of $\policy$ (so $\mu^\top P_\policy = \mu^\top$ ). Then $\bellman^\policy$ is a $\discount$ -contraction in $\norm{\cdot}_\mu$ , and $\Pi_\mu$ is nonexpansive in $\norm{\cdot}_\mu$ , so $\Pi_\mu\bellman^\policy$ is a $\discount$ -contraction with a unique fixed point $V_{\theta^\star}$ . Its error is the projection error, amplified by the horizon:

\norm{V_{\theta^\star} - \valuefn_\policy}_\mu \;\le\; \frac{1}{1-\discount}\, \norm{\Pi_\mu \valuefn_\policy - \valuefn_\policy}_\mu .

Proof.

Two facts compose. (i) Under the stationary $\mu$ , the transition operator is nonexpansive in $\norm{\cdot}_\mu$ :

\begin{aligned} \norm{P_\policy x}_\mu^2 &= \sum_s \mu(s)\Big(\sum_{s'}P_\policy(s'\mid s)\,x(s')\Big)^2 && \text{(definition)} \\ &\le \sum_s \mu(s)\sum_{s'}P_\policy(s'\mid s)\,x(s')^2 && \text{(Jensen; }P_\policy(\cdot\mid s)\text{ a distribution)} \\ &= \sum_{s'}\Big(\sum_s \mu(s)P_\policy(s'\mid s)\Big)x(s')^2 = \sum_{s'}\mu(s')\,x(s')^2 = \norm{x}_\mu^2 && \text{(}\mu^\top P_\policy = \mu^\top\text{).} \end{aligned}

Since $\bellman^\policy x - \bellman^\policy y = \discount P_\policy(x-y)$ , this gives $\norm{\bellman^\policy x - \bellman^\policy y}_\mu \le \discount\norm{x-y}_\mu$ . (ii) $\Pi_\mu$ is an orthogonal projection in $\langle\cdot,\cdot\rangle_\mu$ , hence nonexpansive. So $\Pi_\mu\bellman^\policy$ is a $\discount$ -contraction and has a unique fixed point $V_{\theta^\star}$ (Banach, Ch. 1). For the bound, since $\bellman^\policy\valuefn_\policy = \valuefn_\policy$ we have $\Pi_\mu\valuefn_\policy = \Pi_\mu\bellman^\policy\valuefn_\policy$ , so

\norm{V_{\theta^\star} - \Pi_\mu\valuefn_\policy}_\mu = \norm{\Pi_\mu\bellman^\policy V_{\theta^\star} - \Pi_\mu\bellman^\policy\valuefn_\policy}_\mu \le \discount\norm{V_{\theta^\star} - \valuefn_\policy}_\mu .

With the triangle inequality $\norm{V_{\theta^\star}-\valuefn_\policy}_\mu \le \norm{V_{\theta^\star}-\Pi_\mu\valuefn_\policy}_\mu + \norm{\Pi_\mu\valuefn_\policy - \valuefn_\policy}_\mu$ , substitute and solve to get the stated bound. $\qquad\blacksquare$

Two readings. First, convergence is conditional: it rests on $\mu$ being the on-policy distribution, the one fact that makes $P_\policy$ nonexpansive.

Second, the fixed point moved: even at convergence the error is the projection error

\norm{\Pi_\mu\valuefn_\policy - \valuefn_\policy}_\mu

— what the feature space cannot represent — blown up by

1/(1-\discount)

. With a table the projection error is zero and we recover

\valuefn_\policy

exactly; with a poor feature space the fixed point can be far from

\valuefn_\policy

. Tsitsiklis and Van Roy Tsitsiklis & Van Roy (1997) established exactly this convergence and bound for on-policy linear TD.

The deadly triad

Proposition 5.1 needed all three of its ingredients to be benign. Drop the on-policy assumption and the guarantee collapses. The instability requires the conjunction of three elements, each harmless alone — the deadly triad Sutton & Barto (2018) :

Function approximation — a value model with fewer parameters than states.
Bootstrapping — TD/DP targets that reuse current estimates (vs. Monte Carlo’s full returns).
Off-policy training — updating from a distribution other than the policy’s own.

Any two are safe: tabular off-policy bootstrapping is Q-learning (Chapter 4, convergent); on-policy bootstrapping with approximation is Proposition 5.1; Monte Carlo with approximation and off-policy data is a genuine gradient method and is stable. All three together can make $\Pi_\mu\bellman^\policy$ — now with $\mu$ the wrong (off-policy) distribution — an expansion, and the parameters diverge to infinity. Baird Baird (1995) built the canonical witness.

Example 5.1 (Baird's counterexample).

Seven states, a linear feature map of dimension eight, and zero reward everywhere — so $\valuefn_\policy = 0$ is trivially representable ( $\theta = 0$ ). The target policy always takes the “solid” action (to the seventh state); the behaviour policy explores all states (off-policy). Off-policy semi-gradient TD with uniform state weighting applies a linear update $\theta_{k+1} = (I + \alpha A)\theta_k$ whose matrix $A$ has an eigenvalue with positive real part. The weights diverge geometrically even though a perfect, zero-error solution sits at the origin.

Eight weight trajectories all growing in magnitude without bound over sweeps, on a log-scaled vertical axis. — Off-policy semi-gradient TD on Baird's counterexample (gamma = 0.99): every component of the linear weight vector diverges, and the value-error norm grows without bound, although the exact solution theta = 0 has zero error. Produced by experiments/python/week05/baird.py.

The lesson is not that approximation is hopeless but that the interaction is what bites: the projected operator’s contraction was an on-policy privilege, and off-policy data revokes it. The next chapter’s engineering — target networks and replay — exists precisely to tame this instability well enough to train deep off-policy value functions in practice.

Beyond semi-gradient: LSTD and true-gradient methods

Two routes around the difficulty are worth naming. Least-squares TD (LSTD) solves the linear TD fixed-point equations directly: accumulate $A = \sum_t \phi(S_t)\big( \phi(S_t) - \discount\phi(S_{t+1})\big)^\top$ and $b = \sum_t \phi(S_t)R_{t+1}$ , then set $\theta^\star = A^{-1}b$ — no step size, far more sample-efficient per datum, converging to the same on-policy fixed point as Proposition 5.1. Gradient-TD methods (GTD, TDC) instead descend a true objective (the projected Bellman error), restoring convergence even off-policy at the cost of a second set of weights. Both are the subject of the roadmap’s extension; LSTD appears in the companion.

The dynamic-programming bridge

Function approximation is the first crack in the contraction spine. Chapters 1–4 chased the unique fixed point of a $\discount$ -contraction; here the object becomes $\Pi_\mu\bellman^\policy$ , and its fixed point both moves (by the projection error) and, off-policy, may not exist as a stable limit at all. Two bridges forward:

To control (LQR, Ch. 13). The linear-quadratic regulator is the lucky case where the optimal value is exactly quadratic — representable with zero projection error — so the approximation never bites and the Riccati recursion converges cleanly. Most of control’s tractable cases are exactly this: a function class the true value lives inside.
To deep RL (Week 6). DQN keeps bootstrapping, off-policy data, and a (nonlinear) approximator — the full triad — and survives by engineering the instability down: a slowly-updated target network freezes the bootstrap target, and experience replay decorrelates and re-weights the update distribution toward something tamer.

What’s next

Week 6 builds DQN: the replay buffer and target network as direct countermeasures to the deadly triad, scaling approximate value learning to pixels.

Exercises

(Derive) Write the semi-gradient TD(0) update for linear $V_\theta$ and show it is not the gradient of $\tfrac12\E[\delta_t^2]$ . Identify the missing term.

Solution
$\nabla_\theta\tfrac12\delta_t^2 = \delta_t\nabla_\theta\delta_t = \delta_t\big( \discount\phi(S_{t+1}) - \phi(S_t)\big)$ . Semi-gradient TD uses $-\delta_t\phi(S_t)$ (ascending $\delta_t\phi(S_t)$ ), dropping the $\discount\delta_t\phi(S_{t+1})$ term from differentiating the target. The dropped term is exactly what would make it a true gradient — and what, when present (residual-gradient methods), changes the fixed point and the stability.
(Prove) Show $\Pi_\mu\bellman^\policy$ is a $\discount$ -contraction in $\norm{\cdot}_\mu$ when $\mu$ is the on-policy distribution, and derive the projection-error bound (Prop. 5.1).

Solution
$\bellman^\policy$ is a $\discount$ -contraction in $\norm{\cdot}_\mu$ because $P_\policy$ is nonexpansive under the stationary $\mu$ (Jensen + $\mu^\top P_\policy = \mu^\top$ ); $\Pi_\mu$ is nonexpansive as an orthogonal projection; the composition is a $\discount$ -contraction. The bound follows from $\Pi_\mu\valuefn_\policy = \Pi_\mu\bellman^\policy\valuefn_\policy$ and the triangle inequality, as in the proof above.
(Compute) In Baird’s setup the update is $\theta_{k+1} = (I + \alpha A)\theta_k$ . What property of $A$ (or of $I + \alpha A$ ) causes divergence, and why does a small $\alpha$ not prevent it?

Solution
Divergence occurs iff $I + \alpha A$ has spectral radius $> 1$ , i.e. $A$ has an eigenvalue with positive real part (for small $\alpha>0$ , $\rho(I+\alpha A)>1$ exactly then). Shrinking $\alpha$ only slows the geometric blow-up; it cannot flip the sign of the unstable eigenvalue. The companion computes $A$ ‘s spectrum.
(Implement) Verify on the companion that on-policy semi-gradient TD converges to (near) $\valuefn_\policy$ on the random walk, while off-policy semi-gradient TD on Baird’s counterexample diverges (the committed figure).

Solution
See experiments/python/week05/test_fa.py: on-policy semi-gradient TD with tabular-complete features matches dp.policy_evaluation within a sampling tolerance and keeps bounded weights; Baird’s off-policy update grows the weight norm without bound (the divergence the figure plots).
(Extend) Implement LSTD and confirm it recovers the same on-policy fixed point as semi-gradient TD (and $\valuefn_\policy$ exactly when the features are tabular-complete).

Solution
LSTD solves $A\theta = b$ with $A = \sum_t\phi(S_t)(\phi(S_t)-\discount\phi(S_{t+1}))^\top$ , $b=\sum_t\phi(S_t)R_{t+1}$ ; its solution is the TD fixed point. With tabular-complete features the projection error is zero and $\theta^\star$ reproduces $\valuefn_\policy$ , which the companion’s lstd asserts against the Week-1 linear solve.

Companion code

The Week-5 companion lives at experiments/python/week05/ and reuses the Week-3 random walk and the Week-1 dp.policy_evaluation oracle. It is the chapter’s two poles side by side: a convergent on-policy method and a divergent off-policy one. (The roadmap suggests tile coding on MountainCar-v0; the random walk is used here because it has a closed-form $\valuefn_\policy$ to test against — MountainCar tile-coding is a natural deferred showcase that lacks an exact oracle.)

fa.py — linear semi_gradient_td (constant step size with Polyak–Ruppert averaging) and lstd for a generic (P, R, policy) with an arbitrary feature matrix $\Phi$ . Pure NumPy.
baird.py — Baird’s seven-state counterexample: builds the feature matrix and the off-policy expected-update matrix $A$ , iterates to divergence, and (with --plot) renders the committed divergence figure.
test_fa.py — mathematical-correctness tests: on-policy semi-gradient TD converges to the analytic $\valuefn_\policy$ (and stays bounded); LSTD recovers $\valuefn_\policy$ with tabular-complete features; and Baird’s off-policy update diverges (the weight norm grows without bound while $A$ has an unstable eigenvalue).

# core algorithms + correctness tests (pure NumPy, no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week05/test_fa.py -q

# regenerate the committed Baird divergence figure
PYTHONPATH=. python experiments/python/week05/baird.py --plot