Part I · Foundations Week 4 Published td.py test_td.py

Temporal-Difference Learning: TD(0), SARSA, and Q-Learning

Bootstrapping from sampled transitions: the TD(0) prediction update as a stochastic Euler step toward the Bellman fixed point, SARSA as on-policy control, and Q-learning as off-policy control. Where temporal-difference learning sits between Monte Carlo and dynamic programming, and what the SARSA/Q-learning split reveals on the cliff.

On this page
  1. The TD(0) update
  2. TD(0) as a stochastic Euler step
  3. SARSA: on-policy control
  4. Q-learning: off-policy control
  5. On-policy vs off-policy: the cliff
  6. n-step TD and the bias–variance dial
  7. The dynamic-programming bridge
  8. What’s next
  9. Exercises
  10. Companion code

Temporal-Difference Learning: TD(0), SARSA, and Q-Learning

Where we are. Monte Carlo (Chapter 3) waited for a full episode to terminate, then averaged the realized return — unbiased, but high-variance and episodic-only. Dynamic programming (Chapters 1–2) never sampled at all; it bootstrapped each value from other current estimates through the model. Temporal-difference learning is the missing synthesis: sample one transition like Monte Carlo, and bootstrap from the current estimate like dynamic programming. The result learns online, from a single step, with no model and no wait for the episode to end. Its load-bearing object is the TD error δt=Rt+1+γV(St+1)V(St)\delta_t = R_{t+1} + \discount V(S_{t+1}) - V(S_t) — a one-sample estimate of the Bellman residual.

The TD(0) update

Given a single transition (St,Rt+1,St+1)(S_t, R_{t+1}, S_{t+1}) generated under π\policy, TD(0) nudges the value of the visited state toward a bootstrapped target:

V(St)    V(St)+α[Rt+1+γV(St+1)TD targetV(St)]=V(St)+αδt,V(S_t) \;\leftarrow\; V(S_t) + \alpha\,\big[\, \underbrace{R_{t+1} + \discount V(S_{t+1})}_{\text{TD target}} - V(S_t) \,\big] = V(S_t) + \alpha\,\delta_t ,

with step size α(0,1]\alpha \in (0,1]. Compare the three targets we have now for the same quantity vπ(St)\valuefn_\policy(S_t): Monte Carlo uses the full sampled return GtG_t (no bootstrap, needs the episode to end); dynamic programming uses the model expectation (TπV)(St)(\bellman^\policy V)(S_t) (full bootstrap, needs the model); TD(0) uses Rt+1+γV(St+1)R_{t+1} + \discount V(S_{t+1})one sampled successor plus a bootstrap off the current estimate.

It updates every step, online, and never references p\transition.

TD(0) as a stochastic Euler step

Why should nudging toward a bootstrapped (hence biased, while VvπV \ne \valuefn_\policy) target converge to the right answer? Because in expectation the nudge is a Bellman-operator step.

Proposition 4.1 (Expected TD(0) update).

For any value estimate VV and any state ss, the expected TD error under π\policy is the Bellman expectation residual,

Eπ ⁣[δtSt=s]=(TπV)(s)V(s).\E_\policy\!\big[\,\delta_t \,\big|\, S_t = s\,\big] = (\bellman^\policy V)(s) - V(s).

Hence the expected TD(0) update moves V(s)V(s) a fraction α\alpha of the way toward (TπV)(s)(\bellman^\policy V)(s).

Proof.

Condition on St=sS_t = s and average over the action and transition drawn under π\policy:

Eπ[δtSt=s]=Eπ[Rt+1+γV(St+1)St=s]V(s)(V(s) is deterministic given s)=aπ(as)s,rp(s,rs,a)[r+γV(s)]V(s)(expand the expectation)=(TπV)(s)V(s)(definition of Tπ, Ch. 1 Def. 1.4).\begin{aligned} \E_\policy[\delta_t \mid S_t = s] &= \E_\policy[\, R_{t+1} + \discount V(S_{t+1}) \mid S_t = s\,] - V(s) && \text{($V(s)$ is deterministic given $s$)} \\ &= \sum_a \policy(a\mid s)\sum_{s',r}\transition(s',r\mid s,a)\big[r + \discount V(s')\big] - V(s) && \text{(expand the expectation)} \\ &= (\bellman^\policy V)(s) - V(s) && \text{(definition of }\bellman^\policy\text{, Ch. 1 Def. 1.4).} \end{aligned}

So E[V(s)+αδt]=V(s)+α[(TπV)(s)V(s)]\E[\,V(s) + \alpha\delta_t\,] = V(s) + \alpha\big[(\bellman^\policy V)(s) - V(s)\big]. \qquad\blacksquare

Read the deterministic part as a forward-Euler step on the ODE V˙=TπVV\dot V = \bellman^\policy V - V, whose unique equilibrium is the fixed point Tπvπ=vπ\bellman^\policy \valuefn_\policy = \valuefn_\policy — that is, vπ\valuefn_\policy.

TD(0) follows this flow with stochastic targets, so it is a Robbins–Monro stochastic-approximation scheme: under the step-size conditions tαt=\sum_t \alpha_t = \infty, tαt2<\sum_t \alpha_t^2 < \infty and every state visited infinitely often, VvπV \to \valuefn_\policy Sutton & Barto (2018) Bertsekas (2017) . The contraction of Chapter 1 supplies the stability the ODE needs; sampling supplies the variance that the diminishing αt\alpha_t averages away.

SARSA: on-policy control

Prediction estimates vπ\valuefn_\policy; control needs action-values. SARSA forms the same TD error on QQ from the quintuple (St,At,Rt+1,St+1,At+1)(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) — its namesake — where At+1A_{t+1} is the action the agent actually takes next under its (typically ε\varepsilon-greedy) policy:

Q(St,At)Q(St,At)+α[Rt+1+γQ(St+1,At+1)Q(St,At)].Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\big[\,R_{t+1} + \discount Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)\,\big].

Because the bootstrap uses the value of the sampled next action, SARSA is on-policy: it evaluates and improves the very policy it follows. Interleaving this evaluation with ε\varepsilon-greedy improvement is generalized policy iteration (Chapter 1) driven by sampled errors; under GLIE (greedy in the limit with infinite exploration — ε0\varepsilon \to 0 slowly) it converges to q\optqfn.

Q-learning: off-policy control

Q-learning Watkins & Dayan (1992) changes one symbol — the next action is replaced by the greedy one:

Q(St,At)Q(St,At)+α[Rt+1+γmaxaQ(St+1,a)Q(St,At)].Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\big[\,R_{t+1} + \discount \max_{a'} Q(S_{t+1},a') - Q(S_t,A_t)\,\big].

That max\max makes the target the sampled optimality backup, so Q-learning learns q\optqfn directly while behaving by any sufficiently exploratory policy — it is off-policy. Watkins and Dayan proved QqQ \to \optqfn with probability one under the Robbins–Monro step sizes and infinite visits to every state–action pair. The contrast with SARSA is exactly the contrast between the two operators of Chapter 1, now on action-values.

Proposition 4.2 (SARSA and Q-learning target the two Bellman operators).

With the action-value operators (TπQ)(s,a)=s,rp(s,rs,a)[r+γaπ(as)Q(s,a)](\bellman^\policy Q)(s,a) = \sum_{s',r}\transition(s',r\mid s,a)[r + \discount\sum_{a'}\policy(a'\mid s')Q(s',a')] and (TQ)(s,a)=s,rp(s,rs,a)[r+γmaxaQ(s,a)](\bellmanopt Q)(s,a) = \sum_{s',r}\transition(s',r\mid s,a)[r + \discount\max_{a'}Q(s',a')], the expected one-step targets are

E[SARSA targetSt=s,At=a]=(TπQ)(s,a),E[Q-learning targetSt=s,At=a]=(TQ)(s,a).\E[\,\text{SARSA target}\mid S_t=s,A_t=a] = (\bellman^\policy Q)(s,a), \qquad \E[\,\text{Q-learning target}\mid S_t=s,A_t=a] = (\bellmanopt Q)(s,a).

So SARSA’s fixed point is qπ\qfn_\policy (for the policy it follows) and Q-learning’s is q\optqfn.

Proof.

Average each target over (St+1,Rt+1)(S_{t+1},R_{t+1}) drawn from p(s,a)\transition(\cdot\mid s,a). SARSA additionally averages At+1π(St+1)A_{t+1}\sim\policy(\cdot\mid S_{t+1}), producing aπ(as)Q(s,a)\sum_{a'}\policy(a'\mid s')Q(s',a') inside the bracket — exactly Tπ\bellman^\policy. Q-learning takes maxaQ(s,a)\max_{a'}Q(s',a') deterministically, producing T\bellmanopt. A stochastic-approximation step toward a γ\discount-contraction’s value converges to its unique fixed point (Prop. 4.1’s argument, now for the action-value operators): qπ\qfn_\policy for Tπ\bellman^\policy, q\optqfn for T\bellmanopt. \qquad\blacksquare

On-policy vs off-policy: the cliff

The split is not academic. On the cliff-walking gridworld — a corridor whose bottom edge is a cliff that costs a large penalty and resets the agent — an optimal agent walks right along the cliff’s lip, the shortest path. Q-learning, estimating q\optqfn, learns exactly that risky-optimal route. SARSA, estimating qπ\qfn_\policy for the ε\varepsilon-greedy policy it actually follows, accounts for the fact that exploration will occasionally step it off the cliff, so it prefers a safer path one row up.

Both are correct about different questions — and at a fixed ε\varepsilon, the start-state value Q-learning reports is the higher (optimal) one, while SARSA’s is lower because it embeds the cost of exploration. The companion measures exactly this gap.

n-step TD and the bias–variance dial

TD(0) and Monte Carlo are the endpoints of one family. The nn-step return Gt:t+n=Rt+1+γRt+2++γn1Rt+n+γnV(St+n)G_{t:t+n} = R_{t+1} + \discount R_{t+2} + \cdots + \discount^{n-1}R_{t+n} + \discount^{n} V(S_{t+n}) bootstraps after nn sampled rewards; n=1n=1 is TD(0), n=n=\infty is Monte Carlo. Larger nn samples more (higher variance, less bootstrap bias); smaller nn bootstraps more (lower variance, more bias while VV is wrong). Intermediate nn — and its geometric average TD(λ\lambda) — usually beats both endpoints, the bias–variance dial made adjustable.

The dynamic-programming bridge

Temporal-difference learning closes a loop opened in Chapter 2. Real-time DP applied asynchronous Bellman backups along trajectories, but using the model expectation sp(ss,a)\sum_{s'}\transition(s'\mid s,a). Replace that expectation with a single sampled successor and the deterministic backup becomes the stochastic TD update — RTDP’s “Monte-Carlo cousin,” promised in Chapter 2, is precisely Q-learning. The three methods now form a clean table on two axes:

  • Dynamic programming — bootstrap, no sample (model expectation, full sweep).
  • Monte Carlo — sample, no bootstrap (full return, episode end).
  • Temporal difference — sample and bootstrap (one transition, online).

All three chase the same fixed point of the same γ\discount-contraction; they differ only in how much they sample and how much they bootstrap.

What’s next

  • Week 5 replaces the value table with a parametric approximator VθV_\theta. Bootstrapping (this chapter), off-policy data (Q-learning), and function approximation together form the deadly triad — the first place the clean contraction story breaks, because approximation moves the fixed point itself.

Exercises

  1. (Derive) Show Eπ[δtSt=s]=(TπV)(s)V(s)\E_\policy[\delta_t \mid S_t = s] = (\bellman^\policy V)(s) - V(s) for any VV, and conclude TD(0) is a stochastic step toward vπ\valuefn_\policy (Prop. 4.1).

    Solution

    Eπ[δtSt=s]=Eπ[Rt+1+γV(St+1)St=s]V(s)=aπ(as)s,rp(s,rs,a)[r+γV(s)]V(s)=(TπV)(s)V(s)\E_\policy[\delta_t\mid S_t=s] = \E_\policy[R_{t+1}+\discount V(S_{t+1})\mid S_t=s] - V(s) = \sum_a\policy(a\mid s)\sum_{s',r}\transition(s',r\mid s,a)[r+\discount V(s')] - V(s) = (\bellman^\policy V)(s) - V(s). The expected update is therefore V(s)+α[(TπV)(s)V(s)]V(s) + \alpha[(\bellman^\policy V)(s) - V(s)], a damped step of the γ\discount-contraction whose fixed point is vπ\valuefn_\policy.

  2. (Prove) Show the expected Q-learning target is (TQ)(s,a)(\bellmanopt Q)(s,a) and the expected SARSA target is (TπQ)(s,a)(\bellman^\policy Q)(s,a), hence their fixed points are q\optqfn and qπ\qfn_\policy (Prop. 4.2).

    Solution

    Averaging over (St+1,Rt+1)p(s,a)(S_{t+1},R_{t+1})\sim\transition(\cdot\mid s,a): Q-learning’s maxaQ(St+1,a)\max_{a'}Q(S_{t+1},a') gives s,rp[r+γmaxaQ(s,a)]=(TQ)(s,a)\sum_{s',r}\transition[r+\discount\max_{a'}Q(s',a')] = (\bellmanopt Q)(s,a). SARSA additionally averages At+1π(St+1)A_{t+1}\sim\policy(\cdot\mid S_{t+1}), giving s,rp[r+γaπ(as)Q(s,a)]=(TπQ)(s,a)\sum_{s',r}\transition[r+\discount\sum_{a'}\policy(a'\mid s')Q(s',a')] = (\bellman^\policy Q)(s,a). The fixed points follow from the contraction of each operator.

  3. (Compute) With γ=0.9\discount = 0.9, α=0.1\alpha = 0.1, current V(s)=2V(s)=2, observed Rt+1=1R_{t+1}=1, V(s)=3V(s')=3: compute δt\delta_t and the updated V(s)V(s).

    Solution

    δt=1+0.932=1.7\delta_t = 1 + 0.9\cdot 3 - 2 = 1.7; V(s)2+0.11.7=2.17V(s)\leftarrow 2 + 0.1\cdot 1.7 = 2.17. The same numbers drive a SARSA or Q-learning update with Q(s,)Q(s',\cdot) in place of V(s)V(s') (the sampled-action value for SARSA, the max for Q-learning).

  4. (Implement) In the companion, verify TD(0) converges to the analytic vπ\valuefn_\policy on the random walk; that Q-learning (GLIE) recovers the optimal cliff-edge policy while SARSA settles on the safer, slightly longer route; and that at a fixed ε\varepsilon Q-learning’s start-state value is at least SARSA’s.

    Solution

    See experiments/python/week04/test_td.py: TD(0) matches dp.policy_evaluation within a sampling tolerance; Q-learning’s greedy policy attains the optimal start-state value from dp.value_iteration while SARSA’s greedy is a sensible but suboptimal goal-reaching route (the safe path), so the off-policy greedy value is \ge the on-policy one; and the fixed-ε\varepsilon start-state estimate of Q-learning is \ge SARSA’s — the optimism/realism gap.

  5. (Extend) Sweep nn in nn-step TD on the random walk and reproduce the characteristic U-shaped error-vs-nn curve (an intermediate nn beats both TD(0) and Monte Carlo).

    Solution

    Bootstrapping bias falls with nn while sampling variance rises; the sum is minimized at intermediate nn. The qualitative U-curve over nn (for a fixed α\alpha) is the standard Sutton & Barto result; the companion’s --nstep sweep reproduces it.

Companion code

The Week-4 companion lives at experiments/python/week04/ and reuses earlier weeks: the random walk from Week 3 (randomwalk.py) for prediction, and the Week-1 dp solvers as the exact oracle for both prediction and control.

  • cliffwalk.py — a self-contained cliff-walking MDP (the Sutton & Barto Example 6.6 layout: a cliff row that penalizes and resets) as a generic (P, R) array pair, so v\optvaluefn and the optimal policy come from dp.value_iteration.
  • td.py — the transition sampler plus td0_prediction, sarsa, and q_learning, all operating on a generic (P, R, terminals), with optional GLIE schedules (visit-count step sizes, decaying ε\varepsilon). Pure NumPy.
  • test_td.py — mathematical-correctness tests: TD(0) converges to the analytic vπ\valuefn_\policy; under GLIE Q-learning recovers the optimal start-state value while SARSA learns the safer, slightly longer route; and Q-learning’s fixed-ε\varepsilon start value is at least SARSA’s (the cliff’s on-/off-policy gap).
# core algorithms + correctness tests (pure NumPy, no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week04/test_td.py -q

# worked SARSA-vs-Q-learning comparison on the cliff (prints both learned paths)
PYTHONPATH=. python experiments/python/week04/cliffwalk.py