Part I · Foundations Week 4 Published td.py test_td.py

Temporal-Difference Learning: TD(0), SARSA, and Q-Learning

Bootstrapping from sampled transitions: the TD(0) prediction update as a stochastic Euler step toward the Bellman fixed point, SARSA as on-policy control, and Q-learning as off-policy control. Where temporal-difference learning sits between Monte Carlo and dynamic programming, and what the SARSA/Q-learning split reveals on the cliff.

On this page

The TD(0) update
TD(0) as a stochastic Euler step
SARSA: on-policy control
Q-learning: off-policy control
On-policy vs off-policy: the cliff
n-step TD and the bias–variance dial
The dynamic-programming bridge
What’s next
Exercises
Companion code

Temporal-Difference Learning: TD(0), SARSA, and Q-Learning

Where we are. Monte Carlo (Chapter 3) waited for a full episode to terminate, then averaged the realized return — unbiased, but high-variance and episodic-only. Dynamic programming (Chapters 1–2) never sampled at all; it bootstrapped each value from other current estimates through the model. Temporal-difference learning is the missing synthesis: sample one transition like Monte Carlo, and bootstrap from the current estimate like dynamic programming. The result learns online, from a single step, with no model and no wait for the episode to end. Its load-bearing object is the TD error $\delta_t = R_{t+1} + \discount V(S_{t+1}) - V(S_t)$ — a one-sample estimate of the Bellman residual.

Chapter 4 — at a glance

Goal. Read TD(0) as a stochastic-approximation step toward $\valuefn_\policy$ ; derive its expected update as a Bellman-operator step; build SARSA (on-policy) and Q-learning (off-policy) control on the same error; and see the on-/off-policy split made concrete on the cliff.

Reading time. ~35 minutes; ~55 with the proofs and exercises.

Key insight — the DP bridge. TD is the sampled, bootstrapped, asynchronous Bellman backup. Where Chapter 2’s real-time DP applied the model expectation along a trajectory, TD replaces that expectation with a single sampled successor and takes a small step $\alpha$ toward it: $V(S_t) \leftarrow V(S_t) + \alpha\,\delta_t$ . In expectation it is the Bellman operator (the model average), so it inherits the $\discount$ -contraction’s fixed point while paying sampling variance for dropping the model.

The TD(0) update

Given a single transition $(S_t, R_{t+1}, S_{t+1})$ generated under $\policy$ , TD(0) nudges the value of the visited state toward a bootstrapped target:

V(S_t) \;\leftarrow\; V(S_t) + \alpha\,\big[\, \underbrace{R_{t+1} + \discount V(S_{t+1})}_{\text{TD target}} - V(S_t) \,\big] = V(S_t) + \alpha\,\delta_t ,

with step size $\alpha \in (0,1]$ . Compare the three targets we have now for the same quantity $\valuefn_\policy(S_t)$ : Monte Carlo uses the full sampled return $G_t$ (no bootstrap, needs the episode to end); dynamic programming uses the model expectation $(\bellman^\policy V)(S_t)$ (full bootstrap, needs the model); TD(0) uses $R_{t+1} + \discount V(S_{t+1})$ — one sampled successor plus a bootstrap off the current estimate.

It updates every step, online, and never references

\transition

TD(0) as a stochastic Euler step

Why should nudging toward a bootstrapped (hence biased, while $V \ne \valuefn_\policy$ ) target converge to the right answer? Because in expectation the nudge is a Bellman-operator step.

Proposition 4.1 (Expected TD(0) update).

For any value estimate $V$ and any state $s$ , the expected TD error under $\policy$ is the Bellman expectation residual,

\E_\policy\!\big[\,\delta_t \,\big|\, S_t = s\,\big] = (\bellman^\policy V)(s) - V(s).

Hence the expected TD(0) update moves $V(s)$ a fraction $\alpha$ of the way toward $(\bellman^\policy V)(s)$ .

Proof.

Condition on $S_t = s$ and average over the action and transition drawn under $\policy$ :

\begin{aligned} \E_\policy[\delta_t \mid S_t = s] &= \E_\policy[\, R_{t+1} + \discount V(S_{t+1}) \mid S_t = s\,] - V(s) && \text{($V(s)$ is deterministic given $s$)} \\ &= \sum_a \policy(a\mid s)\sum_{s',r}\transition(s',r\mid s,a)\big[r + \discount V(s')\big] - V(s) && \text{(expand the expectation)} \\ &= (\bellman^\policy V)(s) - V(s) && \text{(definition of }\bellman^\policy\text{, Ch. 1 Def. 1.4).} \end{aligned}

So $\E[\,V(s) + \alpha\delta_t\,] = V(s) + \alpha\big[(\bellman^\policy V)(s) - V(s)\big]$ . $\qquad\blacksquare$

Read the deterministic part as a forward-Euler step on the ODE $\dot V = \bellman^\policy V - V$ , whose unique equilibrium is the fixed point $\bellman^\policy \valuefn_\policy = \valuefn_\policy$ — that is, $\valuefn_\policy$ .

TD(0) follows this flow with stochastic targets, so it is a Robbins–Monro stochastic-approximation scheme: under the step-size conditions

\sum_t \alpha_t = \infty

\sum_t \alpha_t^2 < \infty

and every state visited infinitely often,

V \to \valuefn_\policy

Sutton & Barto (2018) Bertsekas (2017) . The contraction of Chapter 1 supplies the stability the ODE needs; sampling supplies the variance that the diminishing

\alpha_t

averages away.

SARSA: on-policy control

Prediction estimates $\valuefn_\policy$ ; control needs action-values. SARSA forms the same TD error on $Q$ from the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ — its namesake — where $A_{t+1}$ is the action the agent actually takes next under its (typically $\varepsilon$ -greedy) policy:

Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\big[\,R_{t+1} + \discount Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)\,\big].

Because the bootstrap uses the value of the sampled next action, SARSA is on-policy: it evaluates and improves the very policy it follows. Interleaving this evaluation with $\varepsilon$ -greedy improvement is generalized policy iteration (Chapter 1) driven by sampled errors; under GLIE (greedy in the limit with infinite exploration — $\varepsilon \to 0$ slowly) it converges to $\optqfn$ .

Q-learning: off-policy control

Q-learning Watkins & Dayan (1992) changes one symbol — the next action is replaced by the greedy one:

Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\big[\,R_{t+1} + \discount \max_{a'} Q(S_{t+1},a') - Q(S_t,A_t)\,\big].

That $\max$ makes the target the sampled optimality backup, so Q-learning learns $\optqfn$ directly while behaving by any sufficiently exploratory policy — it is off-policy. Watkins and Dayan proved $Q \to \optqfn$ with probability one under the Robbins–Monro step sizes and infinite visits to every state–action pair. The contrast with SARSA is exactly the contrast between the two operators of Chapter 1, now on action-values.

Proposition 4.2 (SARSA and Q-learning target the two Bellman operators).

With the action-value operators $(\bellman^\policy Q)(s,a) = \sum_{s',r}\transition(s',r\mid s,a)[r + \discount\sum_{a'}\policy(a'\mid s')Q(s',a')]$ and $(\bellmanopt Q)(s,a) = \sum_{s',r}\transition(s',r\mid s,a)[r + \discount\max_{a'}Q(s',a')]$ , the expected one-step targets are

\E[\,\text{SARSA target}\mid S_t=s,A_t=a] = (\bellman^\policy Q)(s,a), \qquad \E[\,\text{Q-learning target}\mid S_t=s,A_t=a] = (\bellmanopt Q)(s,a).

So SARSA’s fixed point is $\qfn_\policy$ (for the policy it follows) and Q-learning’s is $\optqfn$ .

Proof.

Average each target over $(S_{t+1},R_{t+1})$ drawn from $\transition(\cdot\mid s,a)$ . SARSA additionally averages $A_{t+1}\sim\policy(\cdot\mid S_{t+1})$ , producing $\sum_{a'}\policy(a'\mid s')Q(s',a')$ inside the bracket — exactly $\bellman^\policy$ . Q-learning takes $\max_{a'}Q(s',a')$ deterministically, producing $\bellmanopt$ . A stochastic-approximation step toward a $\discount$ -contraction’s value converges to its unique fixed point (Prop. 4.1’s argument, now for the action-value operators): $\qfn_\policy$ for $\bellman^\policy$ , $\optqfn$ for $\bellmanopt$ . $\qquad\blacksquare$

On-policy vs off-policy: the cliff

The split is not academic. On the cliff-walking gridworld — a corridor whose bottom edge is a cliff that costs a large penalty and resets the agent — an optimal agent walks right along the cliff’s lip, the shortest path. Q-learning, estimating $\optqfn$ , learns exactly that risky-optimal route. SARSA, estimating $\qfn_\policy$ for the $\varepsilon$ -greedy policy it actually follows, accounts for the fact that exploration will occasionally step it off the cliff, so it prefers a safer path one row up.

Both are correct about different questions — and at a fixed

\varepsilon

, the start-state value Q-learning reports is the higher (optimal) one, while SARSA’s is lower because it embeds the cost of exploration. The companion measures exactly this gap.

n-step TD and the bias–variance dial

TD(0) and Monte Carlo are the endpoints of one family. The $n$ -step return $G_{t:t+n} = R_{t+1} + \discount R_{t+2} + \cdots + \discount^{n-1}R_{t+n} + \discount^{n} V(S_{t+n})$ bootstraps after $n$ sampled rewards; $n=1$ is TD(0), $n=\infty$ is Monte Carlo. Larger $n$ samples more (higher variance, less bootstrap bias); smaller $n$ bootstraps more (lower variance, more bias while $V$ is wrong). Intermediate $n$ — and its geometric average TD( $\lambda$ ) — usually beats both endpoints, the bias–variance dial made adjustable.

The dynamic-programming bridge

Temporal-difference learning closes a loop opened in Chapter 2. Real-time DP applied asynchronous Bellman backups along trajectories, but using the model expectation $\sum_{s'}\transition(s'\mid s,a)$ . Replace that expectation with a single sampled successor and the deterministic backup becomes the stochastic TD update — RTDP’s “Monte-Carlo cousin,” promised in Chapter 2, is precisely Q-learning. The three methods now form a clean table on two axes:

Dynamic programming — bootstrap, no sample (model expectation, full sweep).
Monte Carlo — sample, no bootstrap (full return, episode end).
Temporal difference — sample and bootstrap (one transition, online).

All three chase the same fixed point of the same $\discount$ -contraction; they differ only in how much they sample and how much they bootstrap.

What’s next

Week 5 replaces the value table with a parametric approximator $V_\theta$ . Bootstrapping (this chapter), off-policy data (Q-learning), and function approximation together form the deadly triad — the first place the clean contraction story breaks, because approximation moves the fixed point itself.

Exercises

(Derive) Show $\E_\policy[\delta_t \mid S_t = s] = (\bellman^\policy V)(s) - V(s)$ for any $V$ , and conclude TD(0) is a stochastic step toward $\valuefn_\policy$ (Prop. 4.1).

Solution
$\E_\policy[\delta_t\mid S_t=s] = \E_\policy[R_{t+1}+\discount V(S_{t+1})\mid S_t=s] - V(s) = \sum_a\policy(a\mid s)\sum_{s',r}\transition(s',r\mid s,a)[r+\discount V(s')] - V(s) = (\bellman^\policy V)(s) - V(s)$ . The expected update is therefore $V(s) + \alpha[(\bellman^\policy V)(s) - V(s)]$ , a damped step of the $\discount$ -contraction whose fixed point is $\valuefn_\policy$ .
(Prove) Show the expected Q-learning target is $(\bellmanopt Q)(s,a)$ and the expected SARSA target is $(\bellman^\policy Q)(s,a)$ , hence their fixed points are $\optqfn$ and $\qfn_\policy$ (Prop. 4.2).

Solution
Averaging over $(S_{t+1},R_{t+1})\sim\transition(\cdot\mid s,a)$ : Q-learning’s $\max_{a'}Q(S_{t+1},a')$ gives $\sum_{s',r}\transition[r+\discount\max_{a'}Q(s',a')] = (\bellmanopt Q)(s,a)$ . SARSA additionally averages $A_{t+1}\sim\policy(\cdot\mid S_{t+1})$ , giving $\sum_{s',r}\transition[r+\discount\sum_{a'}\policy(a'\mid s')Q(s',a')] = (\bellman^\policy Q)(s,a)$ . The fixed points follow from the contraction of each operator.
(Compute) With $\discount = 0.9$ , $\alpha = 0.1$ , current $V(s)=2$ , observed $R_{t+1}=1$ , $V(s')=3$ : compute $\delta_t$ and the updated $V(s)$ .

Solution
$\delta_t = 1 + 0.9\cdot 3 - 2 = 1.7$ ; $V(s)\leftarrow 2 + 0.1\cdot 1.7 = 2.17$ . The same numbers drive a SARSA or Q-learning update with $Q(s',\cdot)$ in place of $V(s')$ (the sampled-action value for SARSA, the max for Q-learning).
(Implement) In the companion, verify TD(0) converges to the analytic $\valuefn_\policy$ on the random walk; that Q-learning (GLIE) recovers the optimal cliff-edge policy while SARSA settles on the safer, slightly longer route; and that at a fixed $\varepsilon$ Q-learning’s start-state value is at least SARSA’s.

Solution
See experiments/python/week04/test_td.py: TD(0) matches dp.policy_evaluation within a sampling tolerance; Q-learning’s greedy policy attains the optimal start-state value from dp.value_iteration while SARSA’s greedy is a sensible but suboptimal goal-reaching route (the safe path), so the off-policy greedy value is $\ge$ the on-policy one; and the fixed- $\varepsilon$ start-state estimate of Q-learning is $\ge$ SARSA’s — the optimism/realism gap.
(Extend) Sweep $n$ in $n$ -step TD on the random walk and reproduce the characteristic U-shaped error-vs- $n$ curve (an intermediate $n$ beats both TD(0) and Monte Carlo).

Solution
Bootstrapping bias falls with $n$ while sampling variance rises; the sum is minimized at intermediate $n$ . The qualitative U-curve over $n$ (for a fixed $\alpha$ ) is the standard Sutton & Barto result; the companion’s --nstep sweep reproduces it.

Companion code

The Week-4 companion lives at experiments/python/week04/ and reuses earlier weeks: the random walk from Week 3 (randomwalk.py) for prediction, and the Week-1 dp solvers as the exact oracle for both prediction and control.

cliffwalk.py — a self-contained cliff-walking MDP (the Sutton & Barto Example 6.6 layout: a cliff row that penalizes and resets) as a generic (P, R) array pair, so $\optvaluefn$ and the optimal policy come from dp.value_iteration.
td.py — the transition sampler plus td0_prediction, sarsa, and q_learning, all operating on a generic (P, R, terminals), with optional GLIE schedules (visit-count step sizes, decaying $\varepsilon$ ). Pure NumPy.
test_td.py — mathematical-correctness tests: TD(0) converges to the analytic $\valuefn_\policy$ ; under GLIE Q-learning recovers the optimal start-state value while SARSA learns the safer, slightly longer route; and Q-learning’s fixed- $\varepsilon$ start value is at least SARSA’s (the cliff’s on-/off-policy gap).

# core algorithms + correctness tests (pure NumPy, no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week04/test_td.py -q

# worked SARSA-vs-Q-learning comparison on the cliff (prints both learned paths)
PYTHONPATH=. python experiments/python/week04/cliffwalk.py