Temporal-Difference Learning: TD(0), SARSA, and Q-Learning
Bootstrapping from sampled transitions: the TD(0) prediction update as a stochastic Euler step toward the Bellman fixed point, SARSA as on-policy control, and Q-learning as off-policy control. Where temporal-difference learning sits between Monte Carlo and dynamic programming, and what the SARSA/Q-learning split reveals on the cliff.
On this page
Temporal-Difference Learning: TD(0), SARSA, and Q-Learning
Where we are. Monte Carlo (Chapter 3) waited for a full episode to terminate, then averaged the realized return — unbiased, but high-variance and episodic-only. Dynamic programming (Chapters 1–2) never sampled at all; it bootstrapped each value from other current estimates through the model. Temporal-difference learning is the missing synthesis: sample one transition like Monte Carlo, and bootstrap from the current estimate like dynamic programming. The result learns online, from a single step, with no model and no wait for the episode to end. Its load-bearing object is the TD error — a one-sample estimate of the Bellman residual.
The TD(0) update
Given a single transition generated under , TD(0) nudges the value of the visited state toward a bootstrapped target:
with step size . Compare the three targets we have now for the same quantity : Monte Carlo uses the full sampled return (no bootstrap, needs the episode to end); dynamic programming uses the model expectation (full bootstrap, needs the model); TD(0) uses — one sampled successor plus a bootstrap off the current estimate. It updates every step, online, and never references .
TD(0) as a stochastic Euler step
Why should nudging toward a bootstrapped (hence biased, while ) target converge to the right answer? Because in expectation the nudge is a Bellman-operator step.
For any value estimate and any state , the expected TD error under is the Bellman expectation residual,
Hence the expected TD(0) update moves a fraction of the way toward .
Condition on and average over the action and transition drawn under :
So .
Read the deterministic part as a forward-Euler step on the ODE , whose unique equilibrium is the fixed point — that is, . TD(0) follows this flow with stochastic targets, so it is a Robbins–Monro stochastic-approximation scheme: under the step-size conditions , and every state visited infinitely often, Sutton & Barto (2018) Bertsekas (2017) . The contraction of Chapter 1 supplies the stability the ODE needs; sampling supplies the variance that the diminishing averages away.
SARSA: on-policy control
Prediction estimates ; control needs action-values. SARSA forms the same TD error on from the quintuple — its namesake — where is the action the agent actually takes next under its (typically -greedy) policy:
Because the bootstrap uses the value of the sampled next action, SARSA is on-policy: it evaluates and improves the very policy it follows. Interleaving this evaluation with -greedy improvement is generalized policy iteration (Chapter 1) driven by sampled errors; under GLIE (greedy in the limit with infinite exploration — slowly) it converges to .
Q-learning: off-policy control
Q-learning Watkins & Dayan (1992) changes one symbol — the next action is replaced by the greedy one:
That makes the target the sampled optimality backup, so Q-learning learns directly while behaving by any sufficiently exploratory policy — it is off-policy. Watkins and Dayan proved with probability one under the Robbins–Monro step sizes and infinite visits to every state–action pair. The contrast with SARSA is exactly the contrast between the two operators of Chapter 1, now on action-values.
With the action-value operators and , the expected one-step targets are
So SARSA’s fixed point is (for the policy it follows) and Q-learning’s is .
Average each target over drawn from . SARSA additionally averages , producing inside the bracket — exactly . Q-learning takes deterministically, producing . A stochastic-approximation step toward a -contraction’s value converges to its unique fixed point (Prop. 4.1’s argument, now for the action-value operators): for , for .
On-policy vs off-policy: the cliff
The split is not academic. On the cliff-walking gridworld — a corridor whose bottom edge is a cliff that costs a large penalty and resets the agent — an optimal agent walks right along the cliff’s lip, the shortest path. Q-learning, estimating , learns exactly that risky-optimal route. SARSA, estimating for the -greedy policy it actually follows, accounts for the fact that exploration will occasionally step it off the cliff, so it prefers a safer path one row up. Both are correct about different questions — and at a fixed , the start-state value Q-learning reports is the higher (optimal) one, while SARSA’s is lower because it embeds the cost of exploration. The companion measures exactly this gap.
n-step TD and the bias–variance dial
TD(0) and Monte Carlo are the endpoints of one family. The -step return bootstraps after sampled rewards; is TD(0), is Monte Carlo. Larger samples more (higher variance, less bootstrap bias); smaller bootstraps more (lower variance, more bias while is wrong). Intermediate — and its geometric average TD() — usually beats both endpoints, the bias–variance dial made adjustable.
The dynamic-programming bridge
Temporal-difference learning closes a loop opened in Chapter 2. Real-time DP applied asynchronous Bellman backups along trajectories, but using the model expectation . Replace that expectation with a single sampled successor and the deterministic backup becomes the stochastic TD update — RTDP’s “Monte-Carlo cousin,” promised in Chapter 2, is precisely Q-learning. The three methods now form a clean table on two axes:
- Dynamic programming — bootstrap, no sample (model expectation, full sweep).
- Monte Carlo — sample, no bootstrap (full return, episode end).
- Temporal difference — sample and bootstrap (one transition, online).
All three chase the same fixed point of the same -contraction; they differ only in how much they sample and how much they bootstrap.
What’s next
- Week 5 replaces the value table with a parametric approximator . Bootstrapping (this chapter), off-policy data (Q-learning), and function approximation together form the deadly triad — the first place the clean contraction story breaks, because approximation moves the fixed point itself.
Exercises
-
(Derive) Show for any , and conclude TD(0) is a stochastic step toward (Prop. 4.1).
Solution
. The expected update is therefore , a damped step of the -contraction whose fixed point is .
-
(Prove) Show the expected Q-learning target is and the expected SARSA target is , hence their fixed points are and (Prop. 4.2).
Solution
Averaging over : Q-learning’s gives . SARSA additionally averages , giving . The fixed points follow from the contraction of each operator.
-
(Compute) With , , current , observed , : compute and the updated .
Solution
; . The same numbers drive a SARSA or Q-learning update with in place of (the sampled-action value for SARSA, the max for Q-learning).
-
(Implement) In the companion, verify TD(0) converges to the analytic on the random walk; that Q-learning (GLIE) recovers the optimal cliff-edge policy while SARSA settles on the safer, slightly longer route; and that at a fixed Q-learning’s start-state value is at least SARSA’s.
Solution
See
experiments/python/week04/test_td.py: TD(0) matchesdp.policy_evaluationwithin a sampling tolerance; Q-learning’s greedy policy attains the optimal start-state value fromdp.value_iterationwhile SARSA’s greedy is a sensible but suboptimal goal-reaching route (the safe path), so the off-policy greedy value is the on-policy one; and the fixed- start-state estimate of Q-learning is SARSA’s — the optimism/realism gap. -
(Extend) Sweep in -step TD on the random walk and reproduce the characteristic U-shaped error-vs- curve (an intermediate beats both TD(0) and Monte Carlo).
Solution
Bootstrapping bias falls with while sampling variance rises; the sum is minimized at intermediate . The qualitative U-curve over (for a fixed ) is the standard Sutton & Barto result; the companion’s
--nstepsweep reproduces it.
Companion code
The Week-4 companion lives at experiments/python/week04/ and reuses earlier
weeks: the random walk from Week 3 (randomwalk.py) for prediction, and the Week-1
dp solvers as the exact oracle for both prediction and control.
cliffwalk.py— a self-contained cliff-walking MDP (the Sutton & Barto Example 6.6 layout: a cliff row that penalizes and resets) as a generic(P, R)array pair, so and the optimal policy come fromdp.value_iteration.td.py— the transition sampler plustd0_prediction,sarsa, andq_learning, all operating on a generic(P, R, terminals), with optional GLIE schedules (visit-count step sizes, decaying ). Pure NumPy.test_td.py— mathematical-correctness tests: TD(0) converges to the analytic ; under GLIE Q-learning recovers the optimal start-state value while SARSA learns the safer, slightly longer route; and Q-learning’s fixed- start value is at least SARSA’s (the cliff’s on-/off-policy gap).
# core algorithms + correctness tests (pure NumPy, no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week04/test_td.py -q
# worked SARSA-vs-Q-learning comparison on the cliff (prints both learned paths)
PYTHONPATH=. python experiments/python/week04/cliffwalk.py