Part I · Foundations Week 2 Published async_dp.py test_async_dp.py

Asynchronous and Prioritized Dynamic Programming

Keeping the gamma-contraction but varying the schedule: asynchronous and Gauss–Seidel value iteration, prioritized sweeping, and real-time DP. Why the order of backups is free — monotonicity plus a constant-shift identity — and how update order and residual size set the practical convergence rate.

On this page
  1. Generalized policy iteration: the space between the poles
  2. Asynchronous value iteration
  3. Why the order is free: monotonicity and the constant-shift identity
  4. Gauss–Seidel sweeps and the numerical-analysis analogy
  5. The Bellman residual and residual scheduling
  6. Prioritized sweeping
  7. Real-time dynamic programming
  8. Counting the work: iterations and samples
  9. The dynamic-programming bridge
  10. What’s next
  11. Exercises
  12. Companion code

Asynchronous and Prioritized Dynamic Programming

Where we are. Chapter 1 proved that value iteration and policy iteration both reach the optimal value function because the Bellman operators are γ\discount-contractions. Both methods share one extravagance: every sweep backs up every state, in lockstep, whether or not that state’s value has stopped moving. This chapter keeps the contraction and varies only the schedule — which states we back up, in what order, and using whose values. The load-bearing claim is that the order is nearly free: as long as every state is updated infinitely often, asynchronous dynamic programming converges to the same fixed point v\optvaluefn, and ordering the updates well — Gauss–Seidel, prioritized sweeping, real-time DP — buys large savings without touching the convergence theory.

Generalized policy iteration: the space between the poles

Chapter 1 closed by naming value iteration and policy iteration the two poles of generalized policy iteration (GPI): VI applies one optimality backup everywhere and reads off a greedy policy at the end; PI evaluates a policy exactly (a linear solve) and then improves it greedily. The methods in between truncate the evaluation. Modified policy iteration Puterman (1994) runs mm sweeps of the expectation operator Tπ\bellman^\policy in place of the exact solve: m=1m=1 recovers value iteration, mm\to\infty recovers policy iteration, and intermediate mm trades evaluation cost against the number of outer improvements. Sutton & Barto Sutton & Barto (2018) frame the entire family as two interacting processes — make the value consistent with the policy, make the policy greedy for the value — that converge to the same fixed point regardless of how finely they are interleaved.

This chapter relaxes a different lockstep: not how much we evaluate between improvements, but which states we touch on each pass and in what order.

Asynchronous value iteration

Synchronous value iteration computes the whole vector TVk\bellmanopt V_k from VkV_k and only then overwrites — every state is backed up from the same old values. Asynchronous value iteration drops that barrier.

Definition 2.1 (Asynchronous value iteration).

Asynchronous value iteration maintains a single value vector VV and repeats: select a state ss and overwrite

V(s)    (TV)(s)=maxa[r(s,a)+γsp(ss,a)V(s)],V(s) \;\leftarrow\; (\bellmanopt V)(s) = \max_{a}\Big[\reward(s,a) + \discount \sum_{s'}\transition(s'\mid s,a)\,V(s')\Big],

using the current entries of VV — including any updated earlier in the pass. States may be visited in any order, subject only to the fairness condition that every state is selected infinitely often. The special case that sweeps the states in a fixed order 1,2,,S1,2,\dots,\lvert\statespace\rvert, each backup seeing the freshly written values of its predecessors, is Gauss–Seidel value iteration.

The synchronous operator of Chapter 1 reuses no in-pass information; Gauss–Seidel reuses all of it; general asynchronous iteration lives anywhere between. The question is whether dropping the barrier costs us the convergence guarantee. It does not, and the reason is two structural facts that cost one line each.

Why the order is free: monotonicity and the constant-shift identity

Proposition 2.1 (Monotonicity and constant-shift).

Let TT be either Bellman operator (Tπ\bellman^\policy or T\bellmanopt). For all u,vRSu,v\in\R^{\statespace} and every constant cRc\in\R, writing 1\mathbf 1 for the all-ones vector:

  1. Monotonicity. If uvu \le v componentwise, then TuTvTu \le Tv.
  2. Constant-shift. T(v+c1)=Tv+γc1T(v + c\,\mathbf 1) = Tv + \discount c\,\mathbf 1.
Proof.

Both read off the backup Qv(s,a)r(s,a)+γsp(ss,a)v(s)Q_v(s,a) \defeq \reward(s,a) + \discount\sum_{s'} \transition(s'\mid s,a)\,v(s'), of which (Tv)(s)=maxaQv(s,a)(\bellmanopt v)(s)=\max_a Q_v(s,a) and (Tπv)(s)=aπ(as)Qv(s,a)(\bellman^\policy v)(s)=\sum_a\policy(a\mid s)Q_v(s,a).

Monotonicity. If uvu\le v then sp(ss,a)u(s)sp(ss,a)v(s)\sum_{s'}\transition(s'\mid s,a)u(s') \le \sum_{s'}\transition(s'\mid s,a)v(s') because the transition weights are nonnegative; hence Qu(s,a)Qv(s,a)Q_u(s,a)\le Q_v(s,a) for every (s,a)(s,a), and taking the max\max (or the π\policy-average) over aa preserves the inequality.

Constant-shift. Since sp(ss,a)=1\sum_{s'}\transition(s'\mid s,a)=1, replacing vv by v+c1v+c\,\mathbf 1 adds exactly γc\discount c to every Qv(s,a)Q_v(s,a); and maxa(xa+γc)=maxaxa+γc\max_a(x_a+\discount c)=\max_a x_a+\discount c (likewise for the average). \qquad\blacksquare

These two facts — and not the full contraction algebra — are what make asynchronous order irrelevant.

Theorem 2.1 (Asynchronous convergence).

Run asynchronous value iteration (Def. 2.1) from any V0V_0, updating every state infinitely often. Then VkvV_k \to \optvaluefn. Moreover, grouping the schedule into rounds — a round ends once every state has been backed up at least once since the round began — the sup-norm error contracts by a factor γ\discount per round:

Vafter round mvγmV0v.\norm{V_{\text{after round } m} - \optvaluefn}_\infty \le \discount^{m}\, \norm{V_0 - \optvaluefn}_\infty .
Proof.

Let cV0vc \defeq \norm{V_0-\optvaluefn}_\infty, so the iterate starts inside the box vc1V0v+c1\optvaluefn - c\,\mathbf 1 \le V_0 \le \optvaluefn + c\,\mathbf 1. Suppose at some moment VV lies in this box. Back up any state ss. By monotonicity and then constant-shift (Prop. 2.1), using Tv=v\bellmanopt\optvaluefn=\optvaluefn,

(TV)(s)(T(v+c1))(s)=v(s)+γc,(TV)(s)v(s)γc.(\bellmanopt V)(s) \le \big(\bellmanopt(\optvaluefn + c\,\mathbf 1)\big)(s) = \optvaluefn(s) + \discount c, \qquad (\bellmanopt V)(s) \ge \optvaluefn(s) - \discount c .

So the updated entry lands in the smaller box [v(s)γc,v(s)+γc][v(s)c,v(s)+c][\optvaluefn(s)-\discount c,\, \optvaluefn(s)+\discount c]\subseteq[\optvaluefn(s)-c,\,\optvaluefn(s)+c] (as γ<1\discount<1). Two consequences: the iterate never leaves the original cc-box, and every state, once backed up, sits in the γc\discount c-box. Crucially, later backups in the round still read values inside the original cc-box, so they too map into the γc\discount c-box. Hence after one full round Vvγc\norm{V-\optvaluefn}_\infty\le\discount c. Applying the argument round by round gives γmc\discount^{m}c after mm rounds; fairness guarantees infinitely many rounds, so VkvV_k\to\optvaluefn. \qquad\blacksquare

Synchronous value iteration is the special case in which a round is a single simultaneous sweep, and the bound collapses to Chapter 1’s VkvγkV0v\norm{V_k-\optvaluefn}_\infty\le\discount^{k}\norm{V_0-\optvaluefn}_\infty. The abstract version — monotone contraction operators converge under any fair asynchronous schedule — is the backbone of Bertsekas’s treatment of dynamic programming Bertsekas (2017) .

Gauss–Seidel sweeps and the numerical-analysis analogy

The names are borrowed deliberately. Solving a linear system Ax=bAx=b by splitting AA and iterating, Jacobi updates every coordinate from the previous iterate (a synchronous sweep), while Gauss–Seidel updates coordinate ii using the already-updated coordinates 1,,i11,\dots,i-1 — in place, no second buffer.

Value iteration is the nonlinear analogue: synchronous VI is Jacobi, Gauss–Seidel VI is Gauss–Seidel. The in-place version typically converges in fewer sweeps because information propagates across the state space within a single pass rather than waiting a full sweep per hop, and it halves the memory (no separate Vk+1V_{k+1} buffer). Theorem 2.1 already covers it: a left-to-right sweep updates every state once, so it is exactly one round.

The order within a sweep is now a design lever. If states are numbered along the direction reward information flows — outward from a goal, say — a single Gauss–Seidel sweep can propagate values across the whole space, whereas the reverse order wastes most of the pass. This is the seed of prioritizing the order rather than fixing it.

The Bellman residual and residual scheduling

How do we know where values are still moving? The same quantity that certifies stopping. Recall from Chapter 1 the local Bellman residual at a state, the amount one more backup would change it,

Δ(s)(TV)(s)V(s),\Delta(s) \defeq \big\lvert (\bellmanopt V)(s) - V(s) \big\rvert ,

and that the global residual TVV\norm{\bellmanopt V - V}_\infty controls the true error through Vvγ1γTVV\norm{V-\optvaluefn}_\infty \le \frac{\discount}{1-\discount} \norm{\bellmanopt V - V}_\infty. A uniform sweep spends equal effort on states with Δ(s)0\Delta(s)\approx 0 (already converged) and on states with large Δ(s)\Delta(s) (still far off). Residual scheduling is the obvious correction: back up states in decreasing order of Δ(s)\Delta(s), and skip states whose residual sits below a tolerance θ\theta. Asynchronous convergence (Thm. 2.1) licenses any such order as long as no state is starved.

Prioritized sweeping

Prioritized sweeping Moore & Atkeson (1993) makes residual scheduling precise with a priority queue, and adds the key idea that a backup only changes a state’s predecessors, so only their priorities need refreshing.

initialize V ← 0;  priority queue PQ keyed by Bellman residual
seed PQ with every state s at priority |(𝒯* V)(s) − V(s)|
repeat:
    s ← PQ.pop_max()                      # largest-residual state
    V(s) ← (𝒯* V)(s)                       # one optimality backup
    for each (s̄, a) with p(s | s̄, a) > 0:  # predecessors that feed s
        e ← |(𝒯* V)(s̄) − V(s̄)|             # their refreshed residual
        if e > θ:  PQ.push(s̄, priority = e)
until PQ empty                            # every residual below θ

On a sparse-reward problem the queue concentrates work near the frontier where values are actually changing, leaving the converged interior untouched; Moore and Atkeson report order-of-magnitude reductions in backups to reach a target accuracy on large MDPs. The method is exact-model dynamic programming — it needs p\transition to form both the backup and the predecessor sets — but it is the direct ancestor of the sampled prioritized-replay schemes that return in Week 6.

Real-time dynamic programming

Prioritized sweeping orders updates by residual; real-time dynamic programming (RTDP) orders them by reachability Barto et al. (1995) . Instead of enumerating states, RTDP backs up only the states it actually visits while acting greedily, interleaving computation with control.

repeat for each trial:
    s ← start state
    while s is not terminal:
        V(s) ← (𝒯* V)(s)               # back up the current state
        a ← greedy action at s w.r.t. V
        s ← sample s' ∼ p(· | s, a)     # follow the greedy trajectory

Because backups concentrate on the states reachable under good policies, RTDP can solve problems whose full state space is far too large to sweep, converging on the relevant subset without ever touching the rest. It is also the conceptual hinge to model-free learning: replace the expectation sp(ss,a)\sum_{s'}\transition(s'\mid s,a) in the backup with the single sampled successor, and the deterministic backup becomes a stochastic one — RTDP’s Monte-Carlo cousin is Q-learning, which Weeks 3–4 build from exactly this move.

Counting the work: iterations and samples

How many backups does the discount cost us? For synchronous value iteration the geometric bound answers it directly.

Proposition 2.2 (Iteration complexity of value iteration).

Starting from V0=0V_0=0, synchronous value iteration reaches Vkvε\norm{V_k-\optvaluefn}_\infty\le\varepsilon after

k    ln ⁣(Rmax/[ε(1γ)])ln(1/γ)  =  O ⁣(11γln1ε)k \;\ge\; \frac{\ln\!\big(R_{\max}\,/\,[\varepsilon(1-\discount)]\big)}{\ln(1/\discount)} \;=\; O\!\Big(\frac{1}{1-\discount}\,\ln\frac{1}{\varepsilon}\Big)

sweeps, where Rmax=maxs,ar(s,a)R_{\max}=\max_{s,a}\lvert\reward(s,a)\rvert.

Proof.

With V0=0V_0=0, Chapter 1’s Corollary 1.1 gives VkvγkvγkRmax/(1γ)\norm{V_k-\optvaluefn}_\infty \le \discount^{k}\norm{\optvaluefn}_\infty \le \discount^{k} R_{\max}/(1-\discount), using the return bound vRmax/(1γ)\norm{\optvaluefn}_\infty\le R_{\max}/(1-\discount). Requiring the right side ε\le\varepsilon and taking logs gives the displayed kk. The order form uses ln(1/γ)1γ\ln(1/\discount)\ge 1-\discount for γ(0,1)\discount\in(0,1), so 1/ln(1/γ)1/(1γ)1/\ln(1/\discount)\le 1/(1-\discount). \qquad\blacksquare

The count is only logarithmic in the accuracy ε\varepsilon but linear in the effective horizon 1/(1γ)1/(1-\discount) — push γ\discount toward 11 and the work grows without bound. Tseng Tseng (1990) made this precise, showing stationary discounted MDPs are solved in time proportional to log\log of the horizon. The same 1/(1γ)1/(1-\discount) governs the sample cost once the model is unknown: given only a generative model (a sampler for p\transition), estimating an ε\varepsilon-optimal value needs O~ ⁣(SA(1γ)3ε2)\tilde O\!\big(\lvert\statespace\rvert\lvert\actionspace\rvert (1-\discount)^{-3}\varepsilon^{-2}\big) samples, a bound matched from both sides by Azar et al. Azar et al. (2013) and achieved in near-optimal time by the variance-reduced value iteration of Sidford et al. Sidford et al. (2018) . That cubic blow-up in the horizon is precisely the pressure that drives the curriculum out of the exact-model regime and into sampling (Week 3) and function approximation (Week 5).

The dynamic-programming bridge

Asynchronous DP is where the numerical-analysis reading of value iteration becomes literal. A value function is a point in RS\R^{\statespace}; the Bellman operator is a nonlinear map whose fixed point we chase; and the Jacobi / Gauss–Seidel / residual-ordered hierarchy is the same one used to solve large sparse linear systems, transplanted to a contraction that happens to involve a max\max. Two threads carry forward:

  • To online control (MPC, Week 15). RTDP’s instinct — compute only over the states you will actually visit, refreshing as you go — is the tabular shadow of model predictive control, which re-solves a short-horizon optimal-control problem from the current state at every step instead of storing v\optvaluefn everywhere.
  • To model-free RL (Weeks 3–6). Replace the model expectation in a backup with a sampled successor and asynchronous DP becomes temporal-difference learning; prioritized sweeping becomes prioritized experience replay. The schedule ideas of this chapter reappear, now applied to which transitions to learn from.

What’s next

  • Week 3 removes the known model entirely: values are estimated from sampled returns (Monte Carlo), trading the model expectation for an empirical average and introducing the bias–variance questions that dominate the rest of Part I.
  • Week 4 fuses the two views — bootstrapping like DP, sampling like Monte Carlo — into temporal-difference learning, the stochastic-approximation form of the asynchronous backup above.

Exercises

  1. (Prove) Show that the Bellman optimality operator satisfies the constant-shift identity T(v+c1)=Tv+γc1\bellmanopt(v+c\,\mathbf 1)=\bellmanopt v+\discount c\,\mathbf 1 for every cRc\in\R (Prop. 2.1(2)), and explain where the proof of Theorem 2.1 uses it.

    Solution

    For each (s,a)(s,a), sp(ss,a)(v(s)+c)=sp(ss,a)v(s)+c\sum_{s'}\transition(s'\mid s,a)\,(v(s')+c)=\sum_{s'} \transition(s'\mid s,a)v(s')+c since the transition row sums to 11, so every Qv(s,a)Q_v(s,a) rises by γc\discount c; the per-state maxa\max_a then rises by γc\discount c. Theorem 2.1 uses it to evaluate T(v±c1)=v±γc1\bellmanopt(\optvaluefn\pm c\,\mathbf 1)=\optvaluefn\pm\discount c\,\mathbf 1, which shrinks the bounding box by a factor γ\discount each round.

  2. (Derive) Argue that one full Gauss–Seidel sweep is one “round” in the sense of Theorem 2.1, and conclude that Gauss–Seidel value iteration converges at least as fast (per sweep) as synchronous value iteration.

    Solution

    A left-to-right sweep updates each state exactly once, so by the end every state has been backed up since the sweep began — a round — and Theorem 2.1 gives VvγVprev\norm{V-\optvaluefn}_\infty\le\discount\norm{V_{\text{pre}}-\optvaluefn}_\infty. Synchronous VI achieves the same per-sweep factor but reads only stale values; Gauss–Seidel additionally uses freshly updated predecessors within the sweep, so its actual contraction per sweep is no worse and usually strictly better.

  3. (Compute) A backup at state ss changes the residual of which other states? Give the set in terms of p\transition, and explain why prioritized sweeping only re-queues those.

    Solution

    Only the predecessors {sˉ:a, p(ssˉ,a)>0}\{\bar s : \exists a,\ \transition(s\mid\bar s,a)>0\}: a state sˉ\bar s‘s backup depends on V(s)V(s) only if some action from sˉ\bar s can reach ss. Every other state’s residual is unchanged by overwriting V(s)V(s), so re-evaluating it would be wasted work — hence the predecessor loop in the prioritized-sweeping inner step.

  4. (Implement) In the companion, run synchronous VI, Gauss–Seidel VI, and prioritized sweeping on the stochastic gridworld and verify (i) all three reach the policy-iteration value v\optvaluefn, and (ii) to reach a fixed target accuracy ε\varepsilon — set the prioritized-sweeping threshold θ=ε(1γ)/γ\theta=\varepsilon(1-\discount)/\discount — prioritized sweeping spends fewer state-backups than the synchronous sweep count S×(sweeps)\lvert\statespace\rvert\times(\text{sweeps}).

    Solution

    See experiments/python/week02/test_async_dp.py: it asserts each method matches dp.policy_iteration, and that at ε=103\varepsilon=10^{-3} the prioritized-sweeping backup count is below the synchronous total. The win is in reaching useful accuracy fast — the queue concentrates on high-residual states — and it widens with S\lvert\statespace\rvert; the edge narrows as ε\varepsilon approaches machine precision, where residuals near the fixed point thrash the queue.

  5. (Extend) Perturb the gridworld’s transition model (raise the slip probability) and measure how the optimal policy and v\optvaluefn change. Relate the sensitivity to the 1/(1γ)1/(1-\discount) factor in Proposition 2.2.

    Solution

    A model error of size δ\delta in p\transition perturbs the fixed point by O ⁣(γδ/(1γ))O\!\big(\discount\delta/(1-\discount)\big) (a simulation-lemma bound — a perturbation bound on the DP fixed point): the same effective horizon that sets the iteration count also amplifies model error, which is the quantitative reason model misspecification hurts more as γ1\discount\to1. The companion’s --slip sweep exhibits the effect.

Companion code

The Week-2 companion lives at experiments/python/week02/ and reuses the Week-1 primitives — it imports dp.py for action_values, bellman_optimality_operator, value_iteration, and policy_iteration rather than reimplementing them, so the asynchronous methods are checked against the same reference fixed point.

  • gridworld.py — builds a parametric stochastic gridworld as a generic (P, R, gamma) MDP in the dp.py representation: a goal cell, a per-step cost, and a slip probability that randomizes the intended move. Pure NumPy.
  • async_dp.py — Gauss–Seidel (in-place) value iteration and model-based prioritized sweeping with an explicit predecessor index and a backup counter, both built on the dp.py operators.
  • test_async_dp.py — mathematical-correctness tests: Gauss–Seidel and prioritized sweeping converge to the same v\optvaluefn as value_iteration and policy_iteration (from arbitrary starts); the constant-shift and monotonicity properties of Proposition 2.1 hold numerically; and prioritized sweeping uses fewer backups than a synchronous sweep schedule to reach a target accuracy on a sparse-reward grid.
# core algorithms + correctness tests (pure NumPy, no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week02/test_async_dp.py -q

# worked comparison + optional value/residual heatmaps (saved locally, not committed)
PYTHONPATH=. python experiments/python/week02/async_dp.py --plot