Part I · Foundations Week 2 Published async_dp.py test_async_dp.py

Asynchronous and Prioritized Dynamic Programming

Keeping the gamma-contraction but varying the schedule: asynchronous and Gauss–Seidel value iteration, prioritized sweeping, and real-time DP. Why the order of backups is free — monotonicity plus a constant-shift identity — and how update order and residual size set the practical convergence rate.

On this page

Generalized policy iteration: the space between the poles
Asynchronous value iteration
Why the order is free: monotonicity and the constant-shift identity
Gauss–Seidel sweeps and the numerical-analysis analogy
The Bellman residual and residual scheduling
Prioritized sweeping
Real-time dynamic programming
Counting the work: iterations and samples
The dynamic-programming bridge
What’s next
Exercises
Companion code

Asynchronous and Prioritized Dynamic Programming

Where we are. Chapter 1 proved that value iteration and policy iteration both reach the optimal value function because the Bellman operators are $\discount$ -contractions. Both methods share one extravagance: every sweep backs up every state, in lockstep, whether or not that state’s value has stopped moving. This chapter keeps the contraction and varies only the schedule — which states we back up, in what order, and using whose values. The load-bearing claim is that the order is nearly free: as long as every state is updated infinitely often, asynchronous dynamic programming converges to the same fixed point $\optvaluefn$ , and ordering the updates well — Gauss–Seidel, prioritized sweeping, real-time DP — buys large savings without touching the convergence theory.

Chapter 2 — at a glance

Goal. Define asynchronous and Gauss–Seidel value iteration; prove they converge using two structural properties (monotonicity and a constant-shift identity); read prioritized sweeping and real-time DP off the same theory; and quantify how update order and the discount set the number of backups.

Reading time. ~35 minutes; ~55 with the proof and exercises.

Key insight — the DP bridge. Synchronous value iteration is the Jacobi iteration for a nonlinear fixed-point system; Gauss–Seidel value iteration is its in-place cousin and prioritized sweeping its residual-ordered cousin — exactly the hierarchy classical numerical linear algebra builds for linear systems. The discount $\discount$ is again the load-bearing constant: it is the contraction modulus, and $1/(1-\discount)$ is the effective horizon that sets the iteration count.

Generalized policy iteration: the space between the poles

Chapter 1 closed by naming value iteration and policy iteration the two poles of generalized policy iteration (GPI): VI applies one optimality backup everywhere and reads off a greedy policy at the end; PI evaluates a policy exactly (a linear solve) and then improves it greedily. The methods in between truncate the evaluation. Modified policy iteration Puterman (1994) runs $m$ sweeps of the expectation operator $\bellman^\policy$ in place of the exact solve: $m=1$ recovers value iteration, $m\to\infty$ recovers policy iteration, and intermediate $m$ trades evaluation cost against the number of outer improvements. Sutton & Barto Sutton & Barto (2018) frame the entire family as two interacting processes — make the value consistent with the policy, make the policy greedy for the value — that converge to the same fixed point regardless of how finely they are interleaved.

This chapter relaxes a different lockstep: not how much we evaluate between improvements, but which states we touch on each pass and in what order.

Asynchronous value iteration

Synchronous value iteration computes the whole vector $\bellmanopt V_k$ from $V_k$ and only then overwrites — every state is backed up from the same old values. Asynchronous value iteration drops that barrier.

Definition 2.1 (Asynchronous value iteration).

Asynchronous value iteration maintains a single value vector $V$ and repeats: select a state $s$ and overwrite

V(s) \;\leftarrow\; (\bellmanopt V)(s) = \max_{a}\Big[\reward(s,a) + \discount \sum_{s'}\transition(s'\mid s,a)\,V(s')\Big],

using the current entries of $V$ — including any updated earlier in the pass. States may be visited in any order, subject only to the fairness condition that every state is selected infinitely often. The special case that sweeps the states in a fixed order $1,2,\dots,\lvert\statespace\rvert$ , each backup seeing the freshly written values of its predecessors, is Gauss–Seidel value iteration.

The synchronous operator of Chapter 1 reuses no in-pass information; Gauss–Seidel reuses all of it; general asynchronous iteration lives anywhere between. The question is whether dropping the barrier costs us the convergence guarantee. It does not, and the reason is two structural facts that cost one line each.

Why the order is free: monotonicity and the constant-shift identity

Proposition 2.1 (Monotonicity and constant-shift).

Let $T$ be either Bellman operator ( $\bellman^\policy$ or $\bellmanopt$ ). For all $u,v\in\R^{\statespace}$ and every constant $c\in\R$ , writing $\mathbf 1$ for the all-ones vector:

Monotonicity. If $u \le v$ componentwise, then $Tu \le Tv$ .
Constant-shift. $T(v + c\,\mathbf 1) = Tv + \discount c\,\mathbf 1$ .

Proof.

Both read off the backup $Q_v(s,a) \defeq \reward(s,a) + \discount\sum_{s'} \transition(s'\mid s,a)\,v(s')$ , of which $(\bellmanopt v)(s)=\max_a Q_v(s,a)$ and $(\bellman^\policy v)(s)=\sum_a\policy(a\mid s)Q_v(s,a)$ .

Monotonicity. If $u\le v$ then $\sum_{s'}\transition(s'\mid s,a)u(s') \le \sum_{s'}\transition(s'\mid s,a)v(s')$ because the transition weights are nonnegative; hence $Q_u(s,a)\le Q_v(s,a)$ for every $(s,a)$ , and taking the $\max$ (or the $\policy$ -average) over $a$ preserves the inequality.

Constant-shift. Since $\sum_{s'}\transition(s'\mid s,a)=1$ , replacing $v$ by $v+c\,\mathbf 1$ adds exactly $\discount c$ to every $Q_v(s,a)$ ; and $\max_a(x_a+\discount c)=\max_a x_a+\discount c$ (likewise for the average). $\qquad\blacksquare$

These two facts — and not the full contraction algebra — are what make asynchronous order irrelevant.

Theorem 2.1 (Asynchronous convergence).

Run asynchronous value iteration (Def. 2.1) from any $V_0$ , updating every state infinitely often. Then $V_k \to \optvaluefn$ . Moreover, grouping the schedule into rounds — a round ends once every state has been backed up at least once since the round began — the sup-norm error contracts by a factor $\discount$ per round:

\norm{V_{\text{after round } m} - \optvaluefn}_\infty \le \discount^{m}\, \norm{V_0 - \optvaluefn}_\infty .

Proof.

Let $c \defeq \norm{V_0-\optvaluefn}_\infty$ , so the iterate starts inside the box $\optvaluefn - c\,\mathbf 1 \le V_0 \le \optvaluefn + c\,\mathbf 1$ . Suppose at some moment $V$ lies in this box. Back up any state $s$ . By monotonicity and then constant-shift (Prop. 2.1), using $\bellmanopt\optvaluefn=\optvaluefn$ ,

(\bellmanopt V)(s) \le \big(\bellmanopt(\optvaluefn + c\,\mathbf 1)\big)(s) = \optvaluefn(s) + \discount c, \qquad (\bellmanopt V)(s) \ge \optvaluefn(s) - \discount c .

So the updated entry lands in the smaller box $[\optvaluefn(s)-\discount c,\, \optvaluefn(s)+\discount c]\subseteq[\optvaluefn(s)-c,\,\optvaluefn(s)+c]$ (as $\discount<1$ ). Two consequences: the iterate never leaves the original $c$ -box, and every state, once backed up, sits in the $\discount c$ -box. Crucially, later backups in the round still read values inside the original $c$ -box, so they too map into the $\discount c$ -box. Hence after one full round $\norm{V-\optvaluefn}_\infty\le\discount c$ . Applying the argument round by round gives $\discount^{m}c$ after $m$ rounds; fairness guarantees infinitely many rounds, so $V_k\to\optvaluefn$ . $\qquad\blacksquare$

Synchronous value iteration is the special case in which a round is a single simultaneous sweep, and the bound collapses to Chapter 1’s $\norm{V_k-\optvaluefn}_\infty\le\discount^{k}\norm{V_0-\optvaluefn}_\infty$ . The abstract version — monotone contraction operators converge under any fair asynchronous schedule — is the backbone of Bertsekas’s treatment of dynamic programming Bertsekas (2017) .

Gauss–Seidel sweeps and the numerical-analysis analogy

The names are borrowed deliberately. Solving a linear system $Ax=b$ by splitting $A$ and iterating, Jacobi updates every coordinate from the previous iterate (a synchronous sweep), while Gauss–Seidel updates coordinate $i$ using the already-updated coordinates $1,\dots,i-1$ — in place, no second buffer.

Value iteration is the nonlinear analogue: synchronous VI is Jacobi, Gauss–Seidel VI is Gauss–Seidel. The in-place version typically converges in fewer sweeps because information propagates across the state space within a single pass rather than waiting a full sweep per hop, and it halves the memory (no separate

V_{k+1}

buffer). Theorem 2.1 already covers it: a left-to-right sweep updates every state once, so it is exactly one round.

The order within a sweep is now a design lever. If states are numbered along the direction reward information flows — outward from a goal, say — a single Gauss–Seidel sweep can propagate values across the whole space, whereas the reverse order wastes most of the pass. This is the seed of prioritizing the order rather than fixing it.

The Bellman residual and residual scheduling

How do we know where values are still moving? The same quantity that certifies stopping. Recall from Chapter 1 the local Bellman residual at a state, the amount one more backup would change it,

\Delta(s) \defeq \big\lvert (\bellmanopt V)(s) - V(s) \big\rvert ,

and that the global residual $\norm{\bellmanopt V - V}_\infty$ controls the true error through $\norm{V-\optvaluefn}_\infty \le \frac{\discount}{1-\discount} \norm{\bellmanopt V - V}_\infty$ . A uniform sweep spends equal effort on states with $\Delta(s)\approx 0$ (already converged) and on states with large $\Delta(s)$ (still far off). Residual scheduling is the obvious correction: back up states in decreasing order of $\Delta(s)$ , and skip states whose residual sits below a tolerance $\theta$ . Asynchronous convergence (Thm. 2.1) licenses any such order as long as no state is starved.

Prioritized sweeping

Prioritized sweeping Moore & Atkeson (1993) makes residual scheduling precise with a priority queue, and adds the key idea that a backup only changes a state’s predecessors, so only their priorities need refreshing.

initialize V ← 0;  priority queue PQ keyed by Bellman residual
seed PQ with every state s at priority |(𝒯* V)(s) − V(s)|
repeat:
    s ← PQ.pop_max()                      # largest-residual state
    V(s) ← (𝒯* V)(s)                       # one optimality backup
    for each (s̄, a) with p(s | s̄, a) > 0:  # predecessors that feed s
        e ← |(𝒯* V)(s̄) − V(s̄)|             # their refreshed residual
        if e > θ:  PQ.push(s̄, priority = e)
until PQ empty                            # every residual below θ

On a sparse-reward problem the queue concentrates work near the frontier where values are actually changing, leaving the converged interior untouched; Moore and Atkeson report order-of-magnitude reductions in backups to reach a target accuracy on large MDPs. The method is exact-model dynamic programming — it needs $\transition$ to form both the backup and the predecessor sets — but it is the direct ancestor of the sampled prioritized-replay schemes that return in Week 6.

Real-time dynamic programming

Prioritized sweeping orders updates by residual; real-time dynamic programming (RTDP) orders them by reachability Barto et al. (1995) . Instead of enumerating states, RTDP backs up only the states it actually visits while acting greedily, interleaving computation with control.

repeat for each trial:
    s ← start state
    while s is not terminal:
        V(s) ← (𝒯* V)(s)               # back up the current state
        a ← greedy action at s w.r.t. V
        s ← sample s' ∼ p(· | s, a)     # follow the greedy trajectory

Because backups concentrate on the states reachable under good policies, RTDP can solve problems whose full state space is far too large to sweep, converging on the relevant subset without ever touching the rest. It is also the conceptual hinge to model-free learning: replace the expectation $\sum_{s'}\transition(s'\mid s,a)$ in the backup with the single sampled successor, and the deterministic backup becomes a stochastic one — RTDP’s Monte-Carlo cousin is Q-learning, which Weeks 3–4 build from exactly this move.

Counting the work: iterations and samples

How many backups does the discount cost us? For synchronous value iteration the geometric bound answers it directly.

Proposition 2.2 (Iteration complexity of value iteration).

Starting from $V_0=0$ , synchronous value iteration reaches $\norm{V_k-\optvaluefn}_\infty\le\varepsilon$ after

k \;\ge\; \frac{\ln\!\big(R_{\max}\,/\,[\varepsilon(1-\discount)]\big)}{\ln(1/\discount)} \;=\; O\!\Big(\frac{1}{1-\discount}\,\ln\frac{1}{\varepsilon}\Big)

sweeps, where $R_{\max}=\max_{s,a}\lvert\reward(s,a)\rvert$ .

Proof.

With $V_0=0$ , Chapter 1’s Corollary 1.1 gives $\norm{V_k-\optvaluefn}_\infty \le \discount^{k}\norm{\optvaluefn}_\infty \le \discount^{k} R_{\max}/(1-\discount)$ , using the return bound $\norm{\optvaluefn}_\infty\le R_{\max}/(1-\discount)$ . Requiring the right side $\le\varepsilon$ and taking logs gives the displayed $k$ . The order form uses $\ln(1/\discount)\ge 1-\discount$ for $\discount\in(0,1)$ , so $1/\ln(1/\discount)\le 1/(1-\discount)$ . $\qquad\blacksquare$

The count is only logarithmic in the accuracy $\varepsilon$ but linear in the effective horizon $1/(1-\discount)$ — push $\discount$ toward $1$ and the work grows without bound. Tseng Tseng (1990) made this precise, showing stationary discounted MDPs are solved in time proportional to $\log$ of the horizon. The same $1/(1-\discount)$ governs the sample cost once the model is unknown: given only a generative model (a sampler for $\transition$ ), estimating an $\varepsilon$ -optimal value needs $\tilde O\!\big(\lvert\statespace\rvert\lvert\actionspace\rvert (1-\discount)^{-3}\varepsilon^{-2}\big)$ samples, a bound matched from both sides by Azar et al. Azar et al. (2013) and achieved in near-optimal time by the variance-reduced value iteration of Sidford et al. Sidford et al. (2018) . That cubic blow-up in the horizon is precisely the pressure that drives the curriculum out of the exact-model regime and into sampling (Week 3) and function approximation (Week 5).

The dynamic-programming bridge

Asynchronous DP is where the numerical-analysis reading of value iteration becomes literal. A value function is a point in $\R^{\statespace}$ ; the Bellman operator is a nonlinear map whose fixed point we chase; and the Jacobi / Gauss–Seidel / residual-ordered hierarchy is the same one used to solve large sparse linear systems, transplanted to a contraction that happens to involve a $\max$ . Two threads carry forward:

To online control (MPC, Week 15). RTDP’s instinct — compute only over the states you will actually visit, refreshing as you go — is the tabular shadow of model predictive control, which re-solves a short-horizon optimal-control problem from the current state at every step instead of storing $\optvaluefn$ everywhere.
To model-free RL (Weeks 3–6). Replace the model expectation in a backup with a sampled successor and asynchronous DP becomes temporal-difference learning; prioritized sweeping becomes prioritized experience replay. The schedule ideas of this chapter reappear, now applied to which transitions to learn from.

What’s next

Week 3 removes the known model entirely: values are estimated from sampled returns (Monte Carlo), trading the model expectation for an empirical average and introducing the bias–variance questions that dominate the rest of Part I.
Week 4 fuses the two views — bootstrapping like DP, sampling like Monte Carlo — into temporal-difference learning, the stochastic-approximation form of the asynchronous backup above.

Exercises

(Prove) Show that the Bellman optimality operator satisfies the constant-shift identity $\bellmanopt(v+c\,\mathbf 1)=\bellmanopt v+\discount c\,\mathbf 1$ for every $c\in\R$ (Prop. 2.1(2)), and explain where the proof of Theorem 2.1 uses it.

Solution
For each $(s,a)$ , $\sum_{s'}\transition(s'\mid s,a)\,(v(s')+c)=\sum_{s'} \transition(s'\mid s,a)v(s')+c$ since the transition row sums to $1$ , so every $Q_v(s,a)$ rises by $\discount c$ ; the per-state $\max_a$ then rises by $\discount c$ . Theorem 2.1 uses it to evaluate $\bellmanopt(\optvaluefn\pm c\,\mathbf 1)=\optvaluefn\pm\discount c\,\mathbf 1$ , which shrinks the bounding box by a factor $\discount$ each round.
(Derive) Argue that one full Gauss–Seidel sweep is one “round” in the sense of Theorem 2.1, and conclude that Gauss–Seidel value iteration converges at least as fast (per sweep) as synchronous value iteration.

Solution
A left-to-right sweep updates each state exactly once, so by the end every state has been backed up since the sweep began — a round — and Theorem 2.1 gives $\norm{V-\optvaluefn}_\infty\le\discount\norm{V_{\text{pre}}-\optvaluefn}_\infty$ . Synchronous VI achieves the same per-sweep factor but reads only stale values; Gauss–Seidel additionally uses freshly updated predecessors within the sweep, so its actual contraction per sweep is no worse and usually strictly better.
(Compute) A backup at state $s$ changes the residual of which other states? Give the set in terms of $\transition$ , and explain why prioritized sweeping only re-queues those.

Solution
Only the predecessors $\{\bar s : \exists a,\ \transition(s\mid\bar s,a)>0\}$ : a state $\bar s$ ‘s backup depends on $V(s)$ only if some action from $\bar s$ can reach $s$ . Every other state’s residual is unchanged by overwriting $V(s)$ , so re-evaluating it would be wasted work — hence the predecessor loop in the prioritized-sweeping inner step.
(Implement) In the companion, run synchronous VI, Gauss–Seidel VI, and prioritized sweeping on the stochastic gridworld and verify (i) all three reach the policy-iteration value $\optvaluefn$ , and (ii) to reach a fixed target accuracy $\varepsilon$ — set the prioritized-sweeping threshold $\theta=\varepsilon(1-\discount)/\discount$ — prioritized sweeping spends fewer state-backups than the synchronous sweep count $\lvert\statespace\rvert\times(\text{sweeps})$ .

Solution
See experiments/python/week02/test_async_dp.py: it asserts each method matches dp.policy_iteration, and that at $\varepsilon=10^{-3}$ the prioritized-sweeping backup count is below the synchronous total. The win is in reaching useful accuracy fast — the queue concentrates on high-residual states — and it widens with $\lvert\statespace\rvert$ ; the edge narrows as $\varepsilon$ approaches machine precision, where residuals near the fixed point thrash the queue.
(Extend) Perturb the gridworld’s transition model (raise the slip probability) and measure how the optimal policy and $\optvaluefn$ change. Relate the sensitivity to the $1/(1-\discount)$ factor in Proposition 2.2.

Solution
A model error of size $\delta$ in $\transition$ perturbs the fixed point by $O\!\big(\discount\delta/(1-\discount)\big)$ (a simulation-lemma bound — a perturbation bound on the DP fixed point): the same effective horizon that sets the iteration count also amplifies model error, which is the quantitative reason model misspecification hurts more as $\discount\to1$ . The companion’s --slip sweep exhibits the effect.

Companion code

The Week-2 companion lives at experiments/python/week02/ and reuses the Week-1 primitives — it imports dp.py for action_values, bellman_optimality_operator, value_iteration, and policy_iteration rather than reimplementing them, so the asynchronous methods are checked against the same reference fixed point.

gridworld.py — builds a parametric stochastic gridworld as a generic (P, R, gamma) MDP in the dp.py representation: a goal cell, a per-step cost, and a slip probability that randomizes the intended move. Pure NumPy.
async_dp.py — Gauss–Seidel (in-place) value iteration and model-based prioritized sweeping with an explicit predecessor index and a backup counter, both built on the dp.py operators.
test_async_dp.py — mathematical-correctness tests: Gauss–Seidel and prioritized sweeping converge to the same $\optvaluefn$ as value_iteration and policy_iteration (from arbitrary starts); the constant-shift and monotonicity properties of Proposition 2.1 hold numerically; and prioritized sweeping uses fewer backups than a synchronous sweep schedule to reach a target accuracy on a sparse-reward grid.

# core algorithms + correctness tests (pure NumPy, no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week02/test_async_dp.py -q

# worked comparison + optional value/residual heatmaps (saved locally, not committed)
PYTHONPATH=. python experiments/python/week02/async_dp.py --plot