Part I · Foundations Week 1 Published dp.py test_dp.py

Markov Decision Processes and Dynamic Programming

The finite MDP, the Bellman expectation and optimality equations, and the gamma-contraction that makes value iteration and policy iteration converge — dynamic programming as the spine the rest of the curriculum returns to.

On this page

Notation and conventions
The finite Markov decision process
Value functions
The Bellman expectation equation
Optimality: the Bellman optimality equation
The contraction that makes it all work
Value iteration
Policy iteration
The dynamic-programming bridge
What’s next
Exercises
Companion code


Where we are. This is the first chapter, and it states the object the whole curriculum orbits: the finite Markov decision process and the two equations that pin down optimal behaviour in it. The claim is small and load-bearing — the optimal value function is the unique fixed point of a $\discount$ -contraction, and value iteration and policy iteration are two ways of reaching that fixed point. Everything later — LQR, MPC, model-based and model-free RL — computes, approximates, or learns this same fixed point when the MDP is too large, too continuous, or unknown.

Chapter 1 — at a glance

Goal. Define the finite MDP; derive the Bellman expectation and optimality equations; prove the Bellman operators are $\discount$ -contractions; read value iteration and policy iteration off the contraction.

Reading time. ~45 minutes; ~75 with the proofs and exercises.

Key insight — the DP bridge. A value function is a fixed point. The optimality operator $\bellmanopt$ contracts the sup-norm by $\discount$ , so by the Banach fixed-point theorem it has exactly one fixed point, $\optvaluefn$ , and every iterate converges to it geometrically. This one fact is the spine of the book: LQR (Ch. 13) is this fixed point in closed form on linear-quadratic dynamics; MPC (Ch. 15) is this fixed point recomputed online over a finite horizon; model-free RL (Ch. 5+) is stochastic approximation of the same operator from samples.

Notation and conventions

We use the case convention of Sutton & Barto (2018) throughout: lowercase names a true object we want ( $\valuefn_\policy$ ), and uppercase names an estimate we hold and update ( $V_k$ ).

| Symbol | Meaning |
|---|---|
| $\statespace,\ \actionspace$ | finite state and action sets; $\lvert\statespace\rvert, \lvert\actionspace\rvert < \infty$ |
| $\transition(s',r \mid s,a)$ | dynamics: joint probability of next state $s'$ and reward $r$ |
| $\policy(a \mid s)$ | policy: probability of action $a$ in state $s$ |
| $\discount \in [0,1)$ | discount factor (strict inequality is load-bearing) |
| $\valuefn_\policy,\ \qfn_\policy$ | state- and action-value functions of $\policy$ (true) |
| $\optvaluefn,\ \optqfn$ | optimal value functions |
| $V_k \in \R^{\statespace}$ | the $k$ -th value iterate (an estimate of $\optvaluefn$ ) |
| $\bellman^\policy,\ \bellmanopt$ | Bellman expectation / optimality operators |
| $\norm{\cdot}_\infty$ | sup norm on $\R^{\statespace}$ : $\norm{v}_\infty \defeq \max_{s} \lvert v(s)\rvert$ |

A value function on a finite state set is a vector in $\R^{\statespace}$ , a point in $\lvert\statespace\rvert$ -dimensional space.

The operators below move that point around, and “solving the MDP” means finding the one point they hold still.

The finite Markov decision process

A Markov decision process is a controlled Markov chain: at each step the agent sees a state, picks an action, and the environment returns a reward and a next state, with the Markov property that the future depends on the past only through the current state.

Definition 1.1 (Finite MDP).

A finite Markov decision process is a tuple $(\statespace, \actionspace, \transition, \discount)$ with $\statespace, \actionspace$ finite, discount $\discount \in [0,1)$ , and dynamics

\transition(s',r \mid s,a) \defeq \Prob\!\big(S_{t+1}=s',\, R_{t+1}=r \,\big|\, S_t=s,\, A_t=a\big),

a probability distribution over $(s',r)$ for each $(s,a)$ . We write the state-transition kernel and expected reward as the marginals

\transition(s' \mid s,a) \defeq \sum_{r} \transition(s',r \mid s,a), \qquad \reward(s,a) \defeq \sum_{s',r} \transition(s',r \mid s,a)\, r .

The Markov property is the modelling commitment that earns us everything that follows: because the next state and reward depend only on $(S_t, A_t)$ , a value function needs only one argument — the current state — and the recursion below closes on itself.
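As a concrete reading of Definition 1.1, the sketch below computes the two marginals for the two-state MDP of Exercise 3. The dictionary encoding of the joint dynamics, the helper name `marginals`, and the array layout `P[s, a, s']`, `R[s, a]` are illustrative assumptions, not the companion's API.

```python
import numpy as np

# Hypothetical encoding of the joint dynamics p(s', r | s, a):
# dynamics[(s, a)] is a list of (next_state, reward, probability) triples.
A, B = 0, 1            # states
STAY, SWITCH = 0, 1    # actions
dynamics = {
    (A, STAY):   [(A, 0.0, 1.0)],
    (A, SWITCH): [(B, 0.0, 1.0)],
    (B, STAY):   [(B, 1.0, 1.0)],
    (B, SWITCH): [(A, 0.0, 1.0)],
}


def marginals(dynamics, n_states, n_actions):
    """Return P[s, a, s'] = p(s' | s, a) and R[s, a] = r(s, a)."""
    P = np.zeros((n_states, n_actions, n_states))
    R = np.zeros((n_states, n_actions))
    for (s, a), outcomes in dynamics.items():
        for s_next, r, prob in outcomes:
            P[s, a, s_next] += prob   # marginalize the joint over r
            R[s, a] += prob * r       # expected reward: sum over (s', r)
    return P, R


P, R = marginals(dynamics, n_states=2, n_actions=2)
```

Each row `P[s, a, :]` sums to one, which is the only property of the kernel the contraction proof later relies on.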

The canonical reference for the finite-MDP formalism and its solution methods is Puterman (1994); the founding treatment of the underlying optimization principle is Bellman (1957).

Definition 1.2 (Policy and discounted return).

A (stationary) policy $\policy(a\mid s)$ is a distribution over actions for each state. Acting under $\policy$ produces a trajectory $S_0, A_0, R_1, S_1, A_1, R_2, \dots$ , and the discounted return from time $t$ is

G_t \defeq \sum_{k=0}^{\infty} \discount^{k} R_{t+k+1}.

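Because rewards are bounded on a finite MDP and $\discount<1$ , the series converges absolutely, with $\lvert G_t\rvert \le R_{\max}/(1-\discount)$ where $R_{\max} \defeq \max\{\lvert r\rvert : \transition(s',r\mid s,a) > 0\}$ is the largest reward magnitude that occurs with positive probability. The bound needs the realized rewards, not just their means, to be bounded; the expected rewards then inherit it, $\lvert\reward(s,a)\rvert \le R_{\max}$ .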
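A small numerical check of the bound, as a sketch: the helper `discounted_return` and the constant-reward sequence are made up for illustration. The helper accumulates backwards with $G_t = R_{t+1} + \discount\,G_{t+1}$, the same one-step identity the Bellman equation is built on.

```python
def discounted_return(rewards, gamma):
    """Sum_k gamma**k * rewards[k], accumulated backwards via G = r + gamma * G."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


gamma = 0.9
rewards = [1.0] * 200             # reward 1 at every step, truncated at 200 steps
g = discounted_return(rewards, gamma)
bound = 1.0 / (1.0 - gamma)       # R_max / (1 - gamma) with R_max = 1
```

With a constant reward of 1 the truncated return sits just under the bound $1/(1-\discount) = 10$, and the gap is the discarded geometric tail.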

Value functions

A value function answers one question: starting here and acting under $\policy$ , how much discounted reward do I expect? The state-value fixes the starting state; the action-value also fixes the first action.

Definition 1.3 (State- and action-value functions).

For a policy $\policy$ , the state-value and action-value functions are

\valuefn_\policy(s) \defeq \E_\policy\!\big[\,G_t \,\big|\, S_t = s\,\big], \qquad \qfn_\policy(s,a) \defeq \E_\policy\!\big[\,G_t \,\big|\, S_t = s,\, A_t = a\,\big],

where $\E_\policy$ averages over trajectories generated by $\policy$ and $\transition$ . The two are linked by $\valuefn_\policy(s) = \sum_a \policy(a\mid s)\,\qfn_\policy(s,a)$ .

The Bellman expectation equation

Everything recursive about value functions comes from one move: split the return into the immediate reward and the discounted return from the next state, then condition on what happens in one step. That conditioning is the law of total expectation, the single most-used tool in RL theory.

Theorem 1.1 (Bellman expectation equation).

For every policy $\policy$ and state $s$ ,

\valuefn_\policy(s) = \sum_{a} \policy(a\mid s) \sum_{s',r} \transition(s',r\mid s,a) \big[\, r + \discount\, \valuefn_\policy(s') \,\big].

Proof.

Start from the definition and peel off one step of the return, $G_t = R_{t+1} + \discount\, G_{t+1}$ :

\begin{aligned} \valuefn_\policy(s) &= \E_\policy[\,R_{t+1} + \discount\, G_{t+1} \mid S_t = s\,] && \text{(Def. 1.3; unroll }G_t\text{)} \\ &= \sum_{a} \policy(a\mid s) \sum_{s',r} \transition(s',r\mid s,a)\, \E_\policy[\,r + \discount\, G_{t+1} \mid S_{t+1}=s'\,] && \text{(law of total expectation, condition on }A_t, S_{t+1}, R_{t+1}\text{)} \\ &= \sum_{a} \policy(a\mid s) \sum_{s',r} \transition(s',r\mid s,a)\, \big[\, r + \discount\, \E_\policy[\,G_{t+1}\mid S_{t+1}=s'\,] \,\big] && \text{(linearity; }r\text{ is fixed once we condition)} \\ &= \sum_{a} \policy(a\mid s) \sum_{s',r} \transition(s',r\mid s,a)\, \big[\, r + \discount\, \valuefn_\policy(s') \,\big] && \text{(Markov property: }\E_\policy[G_{t+1}\mid S_{t+1}=s']=\valuefn_\policy(s')\text{).} \end{aligned}

The Markov property is what licenses the last line: the expected return from $t+1$ onward depends on the past only through $S_{t+1}=s'$ . $\qquad\blacksquare$

Read as a system of $\lvert\statespace\rvert$ linear equations in the unknowns $\{\valuefn_\policy(s)\}$ , the Bellman expectation equation defines $\valuefn_\policy$ . Collecting it into a single map on $\R^{\statespace}$ gives the operator we will iterate.

Definition 1.4 (Bellman expectation operator).

The Bellman expectation operator $\bellman^\policy : \R^{\statespace} \to \R^{\statespace}$ acts on a value vector $v$ by

(\bellman^\policy v)(s) \defeq \sum_{a} \policy(a\mid s) \sum_{s',r} \transition(s',r\mid s,a)\,\big[\, r + \discount\, v(s') \,\big] = \reward_\policy(s) + \discount \sum_{s'} \transition_\policy(s'\mid s)\, v(s'),

where $\reward_\policy(s) \defeq \sum_a \policy(a\mid s)\reward(s,a)$ and $\transition_\policy(s'\mid s) \defeq \sum_a \policy(a\mid s)\transition(s'\mid s,a)$ . By Theorem 1.1, $\valuefn_\policy$ is a fixed point: $\bellman^\policy \valuefn_\policy = \valuefn_\policy$ .

In vector form $\bellman^\policy v = \reward_\policy + \discount P_\policy v$ with $P_\policy$ the $\lvert\statespace\rvert\times\lvert\statespace\rvert$ transition matrix — an affine map, so its fixed point solves the linear system $(I - \discount P_\policy)\valuefn_\policy = \reward_\policy$ . We will use that closed form for policy evaluation below.
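The linear solve can be sketched in a few lines of NumPy. The function name `evaluate_policy` and the array layout (`P[s, a, s']`, `R[s, a]`, `pi[s, a]`) are assumptions made for this example and need not match the companion's `policy_evaluation`; the MDP is the two-state example of Exercise 3 under the always-stay policy.

```python
import numpy as np


def evaluate_policy(P, R, gamma, pi):
    """Solve (I - gamma * P_pi) v = r_pi for a stochastic policy pi[s, a]."""
    P_pi = np.einsum("sa,sat->st", pi, P)   # P_pi[s, s'] = sum_a pi(a|s) p(s'|s,a)
    r_pi = np.sum(pi * R, axis=1)           # r_pi[s]     = sum_a pi(a|s) r(s,a)
    n_states = P.shape[0]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return v, P_pi, r_pi


# Two-state MDP of Exercise 3: states (A, B), actions (stay, switch).
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0               # A: stay -> A, switch -> B
P[1, 0, 1] = P[1, 1, 0] = 1.0               # B: stay -> B, switch -> A
R = np.array([[0.0, 0.0], [1.0, 0.0]])      # only (B, stay) pays reward 1
gamma = 0.9

pi_stay = np.array([[1.0, 0.0], [1.0, 0.0]])
v, P_pi, r_pi = evaluate_policy(P, R, gamma, pi_stay)

# Fixed-point check: applying the expectation operator leaves v unchanged.
T_pi_v = r_pi + gamma * P_pi @ v
# Link from Definition 1.3: v(s) = sum_a pi(a|s) q(s, a).
q = R + gamma * P @ v
```

Here `v` should come out as approximately `(0, 10)`, matching the hand computation in Exercise 3.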

Optimality: the Bellman optimality equation

So far $\policy$ was fixed. Optimal control asks for the best achievable value, and — remarkably — a single policy attains it simultaneously in every state.

Definition 1.5 (Optimal value functions and optimal policy).

The optimal value functions are

\optvaluefn(s) \defeq \max_{\policy} \valuefn_\policy(s), \qquad \optqfn(s,a) \defeq \max_{\policy} \qfn_\policy(s,a),

the maxima taken pointwise over all policies. A policy $\policy_*$ is optimal if $\valuefn_{\policy_*} = \optvaluefn$ .

Theorem 1.2 (Bellman optimality equation).

The optimal state-value function satisfies, for every $s$ ,

\optvaluefn(s) = \max_{a \in \actionspace} \Big[\, \reward(s,a) + \discount \sum_{s'} \transition(s'\mid s,a)\, \optvaluefn(s') \,\Big] = \max_{a} \optqfn(s,a),

and a policy is optimal iff it is greedy with respect to $\optvaluefn$ — i.e. in each state it puts all mass on actions attaining the maximum.

Proof.

We argue from Bellman’s principle of optimality (Bellman, 1957), assuming for now that $\optvaluefn$ is well defined. The next section discharges that assumption: Theorem 1.3 and Corollary 1.1 establish existence and uniqueness without using this equation, so the reasoning is not circular. The principle says that an optimal trajectory’s tail is itself optimal from the state it reaches. Fix $s$ . Any policy chooses a first action (or a distribution over them) and then continues; its value cannot exceed that of taking the best first action $a$ and continuing optimally, which is exactly $\max_a [\reward(s,a) + \discount \sum_{s'} \transition(s'\mid s,a)\,\optvaluefn(s')]$ . So $\optvaluefn(s)$ is at most the right-hand side. Conversely, the policy that takes the maximizing $a$ and then follows an optimal policy achieves that value, so $\optvaluefn(s)$ is at least the right-hand side. Equality gives the optimality equation, and a policy attains this common value in every state exactly when it is greedy for $\optvaluefn$ . $\qquad\blacksquare$

The $\max$ over actions makes this equation nonlinear — there is no matrix inverse to solve it.

That nonlinearity is the entire reason we iterate rather than solve. Bertsekas (2017) develops the resulting theory abstractly as fixed-point iteration of monotone contraction operators; we specialize it to finite MDPs.

Definition 1.6 (Bellman optimality operator).

The Bellman optimality operator $\bellmanopt : \R^{\statespace} \to \R^{\statespace}$ is

(\bellmanopt v)(s) \defeq \max_{a \in \actionspace} \Big[\, \reward(s,a) + \discount \sum_{s'} \transition(s'\mid s,a)\, v(s') \,\Big].

Theorem 1.2 says $\optvaluefn$ is a fixed point: $\bellmanopt \optvaluefn = \optvaluefn$ .
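Definition 1.6 translates almost symbol for symbol into NumPy. The sketch below uses a hypothetical function name `bellman_optimality` and the assumed layout `P[s, a, s']`, `R[s, a]`; it checks the fixed-point identity on the two-state MDP of Exercise 3, whose analytic optimum is $\optvaluefn = (9, 10)$.

```python
import numpy as np


def bellman_optimality(v, P, R, gamma):
    """(T* v)(s) = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) v(s') ]."""
    q = R + gamma * P @ v     # q[s, a]; P @ v sums over the trailing s' axis
    return q.max(axis=1)


# Two-state MDP of Exercise 3: states (A, B), actions (stay, switch).
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 0.0], [1.0, 0.0]])
gamma = 0.9

v_star = np.array([9.0, 10.0])    # analytic optimum
v_stay = np.array([0.0, 10.0])    # value of the always-stay policy

applied_to_optimum = bellman_optimality(v_star, P, R, gamma)
applied_to_stay = bellman_optimality(v_stay, P, R, gamma)
```

The optimum is left unchanged by the operator, while the always-stay value is not: one application already lifts state $A$ from 0 to 9, because switching to $B$ is now visible to the $\max$.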

The contraction that makes it all work

Both operators share one property, and it is the technical heart of the chapter. Recall a map $T$ is a $\discount$ -contraction in a norm if $\norm{Tu - Tv} \le \discount\,\norm{u-v}$ for all $u,v$ .

Theorem 1.3 (Bellman operators are sup-norm contractions).

For any $u, v \in \R^{\statespace}$ ,

\norm{\bellmanopt u - \bellmanopt v}_\infty \le \discount\,\norm{u - v}_\infty, \qquad \norm{\bellman^\policy u - \bellman^\policy v}_\infty \le \discount\,\norm{u - v}_\infty .

Both operators are $\discount$ -contractions on $(\R^{\statespace}, \norm{\cdot}_\infty)$ .

Proof.

We prove it for $\bellmanopt$ ; the $\bellman^\policy$ case is Exercise 1 (it is easier, since there is no $\max$ ). The bound takes five short steps, each labelled with the rule it uses. Fix a state $s$ and write $Q_w(s,a) \defeq \reward(s,a) + \discount \sum_{s'} \transition(s'\mid s,a)\,w(s')$ , so $(\bellmanopt w)(s) = \max_a Q_w(s,a)$ . The one inequality we need is that the $\max$ is nonexpansive: $\lvert \max_a f(a) - \max_a g(a)\rvert \le \max_a \lvert f(a)-g(a)\rvert$ . Then

\begin{aligned} \big\lvert (\bellmanopt u)(s) - (\bellmanopt v)(s) \big\rvert &= \big\lvert \max_a Q_u(s,a) - \max_a Q_v(s,a) \big\rvert && \text{(definition)} \\ &\le \max_a \big\lvert Q_u(s,a) - Q_v(s,a) \big\rvert && \text{($\max$ is nonexpansive)} \\ &= \max_a\ \discount \Big\lvert \textstyle\sum_{s'} \transition(s'\mid s,a)\,\big(u(s') - v(s')\big) \Big\rvert && \text{($\reward(s,a)$ cancels)} \\ &\le \discount \max_a \sum_{s'} \transition(s'\mid s,a)\, \big\lvert u(s') - v(s') \big\rvert && \text{(triangle inequality)} \\ &\le \discount \max_a \sum_{s'} \transition(s'\mid s,a)\, \norm{u - v}_\infty = \discount\, \norm{u - v}_\infty && \text{($\textstyle\sum_{s'} \transition(s'\mid s,a) = 1$).} \end{aligned}

The bound is uniform in $s$ , so taking the max over $s$ on the left gives $\norm{\bellmanopt u - \bellmanopt v}_\infty \le \discount\norm{u-v}_\infty$ . $\qquad\blacksquare$

The discount $\discount < 1$ is doing all the work: it is literally the contraction modulus. With $\discount = 1$ the bound only says the operators are nonexpansive, which is not enough for the Banach fixed-point theorem, and the fixed-point theory needs extra structure (proper policies, average cost), the subject of later asides.
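Theorem 1.3 is easy to probe numerically. The following sketch draws a random finite MDP (sizes, seed, and the helper name `bellman_optimality` are arbitrary choices for illustration) and records the worst observed ratio $\norm{\bellmanopt u - \bellmanopt v}_\infty / \norm{u - v}_\infty$ over random pairs; a passing check is evidence for the theorem, not a substitute for the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random dynamics: normalize so each P[s, a, :] is a probability vector.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(n_states, n_actions))


def bellman_optimality(v):
    """One application of T* on the random MDP above."""
    return (R + gamma * P @ v).max(axis=1)


worst_ratio = 0.0
for _ in range(1000):
    u = rng.normal(size=n_states)
    v = rng.normal(size=n_states)
    lhs = np.max(np.abs(bellman_optimality(u) - bellman_optimality(v)))
    rhs = np.max(np.abs(u - v))
    worst_ratio = max(worst_ratio, lhs / rhs)
```

The worst ratio should never exceed `gamma`, and on a dense random kernel it is usually well below it, since the averaging over $s'$ rarely aligns with the worst-case coordinate.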

Corollary 1.1 (Unique value functions and geometric convergence).

Because $(\R^{\statespace}, \norm{\cdot}_\infty)$ is complete and both operators are $\discount$ -contractions, the Banach fixed-point theorem gives:

1. $\bellman^\policy$ has a unique fixed point, necessarily $\valuefn_\policy$ ; and $\bellmanopt$ has a unique fixed point, necessarily $\optvaluefn$ . In particular $\optvaluefn$ exists and is unique, and an optimal stationary policy exists: any greedy policy for $\optvaluefn$ is one (the policy-improvement theorem below confirms such a policy attains $\optvaluefn$ ).
2. The iterates $V_{k+1} \defeq \bellmanopt V_k$ converge to $\optvaluefn$ from any start $V_0$ , geometrically: $\norm{V_k - \optvaluefn}_\infty \le \discount^{k}\, \norm{V_0 - \optvaluefn}_\infty .$

This corollary is the payoff. It converts an existence question (“is there a best policy?”) into a convergent algorithm (“iterate the operator”), and it tells us the error shrinks by a factor $\discount$ every sweep.
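To see the geometric rate directly, this sketch iterates the optimality operator on the two-state MDP of Exercise 3 from $V_0 = 0$ and compares the error after each sweep with $\discount^{k}\norm{V_0 - \optvaluefn}_\infty$, using the analytic $\optvaluefn = (9, 10)$. The array layout and variable names are assumptions for illustration.

```python
import numpy as np

# Two-state MDP of Exercise 3: states (A, B), actions (stay, switch).
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 0.0], [1.0, 0.0]])
gamma = 0.9
v_star = np.array([9.0, 10.0])

V = np.zeros(2)
initial_error = np.max(np.abs(V - v_star))
errors, bounds = [], []
for k in range(1, 51):
    V = (R + gamma * P @ V).max(axis=1)            # V_k = T* V_{k-1}
    errors.append(np.max(np.abs(V - v_star)))      # actual sup-norm error
    bounds.append(gamma**k * initial_error)        # Corollary 1.1 bound
```

On this particular MDP the bound is tight: the error equals the bound at every sweep up to rounding, so any comparison needs a small floating-point tolerance.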

Value iteration

Value iteration is now nothing more than iterating $\bellmanopt$ toward its fixed point.

V ← 0  (any initial vector in ℝ^𝒮)
repeat:
    V ← 𝒯* V            # one sweep of the optimality operator
until ‖ΔV‖∞ < ε(1−γ)/γ   # stopping rule, justified below
return V, and the greedy policy π(s) = argmax_a [ r(s,a) + γ Σ p(s'|s,a) V(s') ]

Corollary 1.1 guarantees convergence. The stopping rule deserves a word: a small sweep-to-sweep change $\norm{V_{k} - V_{k-1}}_\infty$ (the Bellman residual of $V_{k-1}$ ) bounds the true error of $V_k$ , because

\norm{V_{k} - \optvaluefn}_\infty \le \frac{\discount}{1-\discount}\,\norm{V_{k} - V_{k-1}}_\infty .

This telescopes from the contraction: write $\optvaluefn - V_k = \sum_{j\ge k}(V_{j+1} - V_j)$ and bound each term geometrically (Exercise 5). So stopping when $\norm{\Delta V}_\infty < \varepsilon(1-\discount)/\discount$ certifies $\norm{V - \optvaluefn}_\infty < \varepsilon$ , a guarantee we can check at runtime without knowing $\optvaluefn$ . The companion experiments/python/week01/dp.py implements exactly this loop with the explicit operator as a standalone function, as the roadmap’s Week-1 task asks.
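For readers who want the loop in executable form, here is a minimal NumPy sketch of value iteration with the certified stopping rule. It is not the companion's code: the name `value_iteration_sketch`, its parameters, the `max_sweeps` guard, and the layout `P[s, a, s']`, `R[s, a]` are all assumptions, and it requires $0 < \discount < 1$ so the threshold is finite.

```python
import numpy as np


def value_iteration_sketch(P, R, gamma, eps=1e-6, max_sweeps=100_000):
    """Iterate T* until the sweep-to-sweep change certifies an eps-accurate V."""
    threshold = eps * (1.0 - gamma) / gamma      # assumes 0 < gamma < 1
    V = np.zeros(P.shape[0])
    for _ in range(max_sweeps):
        V_new = (R + gamma * P @ V).max(axis=1)  # one sweep of T*
        delta = np.max(np.abs(V_new - V))        # sup-norm change
        V = V_new
        if delta < threshold:
            break
    greedy = (R + gamma * P @ V).argmax(axis=1)  # greedy policy for the final V
    return V, greedy


# Two-state MDP of Exercise 3: states (A, B), actions (stay, switch).
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 0.0], [1.0, 0.0]])

eps = 1e-6
V, greedy = value_iteration_sketch(P, R, gamma=0.9, eps=eps)
error = np.max(np.abs(V - np.array([9.0, 10.0])))
```

On this MDP the greedy policy switches in $A$ and stays in $B$, and the measured error lands below `eps` as the bound promises.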

Policy iteration

Value iteration improves the value every sweep and reads off a policy at the end. Policy iteration instead alternates two exact steps on the policy; it is Howard’s original algorithm (Howard, 1960):

Policy evaluation. Given $\policy$ , solve the linear system $(I - \discount P_\policy)\,\valuefn_\policy = \reward_\policy$ for $\valuefn_\policy$ (the affine fixed point of $\bellman^\policy$ — Def. 1.4).
Policy improvement. Set $\policy'(s) = \argmax_a \qfn_\policy(s,a)$ , greedy w.r.t. the freshly evaluated values.

Repeat until the policy stops changing. The step that makes this work is:

Theorem 1.4 (Policy improvement theorem).

Let $\policy'$ be greedy with respect to $\valuefn_\policy$ , so that $\qfn_\policy(s, \policy'(s)) \ge \valuefn_\policy(s)$ for all $s$ . Then $\valuefn_{\policy'}(s) \ge \valuefn_\policy(s)$ for all $s$ , with strict improvement in some state unless $\policy$ is already optimal.

Proof.

Expand the greedy inequality and re-apply it along the trajectory:

\begin{aligned} \valuefn_\policy(s) &\le \qfn_\policy(s, \policy'(s)) && \text{(greedy choice)} \\ &= \E_{\policy'}\!\big[\, R_{t+1} + \discount\, \valuefn_\policy(S_{t+1}) \mid S_t = s \,\big] && \text{(definition of }\qfn_\policy\text{ under the first action }\policy'\text{)} \\ &\le \E_{\policy'}\!\big[\, R_{t+1} + \discount\, \qfn_\policy(S_{t+1}, \policy'(S_{t+1})) \mid S_t = s \,\big] && \text{(apply the greedy inequality at }S_{t+1}\text{)} \\ &\le \cdots \le \E_{\policy'}\!\Big[\, \textstyle\sum_{k\ge 0} \discount^{k} R_{t+k+1} \mid S_t = s \,\Big] = \valuefn_{\policy'}(s) && \text{(unroll; the tail telescopes to }\valuefn_{\policy'}\text{).} \end{aligned}

If no state improves strictly, the greedy inequality holds with equality everywhere, which is the Bellman optimality equation — so $\policy$ is already optimal. $\qquad\blacksquare$

Because the MDP is finite, there are finitely many deterministic policies; each iteration strictly improves the policy until none does, so policy iteration terminates at an exact optimum in finitely many steps. Value iteration and policy iteration are the two poles of what Sutton & Barto (2018) call generalized policy iteration: any interleaving of “make the value consistent with the policy” and “make the policy greedy for the value” converges to the same fixed point. Week 2 studies what happens between the poles: asynchronous, Gauss–Seidel, and prioritized sweeps.
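The two alternating steps can be sketched as follows for deterministic policies. The function name `policy_iteration_sketch`, the `max_iters` guard, and the array layout are assumptions for this example rather than the companion's interface; ties in the improvement step are broken toward the lowest action index by `argmax`.

```python
import numpy as np


def policy_iteration_sketch(P, R, gamma, max_iters=1000):
    """Alternate exact evaluation and greedy improvement until the policy is stable."""
    n_states = P.shape[0]
    states = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)       # start with action 0 everywhere
    for _ in range(max_iters):
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi.
        P_pi = P[states, policy]                 # shape (S, S'): row s uses action policy[s]
        r_pi = R[states, policy]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to q_pi.
        q = R + gamma * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break                                # greedy policy unchanged: optimal
        policy = new_policy
    return v, policy


# Two-state MDP of Exercise 3: states (A, B), actions (stay, switch).
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = 1.0
P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 0.0], [1.0, 0.0]])

v, policy = policy_iteration_sketch(P, R, gamma=0.9)
```

Starting from always-stay, the first evaluation gives $(0, 10)$, the improvement step switches in $A$, and the second evaluation gives $(9, 10)$ with a stable policy, so the loop stops after two evaluations.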

The dynamic-programming bridge

The roadmap’s Week-1 writing prompt is to place the Bellman equation beside its continuous cousin. They are the same principle of optimality at different limits. Discretize time and state and the cost-to-go obeys the discrete Bellman recursion above; take the limit of vanishing time-step on a continuous state and the same recursion becomes the Hamilton–Jacobi–Bellman equation. Control flips the convention from maximizing reward to minimizing a running cost $\ell = -\reward$ , with cost-to-go $J$ ; the stationary discounted HJB equation is

\rho\, J(x) = \min_{a}\Big[\, \ell(x,a) + \nabla J(x)^{\!\top} f(x,a) \,\Big], \qquad \rho = -\ln\discount,

whose $\rho J$ term is the continuous-time echo of the discount $\discount$ that drove the contraction above (set $\rho = 0$ for the undiscounted limit). Value iteration is the fixed-point solver for the discrete equation; the LQR Riccati recursion (Ch. 13) is the closed-form solver for the HJB equation when $f$ is linear and $\ell$ is quadratic. Holding this identity in view is the point of the whole curriculum: control theory and RL are two dialects for the same fixed point.
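As a discrete-time illustration of the same bridge (a sketch, not the Chapter 13 treatment), take a scalar linear system $x' = a x + b u$ with stage cost $q x^2 + r u^2$ and discount $\discount$. With the quadratic guess $J(x) = p x^2$, minimizing the Bellman right-hand side over $u$ in closed form turns value iteration into a scalar Riccati recursion on $p$. All numbers and names below are made up for the example.

```python
def riccati_step(p, a, b, q, r, gamma):
    """One Bellman backup on J(x) = p * x**2, minimized over u in closed form."""
    return q + gamma * a * a * p - (gamma * a * b * p) ** 2 / (r + gamma * b * b * p)


# Hypothetical scalar system and cost weights.
a, b, q, r, gamma = 1.0, 1.0, 1.0, 1.0, 0.9

p = 0.0
for _ in range(200):                 # value iteration on the single coefficient p
    p = riccati_step(p, a, b, q, r, gamma)

# Sanity check at x = 1: brute-force the minimization over a grid of controls.
x = 1.0
u_grid = [i / 1000.0 for i in range(-2000, 2001)]
brute_force = min(q * x * x + r * u * u + gamma * p * (a * x + b * u) ** 2
                  for u in u_grid)
```

The converged `p` satisfies `riccati_step(p, ...) == p` up to rounding, and the grid minimum agrees with it to within the grid resolution: the fixed point of the Bellman operator, computed here through a one-parameter recursion rather than a sweep over states.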

What’s next

Week 2 opens up the iteration itself: asynchronous and Gauss–Seidel value iteration, prioritized sweeping, real-time DP, and residual scheduling — the practical face of “iterate the contraction.”
Week 3 replaces the known model $\transition$ with sampled returns (Monte Carlo), the first step away from dynamic programming toward learning.

Exercises

1. (Prove) Show that the Bellman expectation operator $\bellman^\policy$ is a $\discount$ -contraction in $\norm{\cdot}_\infty$ . (This is the no- $\max$ case of Theorem 1.3.)

Solution
For any $s$ , $(\bellman^\policy u)(s) - (\bellman^\policy v)(s) = \discount \sum_{s'} \transition_\policy(s'\mid s)\,(u(s') - v(s'))$ since $\reward_\policy(s)$ cancels. Taking absolute values and using the triangle inequality with $\sum_{s'}\transition_\policy(s'\mid s)=1$ , $\lvert(\bellman^\policy u)(s) - (\bellman^\policy v)(s)\rvert \le \discount \sum_{s'}\transition_\policy(s'\mid s)\norm{u-v}_\infty = \discount\norm{u-v}_\infty$ . Maximizing over $s$ gives the claim. (No $\max$ -nonexpansiveness step is needed, because $\bellman^\policy$ is affine.)
2. (Derive) State and prove the Bellman expectation equation for the action-value $\qfn_\policy(s,a)$ .

Solution
Conditioning on the first transition and then the next action,
$\qfn_\policy(s,a) = \sum_{s',r}\transition(s',r\mid s,a)\Big[\,r + \discount \sum_{a'}\policy(a'\mid s')\,\qfn_\policy(s',a')\,\Big].$
The derivation is identical to Theorem 1.1 with the first action held fixed at $a$ and the recursion closing on $\qfn_\policy$ via $\valuefn_\policy(s') = \sum_{a'}\policy(a'\mid s')\qfn_\policy(s',a')$ .
3. (Compute) Take the two-state MDP $\statespace=\{A,B\}$ , actions $\{\text{stay},\text{switch}\}$ , $\discount=0.9$ , with deterministic dynamics: in $A$ , stay $\to A$ ( $r=0$ ) and switch $\to B$ ( $r=0$ ); in $B$ , stay $\to B$ ( $r=1$ ) and switch $\to A$ ( $r=0$ ). Evaluate the always-stay policy by solving $(I-\discount P_\policy)\valuefn_\policy = \reward_\policy$ .

Solution
Under always-stay, $P_\policy = I$ and $\reward_\policy = (0, 1)$ , so $(I - 0.9 I)\valuefn_\policy = (0,1)$ gives $0.1\,\valuefn_\policy = (0,1)$ , i.e. $\valuefn_\policy = (0, 10)$ . State $B$ self-loops collecting $1$ forever ( $1/(1-\discount)=10$ ); state $A$ self-loops collecting $0$ . This is the exact case the companion test asserts.
4. (Prove) Derive the geometric error bound of Corollary 1.1(2), $\norm{V_k - \optvaluefn}_\infty \le \discount^{k}\norm{V_0 - \optvaluefn}_\infty$ , from the contraction property and $\bellmanopt\optvaluefn=\optvaluefn$ .

Solution
$\norm{V_k - \optvaluefn}_\infty = \norm{\bellmanopt V_{k-1} - \bellmanopt \optvaluefn}_\infty \le \discount\norm{V_{k-1}-\optvaluefn}_\infty$ by Theorem 1.3 and the fixed-point identity. Iterating the inequality $k$ times gives the bound.
5. (Implement) Add the certified stopping rule to value iteration: stop when $\norm{\Delta V}_\infty < \varepsilon(1-\discount)/\discount$ and verify empirically that the returned $V$ satisfies $\norm{V-\optvaluefn}_\infty < \varepsilon$ on the two-state MDP (using the analytic $\optvaluefn=(9,10)$ ). Run against experiments/python/week01/dp.py.

Solution
The bound follows by writing $\optvaluefn - V_k = \sum_{j\ge k}(V_{j+1}-V_j)$ and bounding each term by the contraction: $\norm{V_{j+1}-V_j}_\infty \le \discount^{\,j-k+1}\norm{V_{k}-V_{k-1}}_\infty$ , a geometric series summing to $\frac{\discount}{1-\discount}\norm{\Delta V}_\infty$ with $\Delta V = V_k - V_{k-1}$ . The companion’s value_iteration(..., tol=ε) implements this and its test asserts the resulting error is below $\varepsilon$ .
6. (Extend) Show the optimal value function is the solution of the linear program $\min_{v}\sum_s v(s)$ subject to $v(s) \ge \reward(s,a) + \discount \sum_{s'}\transition(s'\mid s,a)v(s')$ for all $s,a$ .

Solution
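Feasibility means $v \ge \bellmanopt v$ pointwise. Because $\bellmanopt$ is monotone, applying it repeatedly gives $v \ge \bellmanopt v \ge (\bellmanopt)^2 v \ge \cdots$ , and these iterates converge to $\optvaluefn$ by Corollary 1.1, so $v \ge \optvaluefn$ . Since $\optvaluefn$ is itself feasible (it satisfies $\optvaluefn = \bellmanopt\optvaluefn$ ), it is the smallest feasible point and minimizing $\sum_s v(s)$ selects it. The constraints linearize the $\max$ (one inequality per action), turning the nonlinear optimality equation into an LP, the basis of the dual/occupancy-measure view revisited with safe RL in Week 25.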

Companion code

The Week-1 companion lives at experiments/python/week01/ (the repo’s three-language convention). It is pure NumPy at the core — the algorithms and the correctness test carry no environment dependency — with an optional Gymnasium adapter for the canonical FrozenLake showcase.

dp.py — the Bellman optimality operator as an explicit function, plus value_iteration, policy_evaluation (exact linear solve), policy_improvement, and policy_iteration, all on a generic finite MDP (P, R, gamma).
test_dp.py — mathematical-correctness tests: convergence to the analytic $\optvaluefn=(9,10)$ on the two-state MDP of Exercise 3, the contraction inequality of Theorem 1.3, the fixed-point identity, the geometric bound, and VI/PI agreement (FrozenLake checks run only when gymnasium is installed).
frozenlake.py — builds (P, R, gamma) from gymnasium’s FrozenLake-v1 and runs VI and PI as a worked showcase.

# core algorithms + correctness tests (no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week01/test_dp.py -q

# canonical FrozenLake showcase (optional extra: pip install "gymnasium")
PYTHONPATH=. python experiments/python/week01/frozenlake.py