Part I · Foundations Week 1 Published dp.py test_dp.py

Markov Decision Processes and Dynamic Programming

The finite MDP, the Bellman expectation and optimality equations, and the gamma-contraction that makes value iteration and policy iteration converge — dynamic programming as the spine the rest of the curriculum returns to.

On this page
  1. Notation and conventions
  2. The finite Markov decision process
  3. Value functions
  4. The Bellman expectation equation
  5. Optimality: the Bellman optimality equation
  6. The contraction that makes it all work
  7. Value iteration
  8. Policy iteration
  9. The dynamic-programming bridge
  10. What’s next
  11. Exercises
  12. Companion code

Markov Decision Processes and Dynamic Programming

Where we are. This is the first chapter, and it states the object the whole curriculum orbits: the finite Markov decision process and the two equations that pin down optimal behaviour in it. The claim is small and load-bearing — the optimal value function is the unique fixed point of a γ\discount-contraction, and value iteration and policy iteration are two ways of reaching that fixed point. Everything later — LQR, MPC, model-based and model-free RL — computes, approximates, or learns this same fixed point when the MDP is too large, too continuous, or unknown.

Notation and conventions

We use the case convention of Sutton & Barto Sutton & Barto (2018) throughout: lowercase names a true object we want (vπ\valuefn_\policy), and uppercase names an estimate we hold and update (VkV_k).

| Symbol | Meaning | |---|---| | S, A\statespace,\ \actionspace | finite state and action sets; S,A<\lvert\statespace\rvert, \lvert\actionspace\rvert < \infty | | p(s,rs,a)\transition(s',r \mid s,a) | dynamics — joint probability of next state ss' and reward rr | | π(as)\policy(a \mid s) | policy — probability of action aa in state ss | | γ[0,1)\discount \in [0,1) | discount factor (strict inequality is load-bearing) | | vπ, qπ\valuefn_\policy,\ \qfn_\policy | state- and action-value functions of π\policy (true) | | v, q\optvaluefn,\ \optqfn | optimal value functions | | VkRSV_k \in \R^{\statespace} | the kk-th value iterate (an estimate of v\optvaluefn) | | Tπ, T\bellman^\policy,\ \bellmanopt | Bellman expectation / optimality operators | | \norm{\cdot}_\infty | sup norm on RS\R^{\statespace}: vmaxsv(s)\norm{v}_\infty \defeq \max_{s} \lvert v(s)\rvert |

A value function on a finite state set is a vector in RS\R^{\statespace}, a point in S\lvert\statespace\rvert-dimensional space.

The operators below move that point around, and “solving the MDP” means finding the one point they hold still.

The finite Markov decision process

A Markov decision process is a controlled Markov chain: at each step the agent sees a state, picks an action, and the environment returns a reward and a next state, with the Markov property that the future depends on the past only through the current state.

Definition 1.1 (Finite MDP).

A finite Markov decision process is a tuple (S,A,p,γ)(\statespace, \actionspace, \transition, \discount) with S,A\statespace, \actionspace finite, discount γ[0,1)\discount \in [0,1), and dynamics

p(s,rs,a)P ⁣(St+1=s,Rt+1=rSt=s,At=a),\transition(s',r \mid s,a) \defeq \Prob\!\big(S_{t+1}=s',\, R_{t+1}=r \,\big|\, S_t=s,\, A_t=a\big),

a probability distribution over (s,r)(s',r) for each (s,a)(s,a). We write the state-transition kernel and expected reward as the marginals

p(ss,a)rp(s,rs,a),r(s,a)s,rp(s,rs,a)r.\transition(s' \mid s,a) \defeq \sum_{r} \transition(s',r \mid s,a), \qquad \reward(s,a) \defeq \sum_{s',r} \transition(s',r \mid s,a)\, r .

The Markov property is the modelling commitment that earns us everything that follows: because the next state and reward depend only on (St,At)(S_t, A_t), a value function needs only one argument — the current state — and the recursion below closes on itself.

The canonical reference for the finite-MDP formalism and its solution methods is Puterman Puterman (1994) ; the founding treatment of the underlying optimization principle is Bellman Bellman (1957) .

Definition 1.2 (Policy and discounted return).

A (stationary) policy π(as)\policy(a\mid s) is a distribution over actions for each state. Acting under π\policy produces a trajectory S0,A0,R1,S1,A1,R2,S_0, A_0, R_1, S_1, A_1, R_2, \dots, and the discounted return from time tt is

Gtk=0γkRt+k+1.G_t \defeq \sum_{k=0}^{\infty} \discount^{k} R_{t+k+1}.

Because rewards are bounded on a finite MDP and γ<1\discount<1, the series converges absolutely, with GtRmax/(1γ)\lvert G_t\rvert \le R_{\max}/(1-\discount) where Rmaxmaxs,ar(s,a)R_{\max} \defeq \max_{s,a}\lvert\reward(s,a)\rvert.

Value functions

A value function answers one question: starting here and acting under π\policy, how much discounted reward do I expect? The state-value fixes the starting state; the action-value also fixes the first action.

Definition 1.3 (State- and action-value functions).

For a policy π\policy, the state-value and action-value functions are

vπ(s)Eπ ⁣[GtSt=s],qπ(s,a)Eπ ⁣[GtSt=s,At=a],\valuefn_\policy(s) \defeq \E_\policy\!\big[\,G_t \,\big|\, S_t = s\,\big], \qquad \qfn_\policy(s,a) \defeq \E_\policy\!\big[\,G_t \,\big|\, S_t = s,\, A_t = a\,\big],

where Eπ\E_\policy averages over trajectories generated by π\policy and p\transition. The two are linked by vπ(s)=aπ(as)qπ(s,a)\valuefn_\policy(s) = \sum_a \policy(a\mid s)\,\qfn_\policy(s,a).

The Bellman expectation equation

Everything recursive about value functions comes from one move: split the return into the immediate reward and the discounted return from the next state, then condition on what happens in one step. That conditioning is the law of total expectation, the single most-used tool in RL theory.

Theorem 1.1 (Bellman expectation equation).

For every policy π\policy and state ss,

vπ(s)=aπ(as)s,rp(s,rs,a)[r+γvπ(s)].\valuefn_\policy(s) = \sum_{a} \policy(a\mid s) \sum_{s',r} \transition(s',r\mid s,a) \big[\, r + \discount\, \valuefn_\policy(s') \,\big].
Proof.

Start from the definition and peel off one step of the return, Gt=Rt+1+γGt+1G_t = R_{t+1} + \discount\, G_{t+1}:

vπ(s)=Eπ[Rt+1+γGt+1St=s](Def. 1.3; unroll Gt)=aπ(as)s,rp(s,rs,a)Eπ[r+γGt+1St+1=s](law of total expectation, condition on At,St+1,Rt+1)=aπ(as)s,rp(s,rs,a)[r+γEπ[Gt+1St+1=s]](linearity; r is fixed once we condition)=aπ(as)s,rp(s,rs,a)[r+γvπ(s)](Markov property: Eπ[Gt+1St+1=s]=vπ(s)).\begin{aligned} \valuefn_\policy(s) &= \E_\policy[\,R_{t+1} + \discount\, G_{t+1} \mid S_t = s\,] && \text{(Def. 1.3; unroll }G_t\text{)} \\ &= \sum_{a} \policy(a\mid s) \sum_{s',r} \transition(s',r\mid s,a)\, \E_\policy[\,r + \discount\, G_{t+1} \mid S_{t+1}=s'\,] && \text{(law of total expectation, condition on }A_t, S_{t+1}, R_{t+1}\text{)} \\ &= \sum_{a} \policy(a\mid s) \sum_{s',r} \transition(s',r\mid s,a)\, \big[\, r + \discount\, \E_\policy[\,G_{t+1}\mid S_{t+1}=s'\,] \,\big] && \text{(linearity; }r\text{ is fixed once we condition)} \\ &= \sum_{a} \policy(a\mid s) \sum_{s',r} \transition(s',r\mid s,a)\, \big[\, r + \discount\, \valuefn_\policy(s') \,\big] && \text{(Markov property: }\E_\policy[G_{t+1}\mid S_{t+1}=s']=\valuefn_\policy(s')\text{).} \end{aligned}

The Markov property is what licenses the last line: the expected return from t+1t+1 onward depends on the past only through St+1=sS_{t+1}=s'. \qquad\blacksquare

Read as a system of S\lvert\statespace\rvert linear equations in the unknowns {vπ(s)}\{\valuefn_\policy(s)\}, the Bellman expectation equation defines vπ\valuefn_\policy. Collecting it into a single map on RS\R^{\statespace} gives the operator we will iterate.

Definition 1.4 (Bellman expectation operator).

The Bellman expectation operator Tπ:RSRS\bellman^\policy : \R^{\statespace} \to \R^{\statespace} acts on a value vector vv by

(Tπv)(s)aπ(as)s,rp(s,rs,a)[r+γv(s)]=rπ(s)+γspπ(ss)v(s),(\bellman^\policy v)(s) \defeq \sum_{a} \policy(a\mid s) \sum_{s',r} \transition(s',r\mid s,a)\,\big[\, r + \discount\, v(s') \,\big] = \reward_\policy(s) + \discount \sum_{s'} \transition_\policy(s'\mid s)\, v(s'),

where rπ(s)aπ(as)r(s,a)\reward_\policy(s) \defeq \sum_a \policy(a\mid s)\reward(s,a) and pπ(ss)aπ(as)p(ss,a)\transition_\policy(s'\mid s) \defeq \sum_a \policy(a\mid s)\transition(s'\mid s,a). By Theorem 1.1, vπ\valuefn_\policy is a fixed point: Tπvπ=vπ\bellman^\policy \valuefn_\policy = \valuefn_\policy.

In vector form Tπv=rπ+γPπv\bellman^\policy v = \reward_\policy + \discount P_\policy v with PπP_\policy the S×S\lvert\statespace\rvert\times\lvert\statespace\rvert transition matrix — an affine map, so its fixed point solves the linear system (IγPπ)vπ=rπ(I - \discount P_\policy)\valuefn_\policy = \reward_\policy. We will use that closed form for policy evaluation below.

Optimality: the Bellman optimality equation

So far π\policy was fixed. Optimal control asks for the best achievable value, and — remarkably — a single policy attains it simultaneously in every state.

Definition 1.5 (Optimal value functions and optimal policy).

The optimal value functions are

v(s)maxπvπ(s),q(s,a)maxπqπ(s,a),\optvaluefn(s) \defeq \max_{\policy} \valuefn_\policy(s), \qquad \optqfn(s,a) \defeq \max_{\policy} \qfn_\policy(s,a),

the maxima taken pointwise over all policies. A policy π\policy_* is optimal if vπ=v\valuefn_{\policy_*} = \optvaluefn.

Theorem 1.2 (Bellman optimality equation).

The optimal state-value function satisfies, for every ss,

v(s)=maxaA[r(s,a)+γsp(ss,a)v(s)]=maxaq(s,a),\optvaluefn(s) = \max_{a \in \actionspace} \Big[\, \reward(s,a) + \discount \sum_{s'} \transition(s'\mid s,a)\, \optvaluefn(s') \,\Big] = \max_{a} \optqfn(s,a),

and a policy is optimal iff it is greedy with respect to v\optvaluefn — i.e. in each state it puts all mass on actions attaining the maximum.

Proof.

We argue from Bellman’s principle of optimality Bellman (1957) , assuming for now that v\optvaluefn is well defined; the next section discharges that assumption — Theorem 1.3 and Corollary 1.1 establish existence and uniqueness without using this equation, so the reasoning is not circular. An optimal trajectory’s tail is itself optimal from the state it reaches. Fix ss. Any policy chooses a first action (or distribution over them) and then continues; its value cannot exceed taking the best first action aa and continuing optimally, which is exactly maxa[r(s,a)+γsp(ss,a)v(s)]\max_a [\reward(s,a) + \discount \sum_{s'} \transition(s'\mid s,a)\,\optvaluefn(s')] — so v(s)\optvaluefn(s) is at most the right-hand side. Conversely, the policy that takes the maximizing aa and then follows an optimal policy achieves that value, so v(s)\optvaluefn(s) is at least the right-hand side. Equality gives the optimality equation; the two bounds coincide exactly when the policy is greedy for v\optvaluefn. \qquad\blacksquare

The max\max over actions makes this equation nonlinear — there is no matrix inverse to solve it.

That nonlinearity is the entire reason we iterate rather than solve. Bertsekas Bertsekas (2017) develops the resulting theory abstractly as fixed-point iteration of monotone contraction operators; we specialize it to finite MDPs.

Definition 1.6 (Bellman optimality operator).

The Bellman optimality operator T:RSRS\bellmanopt : \R^{\statespace} \to \R^{\statespace} is

(Tv)(s)maxaA[r(s,a)+γsp(ss,a)v(s)].(\bellmanopt v)(s) \defeq \max_{a \in \actionspace} \Big[\, \reward(s,a) + \discount \sum_{s'} \transition(s'\mid s,a)\, v(s') \,\Big].

Theorem 1.2 says v\optvaluefn is a fixed point: Tv=v\bellmanopt \optvaluefn = \optvaluefn.

The contraction that makes it all work

Both operators share one property, and it is the technical heart of the chapter. Recall a map TT is a γ\discount-contraction in a norm if TuTvγuv\norm{Tu - Tv} \le \discount\,\norm{u-v} for all u,vu,v.

Theorem 1.3 (Bellman operators are sup-norm contractions).

For any u,vRSu, v \in \R^{\statespace},

TuTvγuv,TπuTπvγuv.\norm{\bellmanopt u - \bellmanopt v}_\infty \le \discount\,\norm{u - v}_\infty, \qquad \norm{\bellman^\policy u - \bellman^\policy v}_\infty \le \discount\,\norm{u - v}_\infty .

Both operators are γ\discount-contractions on (RS,)(\R^{\statespace}, \norm{\cdot}_\infty).

Proof.

We prove it for T\bellmanopt; the Tπ\bellman^\policy case is Exercise 1 (it is easier — no max\max). The bound takes four short steps, each labelled with the rule it uses. Fix a state ss and write Qw(s,a)r(s,a)+γsp(ss,a)w(s)Q_w(s,a) \defeq \reward(s,a) + \discount \sum_{s'} \transition(s'\mid s,a)\,w(s'), so (Tw)(s)=maxaQw(s,a)(\bellmanopt w)(s) = \max_a Q_w(s,a). The one inequality we need is that the max\max is nonexpansive: maxaf(a)maxag(a)maxaf(a)g(a)\lvert \max_a f(a) - \max_a g(a)\rvert \le \max_a \lvert f(a)-g(a)\rvert. Then

(Tu)(s)(Tv)(s)=maxaQu(s,a)maxaQv(s,a)(definition)maxaQu(s,a)Qv(s,a)(max is nonexpansive)=maxa γsp(ss,a)(u(s)v(s))(r(s,a) cancels)γmaxasp(ss,a)u(s)v(s)(triangle inequality)γmaxasp(ss,a)uv=γuv(sp(ss,a)=1).\begin{aligned} \big\lvert (\bellmanopt u)(s) - (\bellmanopt v)(s) \big\rvert &= \big\lvert \max_a Q_u(s,a) - \max_a Q_v(s,a) \big\rvert && \text{(definition)} \\ &\le \max_a \big\lvert Q_u(s,a) - Q_v(s,a) \big\rvert && \text{($\max$ is nonexpansive)} \\ &= \max_a\ \discount \Big\lvert \textstyle\sum_{s'} \transition(s'\mid s,a)\,\big(u(s') - v(s')\big) \Big\rvert && \text{($\reward(s,a)$ cancels)} \\ &\le \discount \max_a \sum_{s'} \transition(s'\mid s,a)\, \big\lvert u(s') - v(s') \big\rvert && \text{(triangle inequality)} \\ &\le \discount \max_a \sum_{s'} \transition(s'\mid s,a)\, \norm{u - v}_\infty = \discount\, \norm{u - v}_\infty && \text{($\textstyle\sum_{s'} \transition(s'\mid s,a) = 1$).} \end{aligned}

The bound is uniform in ss, so taking the max over ss on the left gives TuTvγuv\norm{\bellmanopt u - \bellmanopt v}_\infty \le \discount\norm{u-v}_\infty. \qquad\blacksquare

The discount γ<1\discount < 1 is doing all the work: it is literally the contraction modulus. With γ=1\discount = 1 the bound is vacuous and the fixed-point theory needs extra structure (proper policies, average cost) — the subject of later asides.

Corollary 1.1 (Unique value functions and geometric convergence).

Because (RS,)(\R^{\statespace}, \norm{\cdot}_\infty) is complete and both operators are γ\discount-contractions, the Banach fixed-point theorem gives:

  1. Tπ\bellman^\policy has a unique fixed point, necessarily vπ\valuefn_\policy; and T\bellmanopt has a unique fixed point, necessarily v\optvaluefn. In particular v\optvaluefn exists and is unique, and an optimal stationary policy exists (any greedy policy for v\optvaluefn — the policy-improvement theorem below confirms such a policy attains v\optvaluefn).
  2. The iterates Vk+1TVkV_{k+1} \defeq \bellmanopt V_k converge to v\optvaluefn from any start V0V_0, geometrically: VkvγkV0v.\norm{V_k - \optvaluefn}_\infty \le \discount^{k}\, \norm{V_0 - \optvaluefn}_\infty .

This corollary is the payoff. It converts an existence question (“is there a best policy?”) into a convergent algorithm (“iterate the operator”), and it tells us the error shrinks by a factor γ\discount every sweep.

Value iteration

Value iteration is now nothing more than iterate T\bellmanopt to its fixed point.

V ← 0  (any initial vector in ℝ^𝒮)
repeat:
    V ← 𝒯* V            # one sweep of the optimality operator
until ‖ΔV‖∞ < ε(1−γ)/γ   # stopping rule, justified below
return V, and the greedy policy π(s) = argmax_a [ r(s,a) + γ Σ p(s'|s,a) V(s') ]

Corollary 1.1 guarantees convergence. The stopping rule deserves a word: a small Bellman residual Vk+1Vk\norm{V_{k+1} - V_k}_\infty bounds the true error, because

Vkvγ1γVkVk1.\norm{V_{k} - \optvaluefn}_\infty \le \frac{\discount}{1-\discount}\,\norm{V_{k} - V_{k-1}}_\infty .

This telescopes from the contraction — write Vkv=jk(Vj+1Vj)V_k - \optvaluefn = \sum_{j\ge k}(V_{j+1} - V_j) and bound each term geometrically (Exercise 5). So stopping when ΔV<ε(1γ)/γ\norm{\Delta V}_\infty < \varepsilon(1-\discount)/\discount certifies Vv<ε\norm{V - \optvaluefn}_\infty < \varepsilon — a guarantee we can check at runtime without knowing v\optvaluefn. The companion experiments/python/week01/dp.py implements exactly this loop with the explicit operator as a standalone function, as the roadmap’s Week-1 task asks.

Policy iteration

Value iteration improves the value every sweep and reads off a policy at the end. Policy iteration instead alternates two exact steps on the policy, and was Howard’s original 1960 algorithm Howard (1960) :

  1. Policy evaluation. Given π\policy, solve the linear system (IγPπ)vπ=rπ(I - \discount P_\policy)\,\valuefn_\policy = \reward_\policy for vπ\valuefn_\policy (the affine fixed point of Tπ\bellman^\policy — Def. 1.4).
  2. Policy improvement. Set π(s)=arg maxaqπ(s,a)\policy'(s) = \argmax_a \qfn_\policy(s,a), greedy w.r.t. the freshly evaluated values.

Repeat until the policy stops changing. The step that makes this work is:

Theorem 1.4 (Policy improvement theorem).

Let π\policy' be greedy with respect to vπ\valuefn_\policy, so that qπ(s,π(s))vπ(s)\qfn_\policy(s, \policy'(s)) \ge \valuefn_\policy(s) for all ss. Then vπ(s)vπ(s)\valuefn_{\policy'}(s) \ge \valuefn_\policy(s) for all ss, with strict improvement in some state unless π\policy is already optimal.

Proof.

Expand the greedy inequality and re-apply it along the trajectory:

vπ(s)qπ(s,π(s))(greedy choice)=Eπ ⁣[Rt+1+γvπ(St+1)St=s](definition of qπ under the first action π)Eπ ⁣[Rt+1+γqπ(St+1,π(St+1))St=s](apply the greedy inequality at St+1)Eπ ⁣[k0γkRt+k+1St=s]=vπ(s)(unroll; the tail telescopes to vπ).\begin{aligned} \valuefn_\policy(s) &\le \qfn_\policy(s, \policy'(s)) && \text{(greedy choice)} \\ &= \E_{\policy'}\!\big[\, R_{t+1} + \discount\, \valuefn_\policy(S_{t+1}) \mid S_t = s \,\big] && \text{(definition of }\qfn_\policy\text{ under the first action }\policy'\text{)} \\ &\le \E_{\policy'}\!\big[\, R_{t+1} + \discount\, \qfn_\policy(S_{t+1}, \policy'(S_{t+1})) \mid S_t = s \,\big] && \text{(apply the greedy inequality at }S_{t+1}\text{)} \\ &\le \cdots \le \E_{\policy'}\!\Big[\, \textstyle\sum_{k\ge 0} \discount^{k} R_{t+k+1} \mid S_t = s \,\Big] = \valuefn_{\policy'}(s) && \text{(unroll; the tail telescopes to }\valuefn_{\policy'}\text{).} \end{aligned}

If no state improves strictly, the greedy inequality holds with equality everywhere, which is the Bellman optimality equation — so π\policy is already optimal. \qquad\blacksquare

Because the MDP is finite there are finitely many deterministic policies, each iteration strictly improves until none does, so policy iteration terminates at an exact optimum in finitely many steps. Value iteration and policy iteration are the two poles of what Sutton & Barto Sutton & Barto (2018) call generalized policy iteration: any interleaving of “make the value consistent with the policy” and “make the policy greedy for the value” converges to the same fixed point. Week 2 studies what happens between the poles — asynchronous, Gauss–Seidel, and prioritized sweeps.

The dynamic-programming bridge

The roadmap’s Week-1 writing prompt is to place the Bellman equation beside its continuous cousin. They are the same principle of optimality at different limits. Discretize time and state and the cost-to-go obeys the discrete Bellman recursion above; take the limit of vanishing time-step on a continuous state and the same recursion becomes the Hamilton–Jacobi–Bellman equation. Control flips the convention from maximizing reward to minimizing a running cost =r\ell = -\reward, with cost-to-go JJ; the stationary discounted HJB equation is

ρJ(x)=mina[(x,a)+J(x) ⁣f(x,a)],ρ=lnγ,\rho\, J(x) = \min_{a}\Big[\, \ell(x,a) + \nabla J(x)^{\!\top} f(x,a) \,\Big], \qquad \rho = -\ln\discount,

whose ρJ\rho J term is the continuous-time echo of the discount γ\discount that drove the contraction above (set ρ=0\rho = 0 for the undiscounted limit). Value iteration is the fixed-point solver for the discrete equation; the LQR Riccati recursion (Ch. 13) is the closed-form solver for the HJB equation when ff is linear and \ell is quadratic. Holding this identity in view is the point of the whole curriculum: control theory and RL are two dialects for the same fixed point.

What’s next

  • Week 2 opens up the iteration itself: asynchronous and Gauss–Seidel value iteration, prioritized sweeping, real-time DP, and residual scheduling — the practical face of “iterate the contraction.”
  • Week 3 replaces the known model p\transition with sampled returns (Monte Carlo), the first step away from dynamic programming toward learning.

Exercises

  1. (Prove) Show that the Bellman expectation operator Tπ\bellman^\policy is a γ\discount-contraction in \norm{\cdot}_\infty. (This is the no-max\max case of Theorem 1.3.)

    Solution

    For any ss, (Tπu)(s)(Tπv)(s)=γspπ(ss)(u(s)v(s))(\bellman^\policy u)(s) - (\bellman^\policy v)(s) = \discount \sum_{s'} \transition_\policy(s'\mid s)\,(u(s') - v(s')) since rπ(s)\reward_\policy(s) cancels. Taking absolute values and using the triangle inequality with spπ(ss)=1\sum_{s'}\transition_\policy(s'\mid s)=1, (Tπu)(s)(Tπv)(s)γspπ(ss)uv=γuv\lvert(\bellman^\policy u)(s) - (\bellman^\policy v)(s)\rvert \le \discount \sum_{s'}\transition_\policy(s'\mid s)\norm{u-v}_\infty = \discount\norm{u-v}_\infty. Maximizing over ss gives the claim. (No max\max-nonexpansiveness step is needed, because Tπ\bellman^\policy is affine.)

  2. (Derive) State and prove the Bellman expectation equation for the action-value qπ(s,a)\qfn_\policy(s,a).

    Solution

    Conditioning on the first transition and then the next action,

    qπ(s,a)=s,rp(s,rs,a)[r+γaπ(as)qπ(s,a)].\qfn_\policy(s,a) = \sum_{s',r}\transition(s',r\mid s,a)\Big[\,r + \discount \sum_{a'}\policy(a'\mid s')\,\qfn_\policy(s',a')\,\Big].

    The derivation is identical to Theorem 1.1 with the first action held fixed at aa and the recursion closing on qπ\qfn_\policy via vπ(s)=aπ(as)qπ(s,a)\valuefn_\policy(s') = \sum_{a'}\policy(a'\mid s')\qfn_\policy(s',a').

  3. (Compute) Take the two-state MDP S={A,B}\statespace=\{A,B\}, actions {stay,switch}\{\text{stay},\text{switch}\}, γ=0.9\discount=0.9, with deterministic dynamics: in AA, stay A\to A (r=0r=0) and switch B\to B (r=0r=0); in BB, stay B\to B (r=1r=1) and switch A\to A (r=0r=0). Evaluate the always-stay policy by solving (IγPπ)vπ=rπ(I-\discount P_\policy)\valuefn_\policy = \reward_\policy.

    Solution

    Under always-stay, Pπ=IP_\policy = I and rπ=(0,1)\reward_\policy = (0, 1), so (I0.9I)vπ=(0,1)(I - 0.9 I)\valuefn_\policy = (0,1) gives 0.1vπ=(0,1)0.1\,\valuefn_\policy = (0,1), i.e. vπ=(0,10)\valuefn_\policy = (0, 10). State BB self-loops collecting 11 forever (1/(1γ)=101/(1-\discount)=10); state AA self-loops collecting 00. This is the exact case the companion test asserts.

  4. (Prove) Derive the geometric error bound of Corollary 1.1(2), VkvγkV0v\norm{V_k - \optvaluefn}_\infty \le \discount^{k}\norm{V_0 - \optvaluefn}_\infty, from the contraction property and Tv=v\bellmanopt\optvaluefn=\optvaluefn.

    Solution

    Vkv=TVk1TvγVk1v\norm{V_k - \optvaluefn}_\infty = \norm{\bellmanopt V_{k-1} - \bellmanopt \optvaluefn}_\infty \le \discount\norm{V_{k-1}-\optvaluefn}_\infty by Theorem 1.3 and the fixed-point identity. Iterating the inequality kk times gives the bound.

  5. (Implement) Add the certified stopping rule to value iteration: stop when ΔV<ε(1γ)/γ\norm{\Delta V}_\infty < \varepsilon(1-\discount)/\discount and verify empirically that the returned VV satisfies Vv<ε\norm{V-\optvaluefn}_\infty < \varepsilon on the two-state MDP (using the analytic v=(9,10)\optvaluefn=(9,10)). Run against experiments/python/week01/dp.py.

    Solution

    The bound follows by writing Vkv=jk(Vj+1Vj)V_k - \optvaluefn = \sum_{j\ge k}(V_{j+1}-V_j) and bounding each term by the contraction: Vj+1VjγjkVkVk1\norm{V_{j+1}-V_j}_\infty \le \discount^{\,j-k}\norm{V_{k}-V_{k-1}}_\infty, a geometric series summing to γ1γΔV\frac{\discount}{1-\discount}\norm{\Delta V}_\infty. The companion’s value_iteration(..., tol=ε) implements this and its test asserts the resulting error is below ε\varepsilon.

  6. (Extend) Show the optimal value function is the solution of the linear program minvsv(s)\min_{v}\sum_s v(s) subject to v(s)r(s,a)+γsp(ss,a)v(s)v(s) \ge \reward(s,a) + \discount \sum_{s'}\transition(s'\mid s,a)v(s') for all s,as,a.

    Solution

    Feasibility means vTvv \ge \bellmanopt v pointwise; by monotonicity of T\bellmanopt this implies vvv \ge \optvaluefn, so v\optvaluefn is the smallest feasible point and minimizing sv(s)\sum_s v(s) selects it. The constraints linearize the max\max (one inequality per action), turning the nonlinear optimality equation into an LP — the basis of the dual/occupancy-measure view revisited with safe RL in Week 25.

Companion code

The Week-1 companion lives at experiments/python/week01/ (the repo’s three-language convention). It is pure NumPy at the core — the algorithms and the correctness test carry no environment dependency — with an optional Gymnasium adapter for the canonical FrozenLake showcase.

  • dp.py — the Bellman optimality operator as an explicit function, plus value_iteration, policy_evaluation (exact linear solve), policy_improvement, and policy_iteration, all on a generic finite MDP (P, R, gamma).
  • test_dp.py — mathematical-correctness tests: convergence to the analytic v=(9,10)\optvaluefn=(9,10) on the two-state MDP of Exercise 3, the contraction inequality of Theorem 1.3, the fixed-point identity, the geometric bound, and VI/PI agreement (FrozenLake checks run only when gymnasium is installed).
  • frozenlake.py — builds (P, R, gamma) from gymnasium’s FrozenLake-v1 and runs VI and PI as a worked showcase.
# core algorithms + correctness tests (no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week01/test_dp.py -q

# canonical FrozenLake showcase (optional extra: pip install "gymnasium")
PYTHONPATH=. python experiments/python/week01/frozenlake.py