Markov Decision Processes and Dynamic Programming
The finite MDP, the Bellman expectation and optimality equations, and the gamma-contraction that makes value iteration and policy iteration converge — dynamic programming as the spine the rest of the curriculum returns to.
On this page
Markov Decision Processes and Dynamic Programming
Where we are. This is the first chapter, and it states the object the whole curriculum orbits: the finite Markov decision process and the two equations that pin down optimal behaviour in it. The claim is small and load-bearing — the optimal value function is the unique fixed point of a -contraction, and value iteration and policy iteration are two ways of reaching that fixed point. Everything later — LQR, MPC, model-based and model-free RL — computes, approximates, or learns this same fixed point when the MDP is too large, too continuous, or unknown.
Notation and conventions
We use the case convention of Sutton & Barto Sutton & Barto (2018) throughout: lowercase names a true object we want (), and uppercase names an estimate we hold and update ().
| Symbol | Meaning | |---|---| | | finite state and action sets; | | | dynamics — joint probability of next state and reward | | | policy — probability of action in state | | | discount factor (strict inequality is load-bearing) | | | state- and action-value functions of (true) | | | optimal value functions | | | the -th value iterate (an estimate of ) | | | Bellman expectation / optimality operators | | | sup norm on : |
A value function on a finite state set is a vector in , a point in -dimensional space. The operators below move that point around, and “solving the MDP” means finding the one point they hold still.
The finite Markov decision process
A Markov decision process is a controlled Markov chain: at each step the agent sees a state, picks an action, and the environment returns a reward and a next state, with the Markov property that the future depends on the past only through the current state.
A finite Markov decision process is a tuple with finite, discount , and dynamics
a probability distribution over for each . We write the state-transition kernel and expected reward as the marginals
The Markov property is the modelling commitment that earns us everything that follows: because the next state and reward depend only on , a value function needs only one argument — the current state — and the recursion below closes on itself. The canonical reference for the finite-MDP formalism and its solution methods is Puterman Puterman (1994) ; the founding treatment of the underlying optimization principle is Bellman Bellman (1957) .
A (stationary) policy is a distribution over actions for each state. Acting under produces a trajectory , and the discounted return from time is
Because rewards are bounded on a finite MDP and , the series converges absolutely, with where .
Value functions
A value function answers one question: starting here and acting under , how much discounted reward do I expect? The state-value fixes the starting state; the action-value also fixes the first action.
For a policy , the state-value and action-value functions are
where averages over trajectories generated by and . The two are linked by .
The Bellman expectation equation
Everything recursive about value functions comes from one move: split the return into the immediate reward and the discounted return from the next state, then condition on what happens in one step. That conditioning is the law of total expectation, the single most-used tool in RL theory.
For every policy and state ,
Start from the definition and peel off one step of the return, :
The Markov property is what licenses the last line: the expected return from onward depends on the past only through .
Read as a system of linear equations in the unknowns , the Bellman expectation equation defines . Collecting it into a single map on gives the operator we will iterate.
The Bellman expectation operator acts on a value vector by
where and . By Theorem 1.1, is a fixed point: .
In vector form with the transition matrix — an affine map, so its fixed point solves the linear system . We will use that closed form for policy evaluation below.
Optimality: the Bellman optimality equation
So far was fixed. Optimal control asks for the best achievable value, and — remarkably — a single policy attains it simultaneously in every state.
The optimal value functions are
the maxima taken pointwise over all policies. A policy is optimal if .
The optimal state-value function satisfies, for every ,
and a policy is optimal iff it is greedy with respect to — i.e. in each state it puts all mass on actions attaining the maximum.
We argue from Bellman’s principle of optimality Bellman (1957) , assuming for now that is well defined; the next section discharges that assumption — Theorem 1.3 and Corollary 1.1 establish existence and uniqueness without using this equation, so the reasoning is not circular. An optimal trajectory’s tail is itself optimal from the state it reaches. Fix . Any policy chooses a first action (or distribution over them) and then continues; its value cannot exceed taking the best first action and continuing optimally, which is exactly — so is at most the right-hand side. Conversely, the policy that takes the maximizing and then follows an optimal policy achieves that value, so is at least the right-hand side. Equality gives the optimality equation; the two bounds coincide exactly when the policy is greedy for .
The over actions makes this equation nonlinear — there is no matrix inverse to solve it. That nonlinearity is the entire reason we iterate rather than solve. Bertsekas Bertsekas (2017) develops the resulting theory abstractly as fixed-point iteration of monotone contraction operators; we specialize it to finite MDPs.
The Bellman optimality operator is
Theorem 1.2 says is a fixed point: .
The contraction that makes it all work
Both operators share one property, and it is the technical heart of the chapter. Recall a map is a -contraction in a norm if for all .
For any ,
Both operators are -contractions on .
We prove it for ; the case is Exercise 1 (it is easier — no ). The bound takes four short steps, each labelled with the rule it uses. Fix a state and write , so . The one inequality we need is that the is nonexpansive: . Then
The bound is uniform in , so taking the max over on the left gives .
The discount is doing all the work: it is literally the contraction modulus. With the bound is vacuous and the fixed-point theory needs extra structure (proper policies, average cost) — the subject of later asides.
Because is complete and both operators are -contractions, the Banach fixed-point theorem gives:
- has a unique fixed point, necessarily ; and has a unique fixed point, necessarily . In particular exists and is unique, and an optimal stationary policy exists (any greedy policy for — the policy-improvement theorem below confirms such a policy attains ).
- The iterates converge to from any start , geometrically:
This corollary is the payoff. It converts an existence question (“is there a best policy?”) into a convergent algorithm (“iterate the operator”), and it tells us the error shrinks by a factor every sweep.
Value iteration
Value iteration is now nothing more than iterate to its fixed point.
V ← 0 (any initial vector in ℝ^𝒮)
repeat:
V ← 𝒯* V # one sweep of the optimality operator
until ‖ΔV‖∞ < ε(1−γ)/γ # stopping rule, justified below
return V, and the greedy policy π(s) = argmax_a [ r(s,a) + γ Σ p(s'|s,a) V(s') ]
Corollary 1.1 guarantees convergence. The stopping rule deserves a word: a small Bellman residual bounds the true error, because
This telescopes from the contraction — write and bound each term geometrically (Exercise 5). So stopping when
certifies
— a guarantee we can check at
runtime without knowing . The companion
experiments/python/week01/dp.py implements exactly this loop with the explicit
operator as a standalone function, as the roadmap’s Week-1 task asks.
Policy iteration
Value iteration improves the value every sweep and reads off a policy at the end. Policy iteration instead alternates two exact steps on the policy, and was Howard’s original 1960 algorithm Howard (1960) :
- Policy evaluation. Given , solve the linear system for (the affine fixed point of — Def. 1.4).
- Policy improvement. Set , greedy w.r.t. the freshly evaluated values.
Repeat until the policy stops changing. The step that makes this work is:
Let be greedy with respect to , so that for all . Then for all , with strict improvement in some state unless is already optimal.
Expand the greedy inequality and re-apply it along the trajectory:
If no state improves strictly, the greedy inequality holds with equality everywhere, which is the Bellman optimality equation — so is already optimal.
Because the MDP is finite there are finitely many deterministic policies, each iteration strictly improves until none does, so policy iteration terminates at an exact optimum in finitely many steps. Value iteration and policy iteration are the two poles of what Sutton & Barto Sutton & Barto (2018) call generalized policy iteration: any interleaving of “make the value consistent with the policy” and “make the policy greedy for the value” converges to the same fixed point. Week 2 studies what happens between the poles — asynchronous, Gauss–Seidel, and prioritized sweeps.
The dynamic-programming bridge
The roadmap’s Week-1 writing prompt is to place the Bellman equation beside its continuous cousin. They are the same principle of optimality at different limits. Discretize time and state and the cost-to-go obeys the discrete Bellman recursion above; take the limit of vanishing time-step on a continuous state and the same recursion becomes the Hamilton–Jacobi–Bellman equation. Control flips the convention from maximizing reward to minimizing a running cost , with cost-to-go ; the stationary discounted HJB equation is
whose term is the continuous-time echo of the discount that drove the contraction above (set for the undiscounted limit). Value iteration is the fixed-point solver for the discrete equation; the LQR Riccati recursion (Ch. 13) is the closed-form solver for the HJB equation when is linear and is quadratic. Holding this identity in view is the point of the whole curriculum: control theory and RL are two dialects for the same fixed point.
What’s next
- Week 2 opens up the iteration itself: asynchronous and Gauss–Seidel value iteration, prioritized sweeping, real-time DP, and residual scheduling — the practical face of “iterate the contraction.”
- Week 3 replaces the known model with sampled returns (Monte Carlo), the first step away from dynamic programming toward learning.
Exercises
-
(Prove) Show that the Bellman expectation operator is a -contraction in . (This is the no- case of Theorem 1.3.)
Solution
For any , since cancels. Taking absolute values and using the triangle inequality with , . Maximizing over gives the claim. (No -nonexpansiveness step is needed, because is affine.)
-
(Derive) State and prove the Bellman expectation equation for the action-value .
Solution
Conditioning on the first transition and then the next action,
The derivation is identical to Theorem 1.1 with the first action held fixed at and the recursion closing on via .
-
(Compute) Take the two-state MDP , actions , , with deterministic dynamics: in , stay () and switch (); in , stay () and switch (). Evaluate the always-stay policy by solving .
Solution
Under always-stay, and , so gives , i.e. . State self-loops collecting forever (); state self-loops collecting . This is the exact case the companion test asserts.
-
(Prove) Derive the geometric error bound of Corollary 1.1(2), , from the contraction property and .
Solution
by Theorem 1.3 and the fixed-point identity. Iterating the inequality times gives the bound.
-
(Implement) Add the certified stopping rule to value iteration: stop when and verify empirically that the returned satisfies on the two-state MDP (using the analytic ). Run against
experiments/python/week01/dp.py.Solution
The bound follows by writing and bounding each term by the contraction: , a geometric series summing to . The companion’s
value_iteration(..., tol=ε)implements this and its test asserts the resulting error is below . -
(Extend) Show the optimal value function is the solution of the linear program subject to for all .
Solution
Feasibility means pointwise; by monotonicity of this implies , so is the smallest feasible point and minimizing selects it. The constraints linearize the (one inequality per action), turning the nonlinear optimality equation into an LP — the basis of the dual/occupancy-measure view revisited with safe RL in Week 25.
Companion code
The Week-1 companion lives at experiments/python/week01/ (the repo’s
three-language convention). It is pure NumPy at the core — the algorithms and the
correctness test carry no environment dependency — with an optional Gymnasium
adapter for the canonical FrozenLake showcase.
dp.py— the Bellman optimality operator as an explicit function, plusvalue_iteration,policy_evaluation(exact linear solve),policy_improvement, andpolicy_iteration, all on a generic finite MDP(P, R, gamma).test_dp.py— mathematical-correctness tests: convergence to the analytic on the two-state MDP of Exercise 3, the contraction inequality of Theorem 1.3, the fixed-point identity, the geometric bound, and VI/PI agreement (FrozenLake checks run only whengymnasiumis installed).frozenlake.py— builds(P, R, gamma)fromgymnasium’sFrozenLake-v1and runs VI and PI as a worked showcase.
# core algorithms + correctness tests (no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week01/test_dp.py -q
# canonical FrozenLake showcase (optional extra: pip install "gymnasium")
PYTHONPATH=. python experiments/python/week01/frozenlake.py