Part I · Foundations Week 1 Published dp.py test_dp.py

Markov Decision Processes and Dynamic Programming

The finite MDP, the Bellman expectation and optimality equations, and the gamma-contraction that makes value iteration and policy iteration converge — dynamic programming as the spine the rest of the curriculum returns to.

Markov Decision Processes and Dynamic Programming

Where we are. This is the first chapter, and it states the object the whole curriculum orbits: the finite Markov decision process and the two equations that pin down optimal behaviour in it. The claim is small and load-bearing — the optimal value function is the unique fixed point of a γ\discount-contraction, and value iteration and policy iteration are two ways of reaching that fixed point. Everything later — LQR, MPC, model-based and model-free RL — computes, approximates, or learns this same fixed point when the MDP is too large, too continuous, or unknown.

Notation and conventions

We use the case convention of Sutton & Barto Sutton & Barto (2018) throughout: lowercase names a true object we want (vπ\valuefn_\policy), and uppercase names an estimate we hold and update (VkV_k).

| Symbol | Meaning | |---|---| | S, A\statespace,\ \actionspace | finite state and action sets; S,A<\lvert\statespace\rvert, \lvert\actionspace\rvert < \infty | | p(s,rs,a)\transition(s',r \mid s,a) | dynamics — joint probability of next state ss' and reward rr | | π(as)\policy(a \mid s) | policy — probability of action aa in state ss | | γ[0,1)\discount \in [0,1) | discount factor (strict inequality is load-bearing) | | vπ, qπ\valuefn_\policy,\ \qfn_\policy | state- and action-value functions of π\policy (true) | | v, q\optvaluefn,\ \optqfn | optimal value functions | | VkRSV_k \in \R^{\statespace} | the kk-th value iterate (an estimate of v\optvaluefn) | | Tπ, T\bellman^\policy,\ \bellmanopt | Bellman expectation / optimality operators | | \norm{\cdot}_\infty | sup norm on RS\R^{\statespace}: vmaxsv(s)\norm{v}_\infty \defeq \max_{s} \lvert v(s)\rvert |

A value function on a finite state set is a vector in RS\R^{\statespace}, a point in S\lvert\statespace\rvert-dimensional space.

The operators below move that point around, and “solving the MDP” means finding the one point they hold still.

The finite Markov decision process

A Markov decision process is a controlled Markov chain: at each step the agent sees a state, picks an action, and the environment returns a reward and a next state, with the Markov property that the future depends on the past only through the current state.

Definition 1.1 (Finite MDP).

A finite Markov decision process is a tuple (S,A,p,γ)(\statespace, \actionspace, \transition, \discount) with S,A\statespace, \actionspace finite, discount γ[0,1)\discount \in [0,1), and dynamics

p(s,rs,a)P ⁣(St+1=s,Rt+1=rSt=s,At=a),\transition(s',r \mid s,a) \defeq \Prob\!\big(S_{t+1}=s',\, R_{t+1}=r \,\big|\, S_t=s,\, A_t=a\big),

a probability distribution over (s,r)(s',r) for each (s,a)(s,a). We write the state-transition kernel and expected reward as the marginals

p(ss,a)rp(s,rs,a),r(s,a)s,rp(s,rs,a)r.\transition(s' \mid s,a) \defeq \sum_{r} \transition(s',r \mid s,a), \qquad \reward(s,a) \defeq \sum_{s',r} \transition(s',r \mid s,a)\, r .

The Markov property is the modelling commitment that earns us everything that follows: because the next state and reward depend only on (St,At)(S_t, A_t), a value function needs only one argument — the current state — and the recursion below closes on itself.

The canonical reference for the finite-MDP formalism and its solution methods is Puterman Puterman (1994) ; the founding treatment of the underlying optimization principle is Bellman Bellman (1957) .

Definition 1.2 (Policy and discounted return).

A (stationary) policy π(as)\policy(a\mid s) is a distribution over actions for each state. Acting under π\policy produces a trajectory S0,A0,R1,S1,A1,R2,S_0, A_0, R_1, S_1, A_1, R_2, \dots, and the discounted return from time tt is

Gtk=0γkRt+k+1.G_t \defeq \sum_{k=0}^{\infty} \discount^{k} R_{t+k+1}.

Because rewards are bounded on a finite MDP and γ<1\discount<1, the series converges absolutely, with GtRmax/(1γ)\lvert G_t\rvert \le R_{\max}/(1-\discount) where Rmaxmaxs,ar(s,a)R_{\max} \defeq \max_{s,a}\lvert\reward(s,a)\rvert.

Value functions

A value function answers one question: starting here and acting under π\policy, how much discounted reward do I expect? The state-value fixes the starting state; the action-value also fixes the first action.

Definition 1.3 (State- and action-value functions).

For a policy π\policy, the state-value and action-value functions are

vπ(s)Eπ ⁣[GtSt=s],qπ(s,a)Eπ ⁣[GtSt=s,At=a],\valuefn_\policy(s) \defeq \E_\policy\!\big[\,G_t \,\big|\, S_t = s\,\big], \qquad \qfn_\policy(s,a) \defeq \E_\policy\!\big[\,G_t \,\big|\, S_t = s,\, A_t = a\,\big],

where Eπ\E_\policy averages over trajectories generated by π\policy and p\transition. The two are linked by vπ(s)=aπ(as)qπ(s,a)\valuefn_\policy(s) = \sum_a \policy(a\mid s)\,\qfn_\policy(s,a).

The Bellman expectation equation

Everything recursive about value functions comes from one move: split the return into the immediate reward and the discounted return from the next state, then condition on what happens in one step. That conditioning is the law of total expectation, the single most-used tool in RL theory.

Theorem 1.1 (Bellman expectation equation).

For every policy π\policy and state ss,

vπ(s)=aπ(as)s,rp(s,rs,a)[r+γvπ(s)].\valuefn_\policy(s) = \sum_{a} \policy(a\mid s) \sum_{s',r} \transition(s',r\mid s,a) \big[\, r + \discount\, \valuefn_\policy(s') \,\big].
Proof.

Start from the definition and peel off one step of the return, Gt=Rt+1+γGt+1G_t = R_{t+1} + \discount\, G_{t+1}:

vπ(s)=Eπ[Rt+1+γGt+1St=s](Def. 1.3; unroll Gt)=aπ(as)s,rp(s,rs,a)Eπ[r+γGt+1St+1=s](law of total expectation, condition on At,St+1,Rt+1)=aπ(as)s,rp(s,rs,a)[r+γEπ[Gt+1St+1=s]](linearity; r is fixed once we condition)=aπ(as)s,rp(s,rs,a)[r+γvπ(s)](Markov property: Eπ[Gt+1St+1=s]=vπ(s)).\begin{aligned} \valuefn_\policy(s) &= \E_\policy[\,R_{t+1} + \discount\, G_{t+1} \mid S_t = s\,] && \text{(Def. 1.3; unroll }G_t\text{)} \\ &= \sum_{a} \policy(a\mid s) \sum_{s',r} \transition(s',r\mid s,a)\, \E_\policy[\,r + \discount\, G_{t+1} \mid S_{t+1}=s'\,] && \text{(law of total expectation, condition on }A_t, S_{t+1}, R_{t+1}\text{)} \\ &= \sum_{a} \policy(a\mid s) \sum_{s',r} \transition(s',r\mid s,a)\, \big[\, r + \discount\, \E_\policy[\,G_{t+1}\mid S_{t+1}=s'\,] \,\big] && \text{(linearity; }r\text{ is fixed once we condition)} \\ &= \sum_{a} \policy(a\mid s) \sum_{s',r} \transition(s',r\mid s,a)\, \big[\, r + \discount\, \valuefn_\policy(s') \,\big] && \text{(Markov property: }\E_\policy[G_{t+1}\mid S_{t+1}=s']=\valuefn_\policy(s')\text{).} \end{aligned}

The Markov property is what licenses the last line: the expected return from t+1t+1 onward depends on the past only through St+1=sS_{t+1}=s'. \qquad\blacksquare

Read as a system of S\lvert\statespace\rvert linear equations in the unknowns {vπ(s)}\{\valuefn_\policy(s)\}, the Bellman expectation equation defines vπ\valuefn_\policy. Collecting it into a single map on RS\R^{\statespace} gives the operator we will iterate.

Definition 1.4 (Bellman expectation operator).

The Bellman expectation operator Tπ:RSRS\bellman^\policy : \R^{\statespace} \to \R^{\statespace} acts on a value vector vv by

(Tπv)(s)aπ(as)s,rp(s,rs,a)[r+γv(s)]=rπ(s)+γspπ(ss)v(s),(\bellman^\policy v)(s) \defeq \sum_{a} \policy(a\mid s) \sum_{s',r} \transition(s',r\mid s,a)\,\big[\, r + \discount\, v(s') \,\big] = \reward_\policy(s) + \discount \sum_{s'} \transition_\policy(s'\mid s)\, v(s'),

where rπ(s)aπ(as)r(s,a)\reward_\policy(s) \defeq \sum_a \policy(a\mid s)\reward(s,a) and pπ(ss)aπ(as)p(ss,a)\transition_\policy(s'\mid s) \defeq \sum_a \policy(a\mid s)\transition(s'\mid s,a). By Theorem 1.1, vπ\valuefn_\policy is a fixed point: Tπvπ=vπ\bellman^\policy \valuefn_\policy = \valuefn_\policy.

In vector form Tπv=rπ+γPπv\bellman^\policy v = \reward_\policy + \discount P_\policy v with PπP_\policy the S×S\lvert\statespace\rvert\times\lvert\statespace\rvert transition matrix — an affine map, so its fixed point solves the linear system (IγPπ)vπ=rπ(I - \discount P_\policy)\valuefn_\policy = \reward_\policy. We will use that closed form for policy evaluation below.

Optimality: the Bellman optimality equation

So far π\policy was fixed. Optimal control asks for the best achievable value, and — remarkably — a single policy attains it simultaneously in every state.

Definition 1.5 (Optimal value functions and optimal policy).

The optimal value functions are

v(s)maxπvπ(s),q(s,a)maxπqπ(s,a),\optvaluefn(s) \defeq \max_{\policy} \valuefn_\policy(s), \qquad \optqfn(s,a) \defeq \max_{\policy} \qfn_\policy(s,a),

the maxima taken pointwise over all policies. A policy π\policy_* is optimal if vπ=v\valuefn_{\policy_*} = \optvaluefn.

Theorem 1.2 (Bellman optimality equation).

The optimal state-value function satisfies, for every ss,

v(s)=maxaA[r(s,a)+γsp(ss,a)v(s)]=maxaq(s,a),\optvaluefn(s) = \max_{a \in \actionspace} \Big[\, \reward(s,a) + \discount \sum_{s'} \transition(s'\mid s,a)\, \optvaluefn(s') \,\Big] = \max_{a} \optqfn(s,a),

and a policy is optimal iff it is greedy with respect to v\optvaluefn — i.e. in each state it puts all mass on actions attaining the maximum.

Proof.

We argue from Bellman’s principle of optimality Bellman (1957) , assuming for now that v\optvaluefn is well defined; the next section discharges that assumption — Theorem 1.3 and Corollary 1.1 establish existence and uniqueness without using this equation, so the reasoning is not circular. An optimal trajectory’s tail is itself optimal from the state it reaches. Fix ss. Any policy chooses a first action (or distribution over them) and then continues; its value cannot exceed taking the best first action aa and continuing optimally, which is exactly maxa[r(s,a)+γsp(ss,a)v(s)]\max_a [\reward(s,a) + \discount \sum_{s'} \transition(s'\mid s,a)\,\optvaluefn(s')] — so v(s)\optvaluefn(s) is at most the right-hand side. Conversely, the policy that takes the maximizing aa and then follows an optimal policy achieves that value, so v(s)\optvaluefn(s) is at least the right-hand side. Equality gives the optimality equation; the two bounds coincide exactly when the policy is greedy for v\optvaluefn. \qquad\blacksquare

The max\max over actions makes this equation nonlinear — there is no matrix inverse to solve it.

That nonlinearity is the entire reason we iterate rather than solve. Bertsekas Bertsekas (2017) develops the resulting theory abstractly as fixed-point iteration of monotone contraction operators; we specialize it to finite MDPs.

Definition 1.6 (Bellman optimality operator).

The Bellman optimality operator T:RSRS\bellmanopt : \R^{\statespace} \to \R^{\statespace} is

(Tv)(s)maxaA[r(s,a)+γsp(ss,a)v(s)].(\bellmanopt v)(s) \defeq \max_{a \in \actionspace} \Big[\, \reward(s,a) + \discount \sum_{s'} \transition(s'\mid s,a)\, v(s') \,\Big].

Theorem 1.2 says v\optvaluefn is a fixed point: Tv=v\bellmanopt \optvaluefn = \optvaluefn.

The contraction that makes it all work

Both operators share one property, and it is the technical heart of the chapter. Recall a map TT is a γ\discount-contraction in a norm if TuTvγuv\norm{Tu - Tv} \le \discount\,\norm{u-v} for all u,vu,v.

Theorem 1.3 (Bellman operators are sup-norm contractions).

For any u,vRSu, v \in \R^{\statespace},

TuTvγuv,TπuTπvγuv.\norm{\bellmanopt u - \bellmanopt v}_\infty \le \discount\,\norm{u - v}_\infty, \qquad \norm{\bellman^\policy u - \bellman^\policy v}_\infty \le \discount\,\norm{u - v}_\infty .

Both operators are γ\discount-contractions on (RS,)(\R^{\statespace}, \norm{\cdot}_\infty).

Proof.

We prove it for T\bellmanopt; the Tπ\bellman^\policy case is Exercise 1 (it is easier — no max\max). The bound takes four short steps, each labelled with the rule it uses. Fix a state ss and write Qw(s,a)r(s,a)+γsp(ss,a)w(s)Q_w(s,a) \defeq \reward(s,a) + \discount \sum_{s'} \transition(s'\mid s,a)\,w(s'), so (Tw)(s)=maxaQw(s,a)(\bellmanopt w)(s) = \max_a Q_w(s,a). The one inequality we need is that the max\max is nonexpansive: maxaf(a)maxag(a)maxaf(a)g(a)\lvert \max_a f(a) - \max_a g(a)\rvert \le \max_a \lvert f(a)-g(a)\rvert. Then

(Tu)(s)(Tv)(s)=maxaQu(s,a)maxaQv(s,a)(definition)maxaQu(s,a)Qv(s,a)(max is nonexpansive)=maxa γsp(ss,a)(u(s)v(s))(r(s,a) cancels)γmaxasp(ss,a)u(s)v(s)(triangle inequality)γmaxasp(ss,a)uv=γuv(sp(ss,a)=1).\begin{aligned} \big\lvert (\bellmanopt u)(s) - (\bellmanopt v)(s) \big\rvert &= \big\lvert \max_a Q_u(s,a) - \max_a Q_v(s,a) \big\rvert && \text{(definition)} \\ &\le \max_a \big\lvert Q_u(s,a) - Q_v(s,a) \big\rvert && \text{($\max$ is nonexpansive)} \\ &= \max_a\ \discount \Big\lvert \textstyle\sum_{s'} \transition(s'\mid s,a)\,\big(u(s') - v(s')\big) \Big\rvert && \text{($\reward(s,a)$ cancels)} \\ &\le \discount \max_a \sum_{s'} \transition(s'\mid s,a)\, \big\lvert u(s') - v(s') \big\rvert && \text{(triangle inequality)} \\ &\le \discount \max_a \sum_{s'} \transition(s'\mid s,a)\, \norm{u - v}_\infty = \discount\, \norm{u - v}_\infty && \text{($\textstyle\sum_{s'} \transition(s'\mid s,a) = 1$).} \end{aligned}

The bound is uniform in ss, so taking the max over ss on the left gives TuTvγuv\norm{\bellmanopt u - \bellmanopt v}_\infty \le \discount\norm{u-v}_\infty. \qquad\blacksquare

The discount γ<1\discount < 1 is doing all the work: it is literally the contraction modulus. With γ=1\discount = 1 the bound is vacuous and the fixed-point theory needs extra structure (proper policies, average cost) — the subject of later asides.

Corollary 1.1 (Unique value functions and geometric convergence).

Because (RS,)(\R^{\statespace}, \norm{\cdot}_\infty) is complete and both operators are γ\discount-contractions, the Banach fixed-point theorem gives:

  1. Tπ\bellman^\policy has a unique fixed point, necessarily vπ\valuefn_\policy; and T\bellmanopt has a unique fixed point, necessarily v\optvaluefn. In particular v\optvaluefn exists and is unique, and an optimal stationary policy exists (any greedy policy for v\optvaluefn — the policy-improvement theorem below confirms such a policy attains v\optvaluefn).
  2. The iterates Vk+1TVkV_{k+1} \defeq \bellmanopt V_k converge to v\optvaluefn from any start V0V_0, geometrically: VkvγkV0v.\norm{V_k - \optvaluefn}_\infty \le \discount^{k}\, \norm{V_0 - \optvaluefn}_\infty .

This corollary is the payoff. It converts an existence question (“is there a best policy?”) into a convergent algorithm (“iterate the operator”), and it tells us the error shrinks by a factor γ\discount every sweep.

Value iteration

Value iteration is now nothing more than iterate T\bellmanopt to its fixed point.

V ← 0  (any initial vector in ℝ^𝒮)
repeat:
    V ← 𝒯* V            # one sweep of the optimality operator
until ‖ΔV‖∞ < ε(1−γ)/γ   # stopping rule, justified below
return V, and the greedy policy π(s) = argmax_a [ r(s,a) + γ Σ p(s'|s,a) V(s') ]

Corollary 1.1 guarantees convergence. The stopping rule deserves a word: a small Bellman residual Vk+1Vk\norm{V_{k+1} - V_k}_\infty bounds the true error, because

Vkvγ1γVkVk1.\norm{V_{k} - \optvaluefn}_\infty \le \frac{\discount}{1-\discount}\,\norm{V_{k} - V_{k-1}}_\infty .

This telescopes from the contraction — write Vkv=jk(Vj+1Vj)V_k - \optvaluefn = \sum_{j\ge k}(V_{j+1} - V_j) and bound each term geometrically (Exercise 5). So stopping when ΔV<ε(1γ)/γ\norm{\Delta V}_\infty < \varepsilon(1-\discount)/\discount certifies Vv<ε\norm{V - \optvaluefn}_\infty < \varepsilon — a guarantee we can check at runtime without knowing v\optvaluefn. The companion experiments/python/week01/dp.py implements exactly this loop with the explicit operator as a standalone function, as the roadmap’s Week-1 task asks.

Policy iteration

Value iteration improves the value every sweep and reads off a policy at the end. Policy iteration instead alternates two exact steps on the policy, and was Howard’s original 1960 algorithm Howard (1960) :

  1. Policy evaluation. Given π\policy, solve the linear system (IγPπ)vπ=rπ(I - \discount P_\policy)\,\valuefn_\policy = \reward_\policy for vπ\valuefn_\policy (the affine fixed point of Tπ\bellman^\policy — Def. 1.4).
  2. Policy improvement. Set π(s)=arg maxaqπ(s,a)\policy'(s) = \argmax_a \qfn_\policy(s,a), greedy w.r.t. the freshly evaluated values.

Repeat until the policy stops changing. The step that makes this work is:

Theorem 1.4 (Policy improvement theorem).

Let π\policy' be greedy with respect to vπ\valuefn_\policy, so that qπ(s,π(s))vπ(s)\qfn_\policy(s, \policy'(s)) \ge \valuefn_\policy(s) for all ss. Then vπ(s)vπ(s)\valuefn_{\policy'}(s) \ge \valuefn_\policy(s) for all ss, with strict improvement in some state unless π\policy is already optimal.

Proof.

Expand the greedy inequality and re-apply it along the trajectory:

vπ(s)qπ(s,π(s))(greedy choice)=Eπ ⁣[Rt+1+γvπ(St+1)St=s](definition of qπ under the first action π)Eπ ⁣[Rt+1+γqπ(St+1,π(St+1))St=s](apply the greedy inequality at St+1)Eπ ⁣[k0γkRt+k+1St=s]=vπ(s)(unroll; the tail telescopes to vπ).\begin{aligned} \valuefn_\policy(s) &\le \qfn_\policy(s, \policy'(s)) && \text{(greedy choice)} \\ &= \E_{\policy'}\!\big[\, R_{t+1} + \discount\, \valuefn_\policy(S_{t+1}) \mid S_t = s \,\big] && \text{(definition of }\qfn_\policy\text{ under the first action }\policy'\text{)} \\ &\le \E_{\policy'}\!\big[\, R_{t+1} + \discount\, \qfn_\policy(S_{t+1}, \policy'(S_{t+1})) \mid S_t = s \,\big] && \text{(apply the greedy inequality at }S_{t+1}\text{)} \\ &\le \cdots \le \E_{\policy'}\!\Big[\, \textstyle\sum_{k\ge 0} \discount^{k} R_{t+k+1} \mid S_t = s \,\Big] = \valuefn_{\policy'}(s) && \text{(unroll; the tail telescopes to }\valuefn_{\policy'}\text{).} \end{aligned}

If no state improves strictly, the greedy inequality holds with equality everywhere, which is the Bellman optimality equation — so π\policy is already optimal. \qquad\blacksquare

Because the MDP is finite there are finitely many deterministic policies, each iteration strictly improves until none does, so policy iteration terminates at an exact optimum in finitely many steps. Value iteration and policy iteration are the two poles of what Sutton & Barto Sutton & Barto (2018) call generalized policy iteration: any interleaving of “make the value consistent with the policy” and “make the policy greedy for the value” converges to the same fixed point. Week 2 studies what happens between the poles — asynchronous, Gauss–Seidel, and prioritized sweeps.

The dynamic-programming bridge

The roadmap’s Week-1 writing prompt is to place the Bellman equation beside its continuous cousin. They are the same principle of optimality at different limits. Discretize time and state and the cost-to-go obeys the discrete Bellman recursion above; take the limit of vanishing time-step on a continuous state and the same recursion becomes the Hamilton–Jacobi–Bellman equation. Control flips the convention from maximizing reward to minimizing a running cost =r\ell = -\reward, with cost-to-go JJ; the stationary discounted HJB equation is

ρJ(x)=mina[(x,a)+J(x) ⁣f(x,a)],ρ=lnγ,\rho\, J(x) = \min_{a}\Big[\, \ell(x,a) + \nabla J(x)^{\!\top} f(x,a) \,\Big], \qquad \rho = -\ln\discount,

whose ρJ\rho J term is the continuous-time echo of the discount γ\discount that drove the contraction above (set ρ=0\rho = 0 for the undiscounted limit). Value iteration is the fixed-point solver for the discrete equation; the LQR Riccati recursion (Ch. 13) is the closed-form solver for the HJB equation when ff is linear and \ell is quadratic. Holding this identity in view is the point of the whole curriculum: control theory and RL are two dialects for the same fixed point.

What’s next

  • Week 2 opens up the iteration itself: asynchronous and Gauss–Seidel value iteration, prioritized sweeping, real-time DP, and residual scheduling — the practical face of “iterate the contraction.”
  • Week 3 replaces the known model p\transition with sampled returns (Monte Carlo), the first step away from dynamic programming toward learning.

Exercises

  1. (Prove) Show that the Bellman expectation operator Tπ\bellman^\policy is a γ\discount-contraction in \norm{\cdot}_\infty. (This is the no-max\max case of Theorem 1.3.)

    Solution

    For any ss, (Tπu)(s)(Tπv)(s)=γspπ(ss)(u(s)v(s))(\bellman^\policy u)(s) - (\bellman^\policy v)(s) = \discount \sum_{s'} \transition_\policy(s'\mid s)\,(u(s') - v(s')) since rπ(s)\reward_\policy(s) cancels. Taking absolute values and using the triangle inequality with spπ(ss)=1\sum_{s'}\transition_\policy(s'\mid s)=1, (Tπu)(s)(Tπv)(s)γspπ(ss)uv=γuv\lvert(\bellman^\policy u)(s) - (\bellman^\policy v)(s)\rvert \le \discount \sum_{s'}\transition_\policy(s'\mid s)\norm{u-v}_\infty = \discount\norm{u-v}_\infty. Maximizing over ss gives the claim. (No max\max-nonexpansiveness step is needed, because Tπ\bellman^\policy is affine.)

  2. (Derive) State and prove the Bellman expectation equation for the action-value qπ(s,a)\qfn_\policy(s,a).

    Solution

    Conditioning on the first transition and then the next action,

    qπ(s,a)=s,rp(s,rs,a)[r+γaπ(as)qπ(s,a)].\qfn_\policy(s,a) = \sum_{s',r}\transition(s',r\mid s,a)\Big[\,r + \discount \sum_{a'}\policy(a'\mid s')\,\qfn_\policy(s',a')\,\Big].

    The derivation is identical to Theorem 1.1 with the first action held fixed at aa and the recursion closing on qπ\qfn_\policy via vπ(s)=aπ(as)qπ(s,a)\valuefn_\policy(s') = \sum_{a'}\policy(a'\mid s')\qfn_\policy(s',a').

  3. (Compute) Take the two-state MDP S={A,B}\statespace=\{A,B\}, actions {stay,switch}\{\text{stay},\text{switch}\}, γ=0.9\discount=0.9, with deterministic dynamics: in AA, stay A\to A (r=0r=0) and switch B\to B (r=0r=0); in BB, stay B\to B (r=1r=1) and switch A\to A (r=0r=0). Evaluate the always-stay policy by solving (IγPπ)vπ=rπ(I-\discount P_\policy)\valuefn_\policy = \reward_\policy.

    Solution

    Under always-stay, Pπ=IP_\policy = I and rπ=(0,1)\reward_\policy = (0, 1), so (I0.9I)vπ=(0,1)(I - 0.9 I)\valuefn_\policy = (0,1) gives 0.1vπ=(0,1)0.1\,\valuefn_\policy = (0,1), i.e. vπ=(0,10)\valuefn_\policy = (0, 10). State BB self-loops collecting 11 forever (1/(1γ)=101/(1-\discount)=10); state AA self-loops collecting 00. This is the exact case the companion test asserts.

  4. (Prove) Derive the geometric error bound of Corollary 1.1(2), VkvγkV0v\norm{V_k - \optvaluefn}_\infty \le \discount^{k}\norm{V_0 - \optvaluefn}_\infty, from the contraction property and Tv=v\bellmanopt\optvaluefn=\optvaluefn.

    Solution

    Vkv=TVk1TvγVk1v\norm{V_k - \optvaluefn}_\infty = \norm{\bellmanopt V_{k-1} - \bellmanopt \optvaluefn}_\infty \le \discount\norm{V_{k-1}-\optvaluefn}_\infty by Theorem 1.3 and the fixed-point identity. Iterating the inequality kk times gives the bound.

  5. (Implement) Add the certified stopping rule to value iteration: stop when ΔV<ε(1γ)/γ\norm{\Delta V}_\infty < \varepsilon(1-\discount)/\discount and verify empirically that the returned VV satisfies Vv<ε\norm{V-\optvaluefn}_\infty < \varepsilon on the two-state MDP (using the analytic v=(9,10)\optvaluefn=(9,10)). Run against experiments/python/week01/dp.py.

    Solution

    The bound follows by writing Vkv=jk(Vj+1Vj)V_k - \optvaluefn = \sum_{j\ge k}(V_{j+1}-V_j) and bounding each term by the contraction: Vj+1VjγjkVkVk1\norm{V_{j+1}-V_j}_\infty \le \discount^{\,j-k}\norm{V_{k}-V_{k-1}}_\infty, a geometric series summing to γ1γΔV\frac{\discount}{1-\discount}\norm{\Delta V}_\infty. The companion’s value_iteration(..., tol=ε) implements this and its test asserts the resulting error is below ε\varepsilon.

  6. (Extend) Show the optimal value function is the solution of the linear program minvsv(s)\min_{v}\sum_s v(s) subject to v(s)r(s,a)+γsp(ss,a)v(s)v(s) \ge \reward(s,a) + \discount \sum_{s'}\transition(s'\mid s,a)v(s') for all s,as,a.

    Solution

    Feasibility means vTvv \ge \bellmanopt v pointwise; by monotonicity of T\bellmanopt this implies vvv \ge \optvaluefn, so v\optvaluefn is the smallest feasible point and minimizing sv(s)\sum_s v(s) selects it. The constraints linearize the max\max (one inequality per action), turning the nonlinear optimality equation into an LP — the basis of the dual/occupancy-measure view revisited with safe RL in Week 25.

Companion code

The Week-1 companion lives at experiments/python/week01/ (the repo’s three-language convention). It is pure NumPy at the core — the algorithms and the correctness test carry no environment dependency — with an optional Gymnasium adapter for the canonical FrozenLake showcase.

  • dp.py — the Bellman optimality operator as an explicit function, plus value_iteration, policy_evaluation (exact linear solve), policy_improvement, and policy_iteration, all on a generic finite MDP (P, R, gamma).
  • test_dp.py — mathematical-correctness tests: convergence to the analytic v=(9,10)\optvaluefn=(9,10) on the two-state MDP of Exercise 3, the contraction inequality of Theorem 1.3, the fixed-point identity, the geometric bound, and VI/PI agreement (FrozenLake checks run only when gymnasium is installed).
  • frozenlake.py — builds (P, R, gamma) from gymnasium’s FrozenLake-v1 and runs VI and PI as a worked showcase.
# core algorithms + correctness tests (no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week01/test_dp.py -q

# canonical FrozenLake showcase (optional extra: pip install "gymnasium")
PYTHONPATH=. python experiments/python/week01/frozenlake.py
Part I · Foundations Week 2 Published async_dp.py test_async_dp.py

Asynchronous and Prioritized Dynamic Programming

Keeping the gamma-contraction but varying the schedule: asynchronous and Gauss–Seidel value iteration, prioritized sweeping, and real-time DP. Why the order of backups is free — monotonicity plus a constant-shift identity — and how update order and residual size set the practical convergence rate.

Asynchronous and Prioritized Dynamic Programming

Where we are. Chapter 1 proved that value iteration and policy iteration both reach the optimal value function because the Bellman operators are γ\discount-contractions. Both methods share one extravagance: every sweep backs up every state, in lockstep, whether or not that state’s value has stopped moving. This chapter keeps the contraction and varies only the schedule — which states we back up, in what order, and using whose values. The load-bearing claim is that the order is nearly free: as long as every state is updated infinitely often, asynchronous dynamic programming converges to the same fixed point v\optvaluefn, and ordering the updates well — Gauss–Seidel, prioritized sweeping, real-time DP — buys large savings without touching the convergence theory.

Generalized policy iteration: the space between the poles

Chapter 1 closed by naming value iteration and policy iteration the two poles of generalized policy iteration (GPI): VI applies one optimality backup everywhere and reads off a greedy policy at the end; PI evaluates a policy exactly (a linear solve) and then improves it greedily. The methods in between truncate the evaluation. Modified policy iteration Puterman (1994) runs mm sweeps of the expectation operator Tπ\bellman^\policy in place of the exact solve: m=1m=1 recovers value iteration, mm\to\infty recovers policy iteration, and intermediate mm trades evaluation cost against the number of outer improvements. Sutton & Barto Sutton & Barto (2018) frame the entire family as two interacting processes — make the value consistent with the policy, make the policy greedy for the value — that converge to the same fixed point regardless of how finely they are interleaved.

This chapter relaxes a different lockstep: not how much we evaluate between improvements, but which states we touch on each pass and in what order.

Asynchronous value iteration

Synchronous value iteration computes the whole vector TVk\bellmanopt V_k from VkV_k and only then overwrites — every state is backed up from the same old values. Asynchronous value iteration drops that barrier.

Definition 2.1 (Asynchronous value iteration).

Asynchronous value iteration maintains a single value vector VV and repeats: select a state ss and overwrite

V(s)    (TV)(s)=maxa[r(s,a)+γsp(ss,a)V(s)],V(s) \;\leftarrow\; (\bellmanopt V)(s) = \max_{a}\Big[\reward(s,a) + \discount \sum_{s'}\transition(s'\mid s,a)\,V(s')\Big],

using the current entries of VV — including any updated earlier in the pass. States may be visited in any order, subject only to the fairness condition that every state is selected infinitely often. The special case that sweeps the states in a fixed order 1,2,,S1,2,\dots,\lvert\statespace\rvert, each backup seeing the freshly written values of its predecessors, is Gauss–Seidel value iteration.

The synchronous operator of Chapter 1 reuses no in-pass information; Gauss–Seidel reuses all of it; general asynchronous iteration lives anywhere between. The question is whether dropping the barrier costs us the convergence guarantee. It does not, and the reason is two structural facts that cost one line each.

Why the order is free: monotonicity and the constant-shift identity

Proposition 2.1 (Monotonicity and constant-shift).

Let TT be either Bellman operator (Tπ\bellman^\policy or T\bellmanopt). For all u,vRSu,v\in\R^{\statespace} and every constant cRc\in\R, writing 1\mathbf 1 for the all-ones vector:

  1. Monotonicity. If uvu \le v componentwise, then TuTvTu \le Tv.
  2. Constant-shift. T(v+c1)=Tv+γc1T(v + c\,\mathbf 1) = Tv + \discount c\,\mathbf 1.
Proof.

Both read off the backup Qv(s,a)r(s,a)+γsp(ss,a)v(s)Q_v(s,a) \defeq \reward(s,a) + \discount\sum_{s'} \transition(s'\mid s,a)\,v(s'), of which (Tv)(s)=maxaQv(s,a)(\bellmanopt v)(s)=\max_a Q_v(s,a) and (Tπv)(s)=aπ(as)Qv(s,a)(\bellman^\policy v)(s)=\sum_a\policy(a\mid s)Q_v(s,a).

Monotonicity. If uvu\le v then sp(ss,a)u(s)sp(ss,a)v(s)\sum_{s'}\transition(s'\mid s,a)u(s') \le \sum_{s'}\transition(s'\mid s,a)v(s') because the transition weights are nonnegative; hence Qu(s,a)Qv(s,a)Q_u(s,a)\le Q_v(s,a) for every (s,a)(s,a), and taking the max\max (or the π\policy-average) over aa preserves the inequality.

Constant-shift. Since sp(ss,a)=1\sum_{s'}\transition(s'\mid s,a)=1, replacing vv by v+c1v+c\,\mathbf 1 adds exactly γc\discount c to every Qv(s,a)Q_v(s,a); and maxa(xa+γc)=maxaxa+γc\max_a(x_a+\discount c)=\max_a x_a+\discount c (likewise for the average). \qquad\blacksquare

These two facts — and not the full contraction algebra — are what make asynchronous order irrelevant.

Theorem 2.1 (Asynchronous convergence).

Run asynchronous value iteration (Def. 2.1) from any V0V_0, updating every state infinitely often. Then VkvV_k \to \optvaluefn. Moreover, grouping the schedule into rounds — a round ends once every state has been backed up at least once since the round began — the sup-norm error contracts by a factor γ\discount per round:

Vafter round mvγmV0v.\norm{V_{\text{after round } m} - \optvaluefn}_\infty \le \discount^{m}\, \norm{V_0 - \optvaluefn}_\infty .
Proof.

Let cV0vc \defeq \norm{V_0-\optvaluefn}_\infty, so the iterate starts inside the box vc1V0v+c1\optvaluefn - c\,\mathbf 1 \le V_0 \le \optvaluefn + c\,\mathbf 1. Suppose at some moment VV lies in this box. Back up any state ss. By monotonicity and then constant-shift (Prop. 2.1), using Tv=v\bellmanopt\optvaluefn=\optvaluefn,

(TV)(s)(T(v+c1))(s)=v(s)+γc,(TV)(s)v(s)γc.(\bellmanopt V)(s) \le \big(\bellmanopt(\optvaluefn + c\,\mathbf 1)\big)(s) = \optvaluefn(s) + \discount c, \qquad (\bellmanopt V)(s) \ge \optvaluefn(s) - \discount c .

So the updated entry lands in the smaller box [v(s)γc,v(s)+γc][v(s)c,v(s)+c][\optvaluefn(s)-\discount c,\, \optvaluefn(s)+\discount c]\subseteq[\optvaluefn(s)-c,\,\optvaluefn(s)+c] (as γ<1\discount<1). Two consequences: the iterate never leaves the original cc-box, and every state, once backed up, sits in the γc\discount c-box. Crucially, later backups in the round still read values inside the original cc-box, so they too map into the γc\discount c-box. Hence after one full round Vvγc\norm{V-\optvaluefn}_\infty\le\discount c. Applying the argument round by round gives γmc\discount^{m}c after mm rounds; fairness guarantees infinitely many rounds, so VkvV_k\to\optvaluefn. \qquad\blacksquare

Synchronous value iteration is the special case in which a round is a single simultaneous sweep, and the bound collapses to Chapter 1’s VkvγkV0v\norm{V_k-\optvaluefn}_\infty\le\discount^{k}\norm{V_0-\optvaluefn}_\infty. The abstract version — monotone contraction operators converge under any fair asynchronous schedule — is the backbone of Bertsekas’s treatment of dynamic programming Bertsekas (2017) .

Gauss–Seidel sweeps and the numerical-analysis analogy

The names are borrowed deliberately. Solving a linear system Ax=bAx=b by splitting AA and iterating, Jacobi updates every coordinate from the previous iterate (a synchronous sweep), while Gauss–Seidel updates coordinate ii using the already-updated coordinates 1,,i11,\dots,i-1 — in place, no second buffer.

Value iteration is the nonlinear analogue: synchronous VI is Jacobi, Gauss–Seidel VI is Gauss–Seidel. The in-place version typically converges in fewer sweeps because information propagates across the state space within a single pass rather than waiting a full sweep per hop, and it halves the memory (no separate Vk+1V_{k+1} buffer). Theorem 2.1 already covers it: a left-to-right sweep updates every state once, so it is exactly one round.

The order within a sweep is now a design lever. If states are numbered along the direction reward information flows — outward from a goal, say — a single Gauss–Seidel sweep can propagate values across the whole space, whereas the reverse order wastes most of the pass. This is the seed of prioritizing the order rather than fixing it.

The Bellman residual and residual scheduling

How do we know where values are still moving? The same quantity that certifies stopping. Recall from Chapter 1 the local Bellman residual at a state, the amount one more backup would change it,

Δ(s)(TV)(s)V(s),\Delta(s) \defeq \big\lvert (\bellmanopt V)(s) - V(s) \big\rvert ,

and that the global residual TVV\norm{\bellmanopt V - V}_\infty controls the true error through Vvγ1γTVV\norm{V-\optvaluefn}_\infty \le \frac{\discount}{1-\discount} \norm{\bellmanopt V - V}_\infty. A uniform sweep spends equal effort on states with Δ(s)0\Delta(s)\approx 0 (already converged) and on states with large Δ(s)\Delta(s) (still far off). Residual scheduling is the obvious correction: back up states in decreasing order of Δ(s)\Delta(s), and skip states whose residual sits below a tolerance θ\theta. Asynchronous convergence (Thm. 2.1) licenses any such order as long as no state is starved.

Prioritized sweeping

Prioritized sweeping Moore & Atkeson (1993) makes residual scheduling precise with a priority queue, and adds the key idea that a backup only changes a state’s predecessors, so only their priorities need refreshing.

initialize V ← 0;  priority queue PQ keyed by Bellman residual
seed PQ with every state s at priority |(𝒯* V)(s) − V(s)|
repeat:
    s ← PQ.pop_max()                      # largest-residual state
    V(s) ← (𝒯* V)(s)                       # one optimality backup
    for each (s̄, a) with p(s | s̄, a) > 0:  # predecessors that feed s
        e ← |(𝒯* V)(s̄) − V(s̄)|             # their refreshed residual
        if e > θ:  PQ.push(s̄, priority = e)
until PQ empty                            # every residual below θ

On a sparse-reward problem the queue concentrates work near the frontier where values are actually changing, leaving the converged interior untouched; Moore and Atkeson report order-of-magnitude reductions in backups to reach a target accuracy on large MDPs. The method is exact-model dynamic programming — it needs p\transition to form both the backup and the predecessor sets — but it is the direct ancestor of the sampled prioritized-replay schemes that return in Week 6.

Real-time dynamic programming

Prioritized sweeping orders updates by residual; real-time dynamic programming (RTDP) orders them by reachability Barto et al. (1995) . Instead of enumerating states, RTDP backs up only the states it actually visits while acting greedily, interleaving computation with control.

repeat for each trial:
    s ← start state
    while s is not terminal:
        V(s) ← (𝒯* V)(s)               # back up the current state
        a ← greedy action at s w.r.t. V
        s ← sample s' ∼ p(· | s, a)     # follow the greedy trajectory

Because backups concentrate on the states reachable under good policies, RTDP can solve problems whose full state space is far too large to sweep, converging on the relevant subset without ever touching the rest. It is also the conceptual hinge to model-free learning: replace the expectation sp(ss,a)\sum_{s'}\transition(s'\mid s,a) in the backup with the single sampled successor, and the deterministic backup becomes a stochastic one — RTDP’s Monte-Carlo cousin is Q-learning, which Weeks 3–4 build from exactly this move.

Counting the work: iterations and samples

How many backups does the discount cost us? For synchronous value iteration the geometric bound answers it directly.

Proposition 2.2 (Iteration complexity of value iteration).

Starting from V0=0V_0=0, synchronous value iteration reaches Vkvε\norm{V_k-\optvaluefn}_\infty\le\varepsilon after

k    ln ⁣(Rmax/[ε(1γ)])ln(1/γ)  =  O ⁣(11γln1ε)k \;\ge\; \frac{\ln\!\big(R_{\max}\,/\,[\varepsilon(1-\discount)]\big)}{\ln(1/\discount)} \;=\; O\!\Big(\frac{1}{1-\discount}\,\ln\frac{1}{\varepsilon}\Big)

sweeps, where Rmax=maxs,ar(s,a)R_{\max}=\max_{s,a}\lvert\reward(s,a)\rvert.

Proof.

With V0=0V_0=0, Chapter 1’s Corollary 1.1 gives VkvγkvγkRmax/(1γ)\norm{V_k-\optvaluefn}_\infty \le \discount^{k}\norm{\optvaluefn}_\infty \le \discount^{k} R_{\max}/(1-\discount), using the return bound vRmax/(1γ)\norm{\optvaluefn}_\infty\le R_{\max}/(1-\discount). Requiring the right side ε\le\varepsilon and taking logs gives the displayed kk. The order form uses ln(1/γ)1γ\ln(1/\discount)\ge 1-\discount for γ(0,1)\discount\in(0,1), so 1/ln(1/γ)1/(1γ)1/\ln(1/\discount)\le 1/(1-\discount). \qquad\blacksquare

The count is only logarithmic in the accuracy ε\varepsilon but linear in the effective horizon 1/(1γ)1/(1-\discount) — push γ\discount toward 11 and the work grows without bound. Tseng Tseng (1990) made this precise, showing stationary discounted MDPs are solved in time proportional to log\log of the horizon. The same 1/(1γ)1/(1-\discount) governs the sample cost once the model is unknown: given only a generative model (a sampler for p\transition), estimating an ε\varepsilon-optimal value needs O~ ⁣(SA(1γ)3ε2)\tilde O\!\big(\lvert\statespace\rvert\lvert\actionspace\rvert (1-\discount)^{-3}\varepsilon^{-2}\big) samples, a bound matched from both sides by Azar et al. Azar et al. (2013) and achieved in near-optimal time by the variance-reduced value iteration of Sidford et al. Sidford et al. (2018) . That cubic blow-up in the horizon is precisely the pressure that drives the curriculum out of the exact-model regime and into sampling (Week 3) and function approximation (Week 5).

The dynamic-programming bridge

Asynchronous DP is where the numerical-analysis reading of value iteration becomes literal. A value function is a point in RS\R^{\statespace}; the Bellman operator is a nonlinear map whose fixed point we chase; and the Jacobi / Gauss–Seidel / residual-ordered hierarchy is the same one used to solve large sparse linear systems, transplanted to a contraction that happens to involve a max\max. Two threads carry forward:

  • To online control (MPC, Week 15). RTDP’s instinct — compute only over the states you will actually visit, refreshing as you go — is the tabular shadow of model predictive control, which re-solves a short-horizon optimal-control problem from the current state at every step instead of storing v\optvaluefn everywhere.
  • To model-free RL (Weeks 3–6). Replace the model expectation in a backup with a sampled successor and asynchronous DP becomes temporal-difference learning; prioritized sweeping becomes prioritized experience replay. The schedule ideas of this chapter reappear, now applied to which transitions to learn from.

What’s next

  • Week 3 removes the known model entirely: values are estimated from sampled returns (Monte Carlo), trading the model expectation for an empirical average and introducing the bias–variance questions that dominate the rest of Part I.
  • Week 4 fuses the two views — bootstrapping like DP, sampling like Monte Carlo — into temporal-difference learning, the stochastic-approximation form of the asynchronous backup above.

Exercises

  1. (Prove) Show that the Bellman optimality operator satisfies the constant-shift identity T(v+c1)=Tv+γc1\bellmanopt(v+c\,\mathbf 1)=\bellmanopt v+\discount c\,\mathbf 1 for every cRc\in\R (Prop. 2.1(2)), and explain where the proof of Theorem 2.1 uses it.

    Solution

    For each (s,a)(s,a), sp(ss,a)(v(s)+c)=sp(ss,a)v(s)+c\sum_{s'}\transition(s'\mid s,a)\,(v(s')+c)=\sum_{s'} \transition(s'\mid s,a)v(s')+c since the transition row sums to 11, so every Qv(s,a)Q_v(s,a) rises by γc\discount c; the per-state maxa\max_a then rises by γc\discount c. Theorem 2.1 uses it to evaluate T(v±c1)=v±γc1\bellmanopt(\optvaluefn\pm c\,\mathbf 1)=\optvaluefn\pm\discount c\,\mathbf 1, which shrinks the bounding box by a factor γ\discount each round.

  2. (Derive) Argue that one full Gauss–Seidel sweep is one “round” in the sense of Theorem 2.1, and conclude that Gauss–Seidel value iteration converges at least as fast (per sweep) as synchronous value iteration.

    Solution

    A left-to-right sweep updates each state exactly once, so by the end every state has been backed up since the sweep began — a round — and Theorem 2.1 gives VvγVprev\norm{V-\optvaluefn}_\infty\le\discount\norm{V_{\text{pre}}-\optvaluefn}_\infty. Synchronous VI achieves the same per-sweep factor but reads only stale values; Gauss–Seidel additionally uses freshly updated predecessors within the sweep, so its actual contraction per sweep is no worse and usually strictly better.

  3. (Compute) A backup at state ss changes the residual of which other states? Give the set in terms of p\transition, and explain why prioritized sweeping only re-queues those.

    Solution

    Only the predecessors {sˉ:a, p(ssˉ,a)>0}\{\bar s : \exists a,\ \transition(s\mid\bar s,a)>0\}: a state sˉ\bar s‘s backup depends on V(s)V(s) only if some action from sˉ\bar s can reach ss. Every other state’s residual is unchanged by overwriting V(s)V(s), so re-evaluating it would be wasted work — hence the predecessor loop in the prioritized-sweeping inner step.

  4. (Implement) In the companion, run synchronous VI, Gauss–Seidel VI, and prioritized sweeping on the stochastic gridworld and verify (i) all three reach the policy-iteration value v\optvaluefn, and (ii) to reach a fixed target accuracy ε\varepsilon — set the prioritized-sweeping threshold θ=ε(1γ)/γ\theta=\varepsilon(1-\discount)/\discount — prioritized sweeping spends fewer state-backups than the synchronous sweep count S×(sweeps)\lvert\statespace\rvert\times(\text{sweeps}).

    Solution

    See experiments/python/week02/test_async_dp.py: it asserts each method matches dp.policy_iteration, and that at ε=103\varepsilon=10^{-3} the prioritized-sweeping backup count is below the synchronous total. The win is in reaching useful accuracy fast — the queue concentrates on high-residual states — and it widens with S\lvert\statespace\rvert; the edge narrows as ε\varepsilon approaches machine precision, where residuals near the fixed point thrash the queue.

  5. (Extend) Perturb the gridworld’s transition model (raise the slip probability) and measure how the optimal policy and v\optvaluefn change. Relate the sensitivity to the 1/(1γ)1/(1-\discount) factor in Proposition 2.2.

    Solution

    A model error of size δ\delta in p\transition perturbs the fixed point by O ⁣(γδ/(1γ))O\!\big(\discount\delta/(1-\discount)\big) (a simulation-lemma bound — a perturbation bound on the DP fixed point): the same effective horizon that sets the iteration count also amplifies model error, which is the quantitative reason model misspecification hurts more as γ1\discount\to1. The companion’s --slip sweep exhibits the effect.

Companion code

The Week-2 companion lives at experiments/python/week02/ and reuses the Week-1 primitives — it imports dp.py for action_values, bellman_optimality_operator, value_iteration, and policy_iteration rather than reimplementing them, so the asynchronous methods are checked against the same reference fixed point.

  • gridworld.py — builds a parametric stochastic gridworld as a generic (P, R, gamma) MDP in the dp.py representation: a goal cell, a per-step cost, and a slip probability that randomizes the intended move. Pure NumPy.
  • async_dp.py — Gauss–Seidel (in-place) value iteration and model-based prioritized sweeping with an explicit predecessor index and a backup counter, both built on the dp.py operators.
  • test_async_dp.py — mathematical-correctness tests: Gauss–Seidel and prioritized sweeping converge to the same v\optvaluefn as value_iteration and policy_iteration (from arbitrary starts); the constant-shift and monotonicity properties of Proposition 2.1 hold numerically; and prioritized sweeping uses fewer backups than a synchronous sweep schedule to reach a target accuracy on a sparse-reward grid.
# core algorithms + correctness tests (pure NumPy, no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week02/test_async_dp.py -q

# worked comparison + optional value/residual heatmaps (saved locally, not committed)
PYTHONPATH=. python experiments/python/week02/async_dp.py --plot
Part I · Foundations Week 3 Published mc.py test_mc.py

Monte Carlo Methods

Estimating value from sampled returns when the model is unknown: first-visit Monte Carlo prediction, Monte Carlo control by generalized policy iteration, and off-policy learning by importance sampling. Monte Carlo estimation as quadrature — unbiased, model-free, and high-variance, with the fragility of off-policy correction.

Monte Carlo Methods

Where we are. Chapters 1 and 2 assumed the model p\transition was known and computed value functions by iterating the Bellman operator. This chapter takes the first step of reinforcement learning proper: the model is unknown, and value is estimated from sampled experience. The load-bearing shift is one substitution — replace the model expectation inside the value definition with an empirical average over complete sampled returns. The estimate is unbiased and needs no model, but it pays for that in variance and requires episodic tasks that terminate.

From a known model to sampled returns

The definition of the state-value function (Chapter 1) is an expectation over trajectories,

vπ(s)=Eπ ⁣[GtSt=s],Gt=k=0Tt1γkRt+k+1,\valuefn_\policy(s) = \E_\policy\!\big[\, G_t \,\big|\, S_t = s \,\big], \qquad G_t = \sum_{k=0}^{T-t-1} \discount^{k} R_{t+k+1},

where an episode runs from tt to a terminal time TT. Dynamic programming never sampled GtG_t; it used the model to turn this expectation into the Bellman recursion. Monte Carlo does the opposite: it leaves the expectation alone and estimates it the way one estimates any expectation without a closed form — draw samples and average.

That reframing — value estimation as quadrature — sets the terms for the whole chapter. The sample mean of returns is unbiased regardless of dimension, and its error falls like 1/N1/\sqrt{N} in the number of episodes NN; the variance σ2\sigma^2 of the return is the price of having no model. Two structural requirements follow: episodes must terminate (an infinite return has no sample), and estimating vπ\valuefn_\policy requires acting under π\policy — or correcting for the fact that we did not, which is the off-policy problem below.

First-visit Monte Carlo prediction

To estimate vπ\valuefn_\policy, generate episodes under π\policy and, for each state, average the returns that followed visits to it. The first-visit variant averages only the return following the first time each state is reached in an episode, which gives one independent draw per episode (every-visit MC reuses correlated within-episode returns).

Definition 3.1 (First-visit Monte Carlo estimator).

Run NN episodes under π\policy. For state ss, let I(s)\mathcal{I}(s) index the episodes in which ss is visited, and for episode iI(s)i \in \mathcal{I}(s) let G(i)(s)G^{(i)}(s) be the return from the first visit to ss. The first-visit Monte Carlo estimate is the sample mean

VN(s)1I(s)iI(s)G(i)(s).V_N(s) \defeq \frac{1}{\lvert\mathcal{I}(s)\rvert} \sum_{i \in \mathcal{I}(s)} G^{(i)}(s).
Proposition 3.1 (Unbiasedness and consistency).

Each first-visit return G(i)(s)G^{(i)}(s) is an independent sample of (GtSt=s)(G_t \mid S_t = s) under π\policy. Hence VN(s)V_N(s) is unbiased, E[VN(s)]=vπ(s)\E[V_N(s)] = \valuefn_\policy(s), and by the strong law of large numbers VN(s)vπ(s)V_N(s) \to \valuefn_\policy(s) almost surely as I(s)\lvert\mathcal{I}(s)\rvert \to \infty.

Proof.

Fix ss. In each episode the first visit to ss occurs at some time tt, and the return from that point, G(i)(s)G^{(i)}(s), is by construction a draw of the random variable GtG_t conditioned on St=sS_t = s and on following π\policy thereafter — so its expectation is exactly vπ(s)\valuefn_\policy(s) by the definition above. Different episodes are generated independently, and using only the first visit means each episode contributes one draw that does not depend on the others (every-visit sampling would reuse correlated within-episode returns). The G(i)(s)G^{(i)}(s) are thus i.i.d. with mean vπ(s)\valuefn_\policy(s); a sample mean of i.i.d. draws is unbiased, and the strong law gives almost-sure convergence. \qquad\blacksquare

The estimator carries no bias and no model — it never references p\transition, only realized returns. Its weakness is variance: a single return aggregates all the randomness of a whole trajectory, so σ2\sigma^2 can be large and convergence is only 1/N1/\sqrt{N}. Sutton & Barto Sutton & Barto (2018) treat first- and every-visit MC and the bias each incurs; every-visit MC is biased in finite samples (its within-episode returns overlap, so the averaged samples are correlated, though the bias vanishes as NN\to\infty) but also consistent, and is often simpler to implement.

Monte Carlo control

Prediction estimates vπ\valuefn_\policy; control seeks v\optvaluefn. Monte Carlo control is generalized policy iteration (Chapter 2) with the evaluation step done by sampling: estimate the action-value qπ\qfn_\policy from returns, then improve the policy greedily, π(s)=arg maxaqπ(s,a)\policy'(s) = \argmax_a \qfn_\policy(s,a).

The catch is exploration. With no model, an action that is never tried has no return to average, so its value is unknown and greedy improvement can lock onto a wrong choice. Two standard fixes guarantee every action keeps being sampled:

  • Exploring starts — begin episodes from a random state–action pair, so every (s,a)(s,a) seeds infinitely many returns.
  • ε\varepsilon-soft policies — keep the behaviour policy stochastic (π(as)ε/A\policy(a\mid s) \ge \varepsilon/\lvert\actionspace\rvert for all aa), so no action is ever starved; improvement then converges to the best ε\varepsilon-soft policy rather than the unconstrained optimum.

Either way the GPI logic of Chapter 1 carries over — evaluate, improve, repeat — with the policy-improvement theorem still guaranteeing monotone improvement at each step that uses an accurate q\qfn estimate. The convergence of exploring-starts MC control is taken as a working assumption here (it is not fully settled in general); Sutton & Barto Sutton & Barto (2018) discuss the subtlety.

Off-policy prediction and importance sampling

Often we must estimate vπ\valuefn_\policy for a target policy π\policy while the data was generated by a different behaviour policy bb — for instance to evaluate a greedy policy from exploratory data. Naively averaging returns from bb estimates vb\valuefn_b, not vπ\valuefn_\policy. Importance sampling corrects the distribution mismatch by reweighting each return.

Definition 3.2 (Importance-sampling ratio and estimators).

For a trajectory from tt to termination TT, the importance-sampling ratio is the likelihood ratio of the action choices (the dynamics cancel, being shared),

ρt:T1k=tT1π(AkSk)b(AkSk).\rho_{t:T-1} \defeq \prod_{k=t}^{T-1} \frac{\policy(A_k \mid S_k)}{b(A_k \mid S_k)} .

Over the first-visit returns G(i)G^{(i)} with ratios ρ(i)\rho^{(i)}, the ordinary and weighted importance-sampling estimators of vπ(s)\valuefn_\policy(s) are

VNord(s)=1I(s)iρ(i)G(i),VNwt(s)=iρ(i)G(i)iρ(i).V^{\text{ord}}_N(s) = \frac{1}{\lvert\mathcal{I}(s)\rvert}\sum_{i} \rho^{(i)} G^{(i)}, \qquad V^{\text{wt}}_N(s) = \frac{\sum_{i} \rho^{(i)} G^{(i)}}{\sum_{i} \rho^{(i)}} .
Proposition 3.2 (Ordinary IS is unbiased; weighted IS is consistent).

Assuming coverage (b(as)>0b(a\mid s) > 0 whenever π(as)>0\policy(a\mid s) > 0), ordinary importance sampling is unbiased, Eb[ρt:T1GtSt=s]=vπ(s)\E_b[\rho_{t:T-1} G_t \mid S_t = s] = \valuefn_\policy(s). Weighted importance sampling is biased in finite samples but consistent (VNwtvπV^{\text{wt}}_N \to \valuefn_\policy), and typically has far lower variance.

Proof.

The probability of a trajectory’s action sequence under π\policy equals ρt:T1\rho_{t:T-1} times its probability under bb, because the environment factors p(s,rs,a)\transition(s',r\mid s,a) appear identically in both and cancel — only the policy factors survive in the ratio. This is a change of measure: for any function ff of the trajectory, Eb[ρt:T1f]=Eπ[f]\E_b[\rho_{t:T-1}\, f] = \E_\policy[f]. Taking f=Gtf = G_t gives Eb[ρt:T1GtSt=s]=Eπ[GtSt=s]=vπ(s)\E_b[\rho_{t:T-1} G_t \mid S_t = s] = \E_\policy[G_t\mid S_t=s] = \valuefn_\policy(s), so the ordinary estimator (a sample mean of ρ(i)G(i)\rho^{(i)}G^{(i)}) is unbiased. The weighted estimator divides by iρ(i)\sum_i \rho^{(i)} rather than the count; as a ratio of two correlated sample means it is biased at finite NN, but both means converge (1Nρ(i)G(i)vπ\frac1N\sum\rho^{(i)}G^{(i)}\to\valuefn_\policy and 1Nρ(i)1\frac1N\sum\rho^{(i)}\to 1), so the ratio converges to vπ(s)\valuefn_\policy(s). \qquad\blacksquare

The two estimators sit at opposite ends of the bias–variance axis, and the variance end is where off-policy Monte Carlo becomes fragile. The ratio ρt:T1\rho_{t:T-1} is a product over the episode; if π\policy and bb differ enough — or the horizon is long — the product swings across orders of magnitude, and the variance of ordinary IS can be unbounded: a single rare trajectory with a huge ratio dominates the average.

Weighted IS caps this — its estimate can never exceed the largest observed return — trading a vanishing bias for a decisive variance reduction, which is why it is the practical default. Both degrade as the horizon grows and more ratio factors multiply in; this fragility is a major reason the field leans on the lower-variance, bootstrapped methods of Week 4.

The dynamic-programming bridge

Monte Carlo and dynamic programming estimate the same object vπ\valuefn_\policy from opposite information. DP needs the model and bootstraps — each value is written in terms of other current estimates — giving low variance but model dependence and bias while the estimates are wrong. MC needs only the ability to sample episodes and never bootstraps — each value is an average of full returns — giving no model dependence and no bias but high variance, and only for terminating tasks. Plotted on two axes, bootstrapping and sampling, DP samples nothing and bootstraps fully; MC samples fully and bootstraps not at all. Week 4’s temporal-difference learning is the missing corner: sample like MC, bootstrap like DP, and inherit a blend of their bias and variance.

What’s next

  • Week 4 introduces temporal-difference learning: replace the full sampled return GtG_t with the one-step bootstrap Rt+1+γV(St+1)R_{t+1} + \discount V(S_{t+1}), fusing Monte Carlo sampling with the dynamic-programming backup and removing the need to wait for an episode to terminate.
  • Week 5 confronts what happens when VV is a parametric approximation rather than a table, where bootstrapping and off-policy data interact dangerously.

Exercises

  1. (Prove) Show the first-visit Monte Carlo estimator is unbiased for vπ(s)\valuefn_\policy(s) for every fixed sample size, citing where independence across episodes is used (Prop. 3.1).

    Solution

    Each first-visit return G(i)(s)G^{(i)}(s) has E[G(i)(s)]=vπ(s)\E[G^{(i)}(s)] = \valuefn_\policy(s) by the definition of the value as the expected return from ss under π\policy. The estimator is the mean of I(s)\lvert\mathcal{I}(s)\rvert such draws, so by linearity E[VN(s)]=vπ(s)\E[V_N(s)] = \valuefn_\policy(s) — unbiased at any NN. Independence (one draw per episode, first visit only) is not needed for unbiasedness but is what makes the variance σ2/I(s)\sigma^2/\lvert\mathcal{I}(s)\rvert and licenses the strong law for consistency.

  2. (Derive) Starting from the trajectory likelihood under π\policy and bb, derive the importance-sampling ratio and show Eb[ρt:T1GtSt=s]=vπ(s)\E_b[\rho_{t:T-1} G_t \mid S_t=s] = \valuefn_\policy(s) (Prop. 3.2).

    Solution

    The probability of At,St+1,,STA_t,S_{t+1},\dots,S_T given St=sS_t=s factors as kπ-or-b(AkSk)p(Sk+1,Rk+1Sk,Ak)\prod_k \pi\text{-or-}b(A_k\mid S_k)\,\transition(S_{k+1},R_{k+1}\mid S_k,A_k). Dividing the π\policy-likelihood by the bb-likelihood, every p\transition factor cancels, leaving ρt:T1=k=tT1π(AkSk)/b(AkSk)\rho_{t:T-1} = \prod_{k=t}^{T-1}\policy(A_k\mid S_k)/b(A_k\mid S_k). Then Eb[ρt:T1GtSt=s]=trajb(traj)ρG=trajπ(traj)G=Eπ[GtSt=s]=vπ(s)\E_b[\rho_{t:T-1}G_t\mid S_t=s] = \sum_{\text{traj}} b(\text{traj})\,\rho\,G = \sum_{\text{traj}}\policy(\text{traj})\,G = \E_\policy[G_t\mid S_t=s]=\valuefn_\policy(s).

  3. (Compute) Target π\policy is greedy (prob. 1 on action aa^\star); behaviour bb is uniform over two actions. For an episode of length 3 that happens to take aa^\star at every step, compute ρ0:2\rho_{0:2}. What is ρ\rho if any step takes the non-target action?

    Solution

    Each on-target step contributes π/b=1/(1/2)=2\policy/b = 1/(1/2) = 2, so ρ0:2=23=8\rho_{0:2} = 2^3 = 8. If any step takes the non-target action, π=0\policy = 0 there, so ρ=0\rho = 0 — that trajectory contributes nothing, the discrete face of the variance problem (a few high-ratio trajectories carry the whole estimate).

  4. (Implement) On the companion’s random-walk MDP, verify first-visit MC converges to the analytic vπ\valuefn_\policy from dp.policy_evaluation, and that ordinary IS is unbiased while weighted IS has lower variance across seeds.

    Solution

    See experiments/python/week03/test_mc.py: it computes vπ\valuefn_\policy exactly with the Week-1 linear solve, then asserts first-visit MC matches it within a sampling tolerance, the ordinary-IS mean across many seeds is unbiased, and — when the behaviour policy under-samples the reward path (the heavy-tailed regime; under a mild mismatch ordinary IS is already low-variance) — the weighted-IS empirical variance is below the ordinary-IS variance.

  5. (Extend) Make the behaviour policy progressively worse (further from π\policy) and measure how ordinary- and weighted-IS variance grow. Relate the blow-up to the product structure of ρt:T1\rho_{t:T-1}.

    Solution

    As bb diverges from π\policy, individual step ratios stray from 11 and their product’s variance compounds geometrically in the horizon; ordinary-IS variance grows fastest (it can be unbounded), weighted-IS more slowly (bounded by the largest return). The companion’s --mismatch sweep exhibits the monotone growth.

Companion code

The Week-3 companion lives at experiments/python/week03/ and reuses the Week-1 linear solve (dp.policy_evaluation) as the exact oracle against which the sampled estimates are checked — the core suite has no environment dependency.

  • randomwalk.py — builds a small episodic random-walk MDP (terminal states at both ends) as a generic (P, R) array pair, so vπ\valuefn_\policy is available in closed form from dp.policy_evaluation.
  • mc.py — episode sampling from a generic (P, R, policy), first-visit MC prediction, and off-policy prediction with both ordinary and weighted importance sampling. Pure NumPy.
  • test_mc.py — statistical-correctness tests: first-visit MC converges to the analytic vπ\valuefn_\policy within a sampling tolerance; ordinary IS is unbiased across seeds; weighted IS has strictly lower empirical variance.
  • blackjack.py — the canonical Sutton & Barto Monte-Carlo showcase: MC control on Gymnasium’s Blackjack-v1 (optional extra; skipped when Gymnasium is absent).
# core algorithms + correctness tests (pure NumPy, no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week03/test_mc.py -q

# canonical Blackjack MC-control showcase (optional: pip install "gymnasium")
PYTHONPATH=. python experiments/python/week03/blackjack.py
Part I · Foundations Week 4 Published td.py test_td.py

Temporal-Difference Learning: TD(0), SARSA, and Q-Learning

Bootstrapping from sampled transitions: the TD(0) prediction update as a stochastic Euler step toward the Bellman fixed point, SARSA as on-policy control, and Q-learning as off-policy control. Where temporal-difference learning sits between Monte Carlo and dynamic programming, and what the SARSA/Q-learning split reveals on the cliff.

Temporal-Difference Learning: TD(0), SARSA, and Q-Learning

Where we are. Monte Carlo (Chapter 3) waited for a full episode to terminate, then averaged the realized return — unbiased, but high-variance and episodic-only. Dynamic programming (Chapters 1–2) never sampled at all; it bootstrapped each value from other current estimates through the model. Temporal-difference learning is the missing synthesis: sample one transition like Monte Carlo, and bootstrap from the current estimate like dynamic programming. The result learns online, from a single step, with no model and no wait for the episode to end. Its load-bearing object is the TD error δt=Rt+1+γV(St+1)V(St)\delta_t = R_{t+1} + \discount V(S_{t+1}) - V(S_t) — a one-sample estimate of the Bellman residual.

The TD(0) update

Given a single transition (St,Rt+1,St+1)(S_t, R_{t+1}, S_{t+1}) generated under π\policy, TD(0) nudges the value of the visited state toward a bootstrapped target:

V(St)    V(St)+α[Rt+1+γV(St+1)TD targetV(St)]=V(St)+αδt,V(S_t) \;\leftarrow\; V(S_t) + \alpha\,\big[\, \underbrace{R_{t+1} + \discount V(S_{t+1})}_{\text{TD target}} - V(S_t) \,\big] = V(S_t) + \alpha\,\delta_t ,

with step size α(0,1]\alpha \in (0,1]. Compare the three targets we have now for the same quantity vπ(St)\valuefn_\policy(S_t): Monte Carlo uses the full sampled return GtG_t (no bootstrap, needs the episode to end); dynamic programming uses the model expectation (TπV)(St)(\bellman^\policy V)(S_t) (full bootstrap, needs the model); TD(0) uses Rt+1+γV(St+1)R_{t+1} + \discount V(S_{t+1})one sampled successor plus a bootstrap off the current estimate.

It updates every step, online, and never references p\transition.

TD(0) as a stochastic Euler step

Why should nudging toward a bootstrapped (hence biased, while VvπV \ne \valuefn_\policy) target converge to the right answer? Because in expectation the nudge is a Bellman-operator step.

Proposition 4.1 (Expected TD(0) update).

For any value estimate VV and any state ss, the expected TD error under π\policy is the Bellman expectation residual,

Eπ ⁣[δtSt=s]=(TπV)(s)V(s).\E_\policy\!\big[\,\delta_t \,\big|\, S_t = s\,\big] = (\bellman^\policy V)(s) - V(s).

Hence the expected TD(0) update moves V(s)V(s) a fraction α\alpha of the way toward (TπV)(s)(\bellman^\policy V)(s).

Proof.

Condition on St=sS_t = s and average over the action and transition drawn under π\policy:

Eπ[δtSt=s]=Eπ[Rt+1+γV(St+1)St=s]V(s)(V(s) is deterministic given s)=aπ(as)s,rp(s,rs,a)[r+γV(s)]V(s)(expand the expectation)=(TπV)(s)V(s)(definition of Tπ, Ch. 1 Def. 1.4).\begin{aligned} \E_\policy[\delta_t \mid S_t = s] &= \E_\policy[\, R_{t+1} + \discount V(S_{t+1}) \mid S_t = s\,] - V(s) && \text{($V(s)$ is deterministic given $s$)} \\ &= \sum_a \policy(a\mid s)\sum_{s',r}\transition(s',r\mid s,a)\big[r + \discount V(s')\big] - V(s) && \text{(expand the expectation)} \\ &= (\bellman^\policy V)(s) - V(s) && \text{(definition of }\bellman^\policy\text{, Ch. 1 Def. 1.4).} \end{aligned}

So E[V(s)+αδt]=V(s)+α[(TπV)(s)V(s)]\E[\,V(s) + \alpha\delta_t\,] = V(s) + \alpha\big[(\bellman^\policy V)(s) - V(s)\big]. \qquad\blacksquare

Read the deterministic part as a forward-Euler step on the ODE V˙=TπVV\dot V = \bellman^\policy V - V, whose unique equilibrium is the fixed point Tπvπ=vπ\bellman^\policy \valuefn_\policy = \valuefn_\policy — that is, vπ\valuefn_\policy.

TD(0) follows this flow with stochastic targets, so it is a Robbins–Monro stochastic-approximation scheme: under the step-size conditions tαt=\sum_t \alpha_t = \infty, tαt2<\sum_t \alpha_t^2 < \infty and every state visited infinitely often, VvπV \to \valuefn_\policy Sutton & Barto (2018) Bertsekas (2017) . The contraction of Chapter 1 supplies the stability the ODE needs; sampling supplies the variance that the diminishing αt\alpha_t averages away.

SARSA: on-policy control

Prediction estimates vπ\valuefn_\policy; control needs action-values. SARSA forms the same TD error on QQ from the quintuple (St,At,Rt+1,St+1,At+1)(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) — its namesake — where At+1A_{t+1} is the action the agent actually takes next under its (typically ε\varepsilon-greedy) policy:

Q(St,At)Q(St,At)+α[Rt+1+γQ(St+1,At+1)Q(St,At)].Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\big[\,R_{t+1} + \discount Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)\,\big].

Because the bootstrap uses the value of the sampled next action, SARSA is on-policy: it evaluates and improves the very policy it follows. Interleaving this evaluation with ε\varepsilon-greedy improvement is generalized policy iteration (Chapter 1) driven by sampled errors; under GLIE (greedy in the limit with infinite exploration — ε0\varepsilon \to 0 slowly) it converges to q\optqfn.

Q-learning: off-policy control

Q-learning Watkins & Dayan (1992) changes one symbol — the next action is replaced by the greedy one:

Q(St,At)Q(St,At)+α[Rt+1+γmaxaQ(St+1,a)Q(St,At)].Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\big[\,R_{t+1} + \discount \max_{a'} Q(S_{t+1},a') - Q(S_t,A_t)\,\big].

That max\max makes the target the sampled optimality backup, so Q-learning learns q\optqfn directly while behaving by any sufficiently exploratory policy — it is off-policy. Watkins and Dayan proved QqQ \to \optqfn with probability one under the Robbins–Monro step sizes and infinite visits to every state–action pair. The contrast with SARSA is exactly the contrast between the two operators of Chapter 1, now on action-values.

Proposition 4.2 (SARSA and Q-learning target the two Bellman operators).

With the action-value operators (TπQ)(s,a)=s,rp(s,rs,a)[r+γaπ(as)Q(s,a)](\bellman^\policy Q)(s,a) = \sum_{s',r}\transition(s',r\mid s,a)[r + \discount\sum_{a'}\policy(a'\mid s')Q(s',a')] and (TQ)(s,a)=s,rp(s,rs,a)[r+γmaxaQ(s,a)](\bellmanopt Q)(s,a) = \sum_{s',r}\transition(s',r\mid s,a)[r + \discount\max_{a'}Q(s',a')], the expected one-step targets are

E[SARSA targetSt=s,At=a]=(TπQ)(s,a),E[Q-learning targetSt=s,At=a]=(TQ)(s,a).\E[\,\text{SARSA target}\mid S_t=s,A_t=a] = (\bellman^\policy Q)(s,a), \qquad \E[\,\text{Q-learning target}\mid S_t=s,A_t=a] = (\bellmanopt Q)(s,a).

So SARSA’s fixed point is qπ\qfn_\policy (for the policy it follows) and Q-learning’s is q\optqfn.

Proof.

Average each target over (St+1,Rt+1)(S_{t+1},R_{t+1}) drawn from p(s,a)\transition(\cdot\mid s,a). SARSA additionally averages At+1π(St+1)A_{t+1}\sim\policy(\cdot\mid S_{t+1}), producing aπ(as)Q(s,a)\sum_{a'}\policy(a'\mid s')Q(s',a') inside the bracket — exactly Tπ\bellman^\policy. Q-learning takes maxaQ(s,a)\max_{a'}Q(s',a') deterministically, producing T\bellmanopt. A stochastic-approximation step toward a γ\discount-contraction’s value converges to its unique fixed point (Prop. 4.1’s argument, now for the action-value operators): qπ\qfn_\policy for Tπ\bellman^\policy, q\optqfn for T\bellmanopt. \qquad\blacksquare

On-policy vs off-policy: the cliff

The split is not academic. On the cliff-walking gridworld — a corridor whose bottom edge is a cliff that costs a large penalty and resets the agent — an optimal agent walks right along the cliff’s lip, the shortest path. Q-learning, estimating q\optqfn, learns exactly that risky-optimal route. SARSA, estimating qπ\qfn_\policy for the ε\varepsilon-greedy policy it actually follows, accounts for the fact that exploration will occasionally step it off the cliff, so it prefers a safer path one row up.

Both are correct about different questions — and at a fixed ε\varepsilon, the start-state value Q-learning reports is the higher (optimal) one, while SARSA’s is lower because it embeds the cost of exploration. The companion measures exactly this gap.

n-step TD and the bias–variance dial

TD(0) and Monte Carlo are the endpoints of one family. The nn-step return Gt:t+n=Rt+1+γRt+2++γn1Rt+n+γnV(St+n)G_{t:t+n} = R_{t+1} + \discount R_{t+2} + \cdots + \discount^{n-1}R_{t+n} + \discount^{n} V(S_{t+n}) bootstraps after nn sampled rewards; n=1n=1 is TD(0), n=n=\infty is Monte Carlo. Larger nn samples more (higher variance, less bootstrap bias); smaller nn bootstraps more (lower variance, more bias while VV is wrong). Intermediate nn — and its geometric average TD(λ\lambda) — usually beats both endpoints, the bias–variance dial made adjustable.

The dynamic-programming bridge

Temporal-difference learning closes a loop opened in Chapter 2. Real-time DP applied asynchronous Bellman backups along trajectories, but using the model expectation sp(ss,a)\sum_{s'}\transition(s'\mid s,a). Replace that expectation with a single sampled successor and the deterministic backup becomes the stochastic TD update — RTDP’s “Monte-Carlo cousin,” promised in Chapter 2, is precisely Q-learning. The three methods now form a clean table on two axes:

  • Dynamic programming — bootstrap, no sample (model expectation, full sweep).
  • Monte Carlo — sample, no bootstrap (full return, episode end).
  • Temporal difference — sample and bootstrap (one transition, online).

All three chase the same fixed point of the same γ\discount-contraction; they differ only in how much they sample and how much they bootstrap.

What’s next

  • Week 5 replaces the value table with a parametric approximator VθV_\theta. Bootstrapping (this chapter), off-policy data (Q-learning), and function approximation together form the deadly triad — the first place the clean contraction story breaks, because approximation moves the fixed point itself.

Exercises

  1. (Derive) Show Eπ[δtSt=s]=(TπV)(s)V(s)\E_\policy[\delta_t \mid S_t = s] = (\bellman^\policy V)(s) - V(s) for any VV, and conclude TD(0) is a stochastic step toward vπ\valuefn_\policy (Prop. 4.1).

    Solution

    Eπ[δtSt=s]=Eπ[Rt+1+γV(St+1)St=s]V(s)=aπ(as)s,rp(s,rs,a)[r+γV(s)]V(s)=(TπV)(s)V(s)\E_\policy[\delta_t\mid S_t=s] = \E_\policy[R_{t+1}+\discount V(S_{t+1})\mid S_t=s] - V(s) = \sum_a\policy(a\mid s)\sum_{s',r}\transition(s',r\mid s,a)[r+\discount V(s')] - V(s) = (\bellman^\policy V)(s) - V(s). The expected update is therefore V(s)+α[(TπV)(s)V(s)]V(s) + \alpha[(\bellman^\policy V)(s) - V(s)], a damped step of the γ\discount-contraction whose fixed point is vπ\valuefn_\policy.

  2. (Prove) Show the expected Q-learning target is (TQ)(s,a)(\bellmanopt Q)(s,a) and the expected SARSA target is (TπQ)(s,a)(\bellman^\policy Q)(s,a), hence their fixed points are q\optqfn and qπ\qfn_\policy (Prop. 4.2).

    Solution

    Averaging over (St+1,Rt+1)p(s,a)(S_{t+1},R_{t+1})\sim\transition(\cdot\mid s,a): Q-learning’s maxaQ(St+1,a)\max_{a'}Q(S_{t+1},a') gives s,rp[r+γmaxaQ(s,a)]=(TQ)(s,a)\sum_{s',r}\transition[r+\discount\max_{a'}Q(s',a')] = (\bellmanopt Q)(s,a). SARSA additionally averages At+1π(St+1)A_{t+1}\sim\policy(\cdot\mid S_{t+1}), giving s,rp[r+γaπ(as)Q(s,a)]=(TπQ)(s,a)\sum_{s',r}\transition[r+\discount\sum_{a'}\policy(a'\mid s')Q(s',a')] = (\bellman^\policy Q)(s,a). The fixed points follow from the contraction of each operator.

  3. (Compute) With γ=0.9\discount = 0.9, α=0.1\alpha = 0.1, current V(s)=2V(s)=2, observed Rt+1=1R_{t+1}=1, V(s)=3V(s')=3: compute δt\delta_t and the updated V(s)V(s).

    Solution

    δt=1+0.932=1.7\delta_t = 1 + 0.9\cdot 3 - 2 = 1.7; V(s)2+0.11.7=2.17V(s)\leftarrow 2 + 0.1\cdot 1.7 = 2.17. The same numbers drive a SARSA or Q-learning update with Q(s,)Q(s',\cdot) in place of V(s)V(s') (the sampled-action value for SARSA, the max for Q-learning).

  4. (Implement) In the companion, verify TD(0) converges to the analytic vπ\valuefn_\policy on the random walk; that Q-learning (GLIE) recovers the optimal cliff-edge policy while SARSA settles on the safer, slightly longer route; and that at a fixed ε\varepsilon Q-learning’s start-state value is at least SARSA’s.

    Solution

    See experiments/python/week04/test_td.py: TD(0) matches dp.policy_evaluation within a sampling tolerance; Q-learning’s greedy policy attains the optimal start-state value from dp.value_iteration while SARSA’s greedy is a sensible but suboptimal goal-reaching route (the safe path), so the off-policy greedy value is \ge the on-policy one; and the fixed-ε\varepsilon start-state estimate of Q-learning is \ge SARSA’s — the optimism/realism gap.

  5. (Extend) Sweep nn in nn-step TD on the random walk and reproduce the characteristic U-shaped error-vs-nn curve (an intermediate nn beats both TD(0) and Monte Carlo).

    Solution

    Bootstrapping bias falls with nn while sampling variance rises; the sum is minimized at intermediate nn. The qualitative U-curve over nn (for a fixed α\alpha) is the standard Sutton & Barto result; the companion’s --nstep sweep reproduces it.

Companion code

The Week-4 companion lives at experiments/python/week04/ and reuses earlier weeks: the random walk from Week 3 (randomwalk.py) for prediction, and the Week-1 dp solvers as the exact oracle for both prediction and control.

  • cliffwalk.py — a self-contained cliff-walking MDP (the Sutton & Barto Example 6.6 layout: a cliff row that penalizes and resets) as a generic (P, R) array pair, so v\optvaluefn and the optimal policy come from dp.value_iteration.
  • td.py — the transition sampler plus td0_prediction, sarsa, and q_learning, all operating on a generic (P, R, terminals), with optional GLIE schedules (visit-count step sizes, decaying ε\varepsilon). Pure NumPy.
  • test_td.py — mathematical-correctness tests: TD(0) converges to the analytic vπ\valuefn_\policy; under GLIE Q-learning recovers the optimal start-state value while SARSA learns the safer, slightly longer route; and Q-learning’s fixed-ε\varepsilon start value is at least SARSA’s (the cliff’s on-/off-policy gap).
# core algorithms + correctness tests (pure NumPy, no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week04/test_td.py -q

# worked SARSA-vs-Q-learning comparison on the cliff (prints both learned paths)
PYTHONPATH=. python experiments/python/week04/cliffwalk.py
Part I · Foundations Week 5 Published fa.py test_fa.py

Function Approximation and the Deadly Triad

Replacing the value table with a parametric approximator: linear value functions, semi-gradient TD, and the projected Bellman operator. Why on-policy semi-gradient TD converges to a bounded-error fixed point, and why function approximation, bootstrapping, and off-policy training together can diverge — the deadly triad, witnessed by Baird's counterexample.

Function Approximation and the Deadly Triad

Where we are. Every method so far stored one number per state in a table. That ends the moment the state space is large or continuous: we must approximate the value function with a parametric model VθV_\theta and let it generalize across states. This chapter is the first where the clean contraction story of Chapters 1–4 breaks. Two results frame it. On-policy, semi-gradient TD still converges — but to a projected fixed point whose error is the representation’s projection error, amplified by 1/(1γ)1/(1-\discount): approximation changes the fixed point itself. Off-policy, the same update can diverge. The combination that breaks is named the deadly triad — function approximation, bootstrapping, and off-policy training — and Baird’s counterexample is its sharpest witness.

Linear approximation and semi-gradient TD

Replace the table VRSV \in \R^{\statespace} with a parametric VθV_\theta. We focus on the linear case, where a feature map ϕ:SRd\phi:\statespace\to\R^d (dSd \ll \lvert\statespace\rvert) gives

Vθ(s)=θϕ(s),θRd.V_\theta(s) = \theta^\top \phi(s), \qquad \theta \in \R^d .

The representable value functions form a dd-dimensional subspace {Φθ:θRd}\{\Phi\theta : \theta\in\R^d\} of RS\R^{\statespace}, where Φ\Phi stacks the feature rows. Learning now adjusts θ\theta, and a value learned at one state moves the values of all states sharing features — generalization, the whole point.

Definition 5.1 (Semi-gradient TD(0)).

Given a transition (St,Rt+1,St+1)(S_t, R_{t+1}, S_{t+1}), semi-gradient TD(0) updates

θ    θ+αδtϕ(St),δt=Rt+1+γθϕ(St+1)θϕ(St).\theta \;\leftarrow\; \theta + \alpha\,\delta_t\,\phi(S_t), \qquad \delta_t = R_{t+1} + \discount\,\theta^\top\phi(S_{t+1}) - \theta^\top\phi(S_t).

It is called semi-gradient because it treats the bootstrap target γθϕ(St+1)\discount\,\theta^\top\phi(S_{t+1}) as fixed — it does not differentiate the target with respect to θ\theta, so the update is not the true gradient of any fixed objective.

That last point is the crux of the chapter.

A true-gradient method on a fixed loss cannot diverge to infinity; semi-gradient TD can, because the “target” it descends toward moves with θ\theta.

The projected Bellman operator

Where does semi-gradient TD converge, when it does? Not to TπVθ\bellman^\policy V_\theta — that generally leaves the representable subspace and is not expressible as any θ\theta. The update implicitly projects it back.

Definition 5.2 (Projection and the TD fixed point).

Let μ\mu be a distribution over states and x,yμ=sμ(s)x(s)y(s)\langle x,y\rangle_\mu = \sum_s \mu(s)x(s)y(s) the weighted inner product, with norm μ\norm{\cdot}_\mu. The projection Πμ\Pi_\mu maps a value function to the nearest representable one, Πμx=argminθΦθxμ\Pi_\mu x = \arg\min_{\theta}\norm{\Phi\theta - x}_\mu. The TD fixed point is the VθV_\theta satisfying

Vθ=ΠμTπVθ,V_\theta = \Pi_\mu\,\bellman^\policy V_\theta ,

the fixed point of the composed projected Bellman operator ΠμTπ\Pi_\mu\bellman^\policy.

Semi-gradient TD is a stochastic-approximation scheme (Chapter 4) for this fixed point: in expectation, under the state-visitation distribution μ\mu its update drives θ\theta toward ΠμTπVθ\Pi_\mu\bellman^\policy V_\theta. Whether that iteration converges hinges entirely on whether ΠμTπ\Pi_\mu\bellman^\policy is a contraction — which depends on μ\mu.

When it works: on-policy convergence

Proposition 5.1 (On-policy convergence and the projection-error bound).

Let μ\mu be the on-policy stationary distribution of π\policy (so μPπ=μ\mu^\top P_\policy = \mu^\top). Then Tπ\bellman^\policy is a γ\discount-contraction in μ\norm{\cdot}_\mu, and Πμ\Pi_\mu is nonexpansive in μ\norm{\cdot}_\mu, so ΠμTπ\Pi_\mu\bellman^\policy is a γ\discount-contraction with a unique fixed point VθV_{\theta^\star}. Its error is the projection error, amplified by the horizon:

Vθvπμ    11γΠμvπvπμ.\norm{V_{\theta^\star} - \valuefn_\policy}_\mu \;\le\; \frac{1}{1-\discount}\, \norm{\Pi_\mu \valuefn_\policy - \valuefn_\policy}_\mu .
Proof.

Two facts compose. (i) Under the stationary μ\mu, the transition operator is nonexpansive in μ\norm{\cdot}_\mu:

Pπxμ2=sμ(s)(sPπ(ss)x(s))2(definition)sμ(s)sPπ(ss)x(s)2(Jensen; Pπ(s) a distribution)=s(sμ(s)Pπ(ss))x(s)2=sμ(s)x(s)2=xμ2(μPπ=μ).\begin{aligned} \norm{P_\policy x}_\mu^2 &= \sum_s \mu(s)\Big(\sum_{s'}P_\policy(s'\mid s)\,x(s')\Big)^2 && \text{(definition)} \\ &\le \sum_s \mu(s)\sum_{s'}P_\policy(s'\mid s)\,x(s')^2 && \text{(Jensen; }P_\policy(\cdot\mid s)\text{ a distribution)} \\ &= \sum_{s'}\Big(\sum_s \mu(s)P_\policy(s'\mid s)\Big)x(s')^2 = \sum_{s'}\mu(s')\,x(s')^2 = \norm{x}_\mu^2 && \text{(}\mu^\top P_\policy = \mu^\top\text{).} \end{aligned}

Since TπxTπy=γPπ(xy)\bellman^\policy x - \bellman^\policy y = \discount P_\policy(x-y), this gives TπxTπyμγxyμ\norm{\bellman^\policy x - \bellman^\policy y}_\mu \le \discount\norm{x-y}_\mu. (ii) Πμ\Pi_\mu is an orthogonal projection in ,μ\langle\cdot,\cdot\rangle_\mu, hence nonexpansive. So ΠμTπ\Pi_\mu\bellman^\policy is a γ\discount-contraction and has a unique fixed point VθV_{\theta^\star} (Banach, Ch. 1). For the bound, since Tπvπ=vπ\bellman^\policy\valuefn_\policy = \valuefn_\policy we have Πμvπ=ΠμTπvπ\Pi_\mu\valuefn_\policy = \Pi_\mu\bellman^\policy\valuefn_\policy, so

VθΠμvπμ=ΠμTπVθΠμTπvπμγVθvπμ.\norm{V_{\theta^\star} - \Pi_\mu\valuefn_\policy}_\mu = \norm{\Pi_\mu\bellman^\policy V_{\theta^\star} - \Pi_\mu\bellman^\policy\valuefn_\policy}_\mu \le \discount\norm{V_{\theta^\star} - \valuefn_\policy}_\mu .

With the triangle inequality VθvπμVθΠμvπμ+Πμvπvπμ\norm{V_{\theta^\star}-\valuefn_\policy}_\mu \le \norm{V_{\theta^\star}-\Pi_\mu\valuefn_\policy}_\mu + \norm{\Pi_\mu\valuefn_\policy - \valuefn_\policy}_\mu, substitute and solve to get the stated bound. \qquad\blacksquare

Two readings. First, convergence is conditional: it rests on μ\mu being the on-policy distribution, the one fact that makes PπP_\policy nonexpansive.

Second, the fixed point moved: even at convergence the error is the projection error Πμvπvπμ\norm{\Pi_\mu\valuefn_\policy - \valuefn_\policy}_\mu — what the feature space cannot represent — blown up by 1/(1γ)1/(1-\discount). With a table the projection error is zero and we recover vπ\valuefn_\policy exactly; with a poor feature space the fixed point can be far from vπ\valuefn_\policy. Tsitsiklis and Van Roy Tsitsiklis & Van Roy (1997) established exactly this convergence and bound for on-policy linear TD.

The deadly triad

Proposition 5.1 needed all three of its ingredients to be benign. Drop the on-policy assumption and the guarantee collapses. The instability requires the conjunction of three elements, each harmless alone — the deadly triad Sutton & Barto (2018) :

  1. Function approximation — a value model with fewer parameters than states.
  2. Bootstrapping — TD/DP targets that reuse current estimates (vs. Monte Carlo’s full returns).
  3. Off-policy training — updating from a distribution other than the policy’s own.

Any two are safe: tabular off-policy bootstrapping is Q-learning (Chapter 4, convergent); on-policy bootstrapping with approximation is Proposition 5.1; Monte Carlo with approximation and off-policy data is a genuine gradient method and is stable. All three together can make ΠμTπ\Pi_\mu\bellman^\policy — now with μ\mu the wrong (off-policy) distribution — an expansion, and the parameters diverge to infinity. Baird Baird (1995) built the canonical witness.

Example 5.1 (Baird's counterexample).

Seven states, a linear feature map of dimension eight, and zero reward everywhere — so vπ=0\valuefn_\policy = 0 is trivially representable (θ=0\theta = 0). The target policy always takes the “solid” action (to the seventh state); the behaviour policy explores all states (off-policy). Off-policy semi-gradient TD with uniform state weighting applies a linear update θk+1=(I+αA)θk\theta_{k+1} = (I + \alpha A)\theta_k whose matrix AA has an eigenvalue with positive real part. The weights diverge geometrically even though a perfect, zero-error solution sits at the origin.

Eight weight trajectories all growing in magnitude without bound over sweeps, on a log-scaled vertical axis.
Off-policy semi-gradient TD on Baird's counterexample (gamma = 0.99): every component of the linear weight vector diverges, and the value-error norm grows without bound, although the exact solution theta = 0 has zero error. Produced by experiments/python/week05/baird.py.

The lesson is not that approximation is hopeless but that the interaction is what bites: the projected operator’s contraction was an on-policy privilege, and off-policy data revokes it. The next chapter’s engineering — target networks and replay — exists precisely to tame this instability well enough to train deep off-policy value functions in practice.

Beyond semi-gradient: LSTD and true-gradient methods

Two routes around the difficulty are worth naming. Least-squares TD (LSTD) solves the linear TD fixed-point equations directly: accumulate A=tϕ(St)(ϕ(St)γϕ(St+1))A = \sum_t \phi(S_t)\big( \phi(S_t) - \discount\phi(S_{t+1})\big)^\top and b=tϕ(St)Rt+1b = \sum_t \phi(S_t)R_{t+1}, then set θ=A1b\theta^\star = A^{-1}b — no step size, far more sample-efficient per datum, converging to the same on-policy fixed point as Proposition 5.1. Gradient-TD methods (GTD, TDC) instead descend a true objective (the projected Bellman error), restoring convergence even off-policy at the cost of a second set of weights. Both are the subject of the roadmap’s extension; LSTD appears in the companion.

The dynamic-programming bridge

Function approximation is the first crack in the contraction spine. Chapters 1–4 chased the unique fixed point of a γ\discount-contraction; here the object becomes ΠμTπ\Pi_\mu\bellman^\policy, and its fixed point both moves (by the projection error) and, off-policy, may not exist as a stable limit at all. Two bridges forward:

  • To control (LQR, Ch. 13). The linear-quadratic regulator is the lucky case where the optimal value is exactly quadratic — representable with zero projection error — so the approximation never bites and the Riccati recursion converges cleanly. Most of control’s tractable cases are exactly this: a function class the true value lives inside.
  • To deep RL (Week 6). DQN keeps bootstrapping, off-policy data, and a (nonlinear) approximator — the full triad — and survives by engineering the instability down: a slowly-updated target network freezes the bootstrap target, and experience replay decorrelates and re-weights the update distribution toward something tamer.

What’s next

  • Week 6 builds DQN: the replay buffer and target network as direct countermeasures to the deadly triad, scaling approximate value learning to pixels.

Exercises

  1. (Derive) Write the semi-gradient TD(0) update for linear VθV_\theta and show it is not the gradient of 12E[δt2]\tfrac12\E[\delta_t^2]. Identify the missing term.

    Solution

    θ12δt2=δtθδt=δt(γϕ(St+1)ϕ(St))\nabla_\theta\tfrac12\delta_t^2 = \delta_t\nabla_\theta\delta_t = \delta_t\big( \discount\phi(S_{t+1}) - \phi(S_t)\big). Semi-gradient TD uses δtϕ(St)-\delta_t\phi(S_t) (ascending δtϕ(St)\delta_t\phi(S_t)), dropping the γδtϕ(St+1)\discount\delta_t\phi(S_{t+1}) term from differentiating the target. The dropped term is exactly what would make it a true gradient — and what, when present (residual-gradient methods), changes the fixed point and the stability.

  2. (Prove) Show ΠμTπ\Pi_\mu\bellman^\policy is a γ\discount-contraction in μ\norm{\cdot}_\mu when μ\mu is the on-policy distribution, and derive the projection-error bound (Prop. 5.1).

    Solution

    Tπ\bellman^\policy is a γ\discount-contraction in μ\norm{\cdot}_\mu because PπP_\policy is nonexpansive under the stationary μ\mu (Jensen + μPπ=μ\mu^\top P_\policy = \mu^\top); Πμ\Pi_\mu is nonexpansive as an orthogonal projection; the composition is a γ\discount-contraction. The bound follows from Πμvπ=ΠμTπvπ\Pi_\mu\valuefn_\policy = \Pi_\mu\bellman^\policy\valuefn_\policy and the triangle inequality, as in the proof above.

  3. (Compute) In Baird’s setup the update is θk+1=(I+αA)θk\theta_{k+1} = (I + \alpha A)\theta_k. What property of AA (or of I+αAI + \alpha A) causes divergence, and why does a small α\alpha not prevent it?

    Solution

    Divergence occurs iff I+αAI + \alpha A has spectral radius >1> 1, i.e. AA has an eigenvalue with positive real part (for small α>0\alpha>0, ρ(I+αA)>1\rho(I+\alpha A)>1 exactly then). Shrinking α\alpha only slows the geometric blow-up; it cannot flip the sign of the unstable eigenvalue. The companion computes AA‘s spectrum.

  4. (Implement) Verify on the companion that on-policy semi-gradient TD converges to (near) vπ\valuefn_\policy on the random walk, while off-policy semi-gradient TD on Baird’s counterexample diverges (the committed figure).

    Solution

    See experiments/python/week05/test_fa.py: on-policy semi-gradient TD with tabular-complete features matches dp.policy_evaluation within a sampling tolerance and keeps bounded weights; Baird’s off-policy update grows the weight norm without bound (the divergence the figure plots).

  5. (Extend) Implement LSTD and confirm it recovers the same on-policy fixed point as semi-gradient TD (and vπ\valuefn_\policy exactly when the features are tabular-complete).

    Solution

    LSTD solves Aθ=bA\theta = b with A=tϕ(St)(ϕ(St)γϕ(St+1))A = \sum_t\phi(S_t)(\phi(S_t)-\discount\phi(S_{t+1}))^\top, b=tϕ(St)Rt+1b=\sum_t\phi(S_t)R_{t+1}; its solution is the TD fixed point. With tabular-complete features the projection error is zero and θ\theta^\star reproduces vπ\valuefn_\policy, which the companion’s lstd asserts against the Week-1 linear solve.

Companion code

The Week-5 companion lives at experiments/python/week05/ and reuses the Week-3 random walk and the Week-1 dp.policy_evaluation oracle. It is the chapter’s two poles side by side: a convergent on-policy method and a divergent off-policy one. (The roadmap suggests tile coding on MountainCar-v0; the random walk is used here because it has a closed-form vπ\valuefn_\policy to test against — MountainCar tile-coding is a natural deferred showcase that lacks an exact oracle.)

  • fa.py — linear semi_gradient_td (constant step size with Polyak–Ruppert averaging) and lstd for a generic (P, R, policy) with an arbitrary feature matrix Φ\Phi. Pure NumPy.
  • baird.py — Baird’s seven-state counterexample: builds the feature matrix and the off-policy expected-update matrix AA, iterates to divergence, and (with --plot) renders the committed divergence figure.
  • test_fa.py — mathematical-correctness tests: on-policy semi-gradient TD converges to the analytic vπ\valuefn_\policy (and stays bounded); LSTD recovers vπ\valuefn_\policy with tabular-complete features; and Baird’s off-policy update diverges (the weight norm grows without bound while AA has an unstable eigenvalue).
# core algorithms + correctness tests (pure NumPy, no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week05/test_fa.py -q

# regenerate the committed Baird divergence figure
PYTHONPATH=. python experiments/python/week05/baird.py --plot
Part I · Foundations Week 6 Published dqn.py test_dqn.py

The DQN Family

Making approximate Q-learning stable enough for pixels: experience replay and target networks as direct countermeasures to the deadly triad, the overestimation bias that motivates Double DQN, and the dueling, prioritized, and Rainbow refinements. The engineering that scaled value-based RL to Atari.

The DQN Family

Where we are. Chapter 4 gave us Q-learning; Chapter 5 warned that combining bootstrapping, off-policy data, and function approximation — the deadly triad — can diverge. The deep Q-network is that exact combination (Q-learning, off-policy, with a neural-network approximator), and naively it does diverge. What made it work on Atari from pixels — first as a 2013 workshop result Mnih et al. (2013) , then at human level Mnih et al. (2015) — was not new theory but two pieces of engineering that tame the triad’s two instabilities: experience replay and a target network. This chapter is about that engineering — and the overestimation bias that the next refinement, Double DQN, corrects.

From Q-learning to a deep Q-network

Replace the Q-table with a network Qθ(s,a)Q_\theta(s,a) and fit it to the sampled Bellman optimality target. DQN minimizes, over transitions (s,a,r,s)(s,a,r,s') drawn from a replay buffer D\mathcal{D},

L(θ)=E(s,a,r,s)D[(r+γmaxaQθ(s,a)Qθ(s,a))2],L(\theta) = \E_{(s,a,r,s')\sim\mathcal{D}}\Big[\big(\, r + \discount \max_{a'} Q_{\theta^-}(s',a') - Q_\theta(s,a)\,\big)^2\Big],

where θ\theta^- are the target-network weights. As in Chapter 5 this is a semi-gradient method: we differentiate only Qθ(s,a)Q_\theta(s,a), treating the target as fixed. Two structural problems would sink naive online training, and DQN answers each:

  • Correlated data. Consecutive transitions in a trajectory are highly correlated, violating the i.i.d. assumption gradient descent leans on.
  • A moving target. The regression target uses the same network being updated, so every gradient step shifts the target it was chasing — the bootstrap instability of the triad.

Experience replay

The first fix stores each transition (s,a,r,s)(s,a,r,s') in a fixed-capacity buffer D\mathcal{D} and trains on uniformly sampled minibatches rather than the latest transition. Three benefits follow: sampling across many past episodes decorrelates the minibatch toward the i.i.d. regime; each transition is reused many times, improving sample efficiency; and the update distribution becomes the buffer’s mixture rather than the current policy’s trajectory, re-weighting away from the pathological off-policy distributions that drive divergence.

The idea is old — it is the model-free, sampled descendant of the prioritized sweeping of Chapter 2 — and its prioritized variant returns shortly.

Target networks

The second fix freezes the bootstrap. DQN keeps a separate copy QθQ_{\theta^-} of the network, holds it fixed while training QθQ_\theta, and refreshes θθ\theta^- \leftarrow \theta only every CC steps (a hard update) or by a slow Polyak average θτθ+(1τ)θ\theta^- \leftarrow \tau\theta + (1-\tau)\theta^- (a soft update). Between refreshes the target r+γmaxaQθ(s,a)r + \discount\max_{a'}Q_{\theta^-}(s',a') is a fixed function, so each interval is an ordinary supervised regression toward a stationary target — exactly the stability the moving-target triad destroyed.

The slower θ\theta^- moves, the more stable and the slower the learning — the central DQN tuning trade-off.

Overestimation and Double DQN

One bias survives even with replay and a target network: the max\max in the target overestimates. Because the network’s action-values are noisy estimates, taking their maximum is systematically too high.

Proposition 6.1 (The max overestimates).

Let Q^(s,)\widehat{Q}(s',\cdot) be unbiased estimates of the true values q(s,)q(s',\cdot), i.e. E[Q^(s,a)]=q(s,a)\E[\widehat{Q}(s',a)] = q(s',a) for each aa. Then

E[maxaQ^(s,a)]    maxaE[Q^(s,a)]=maxaq(s,a),\E\big[\max_{a}\widehat{Q}(s',a)\big] \;\ge\; \max_{a}\E\big[\widehat{Q}(s',a)\big] = \max_a q(s',a),

with strict inequality whenever two or more actions’ estimates have positive-variance overlap. The bootstrap target therefore inherits a positive bias.

Proof.

The function xmaxaxax \mapsto \max_a x_a is convex (a pointwise maximum of linear maps). Jensen’s inequality for a convex function gives E[maxaQ^(s,a)]maxaE[Q^(s,a)]\E[\max_a \widehat{Q}(s',a)] \ge \max_a \E[\widehat{Q}(s',a)], and the right side is maxaq(s,a)\max_a q(s',a) by unbiasedness. Jensen is an equality only where the convex function is affine along the distribution’s support; the max\max has a kink exactly where the maximizing action changes, so any noise that makes the argmax\arg\max random makes the inequality strict. \qquad\blacksquare

Double DQN Hasselt et al. (2016) removes most of this bias by decoupling action selection from evaluation: pick the next action with the online network but evaluate it with the target network,

yDouble=r+γQθ ⁣(s,arg maxaQθ(s,a)).y^{\text{Double}} = r + \discount\, Q_{\theta^-}\!\big(s',\, \argmax_{a'} Q_\theta(s',a')\big).

The selecting and evaluating estimates now have independent errors, so they no longer conspire to inflate the maximum — a one-line change that measurably improves scores.

Dueling, prioritized replay, and Rainbow

Three further refinements round out the family. The dueling architecture Wang et al. (2016) splits the network into a state-value stream V(s)V(s) and an advantage stream A(s,a)A(s,a), recombined as Q(s,a)=V(s)+(A(s,a)1AaA(s,a))Q(s,a) = V(s) + \big(A(s,a) - \tfrac{1}{\lvert\actionspace\rvert}\sum_{a'}A(s,a')\big), so the agent can learn a state is good without estimating every action precisely. Prioritized experience replay Schaul et al. (2016) samples transitions in proportion to their Bellman-residual magnitude — Chapter 2’s prioritized sweeping, now over replayed transitions — with importance weights to correct the sampling bias. Rainbow Hessel et al. (2018) combines six such improvements (double, dueling, prioritized replay, multi-step returns, distributional values, and noisy exploration) and ablates each, showing they are largely complementary.

From states to pixels

DQN’s headline result was learning from raw Atari frames on the Arcade Learning Environment Bellemare et al. (2013) . The pixel pipeline adds its own engineering: grayscale and downsample each frame, stack four frames so velocity is observable (a single frame is not Markov), skip frames, clip rewards to {1,0,+1}\{-1,0,+1\}, and read the stack with a convolutional network. None of this changes the algorithm — it changes the features the same loss is regressed on.

Reliable comparison across all this machinery is itself hard — reproducibility studies and standardized implementations Raffin et al. (2021) exist precisely because small details swing results, a theme Week 8 returns to.

The dynamic-programming bridge

DQN is the deadly triad (Chapter 5) survived by engineering rather than dissolved by theory. The regression target is still the sampled Bellman optimality backup of Chapter 1; the two additions each blunt one edge of the triad. Replay re-weights and decorrelates the update distribution — the off-policy edge — and is the model-free heir to prioritized sweeping (Chapter 2). The target network freezes the bootstrap operator for CC steps — the moving-target edge — converting divergent fixed-point chasing into a sequence of stationary regressions. The discount γ\discount that guaranteed convergence with a table now only bounds the per-interval target; stability is bought, not proved.

What’s next

  • Week 7 changes the object of optimization entirely: instead of learning values and acting greedily, policy-gradient methods parameterize and optimize the policy directly, sidestepping the max\max and its overestimation, and extending naturally to continuous actions.

Exercises

  1. (Prove) Show E[maxaQ^(s,a)]maxaq(s,a)\E[\max_a \widehat{Q}(s',a)] \ge \max_a q(s',a) for unbiased Q^\widehat{Q}, and state when the inequality is strict (Prop. 6.1).

    Solution

    maxa\max_a is convex, so Jensen gives E[maxaQ^(s,a)]maxaE[Q^(s,a)]=maxaq(s,a)\E[\max_a\widehat{Q}(s',a)] \ge \max_a\E[\widehat{Q}(s',a)] = \max_a q(s',a). It is strict whenever the noise makes the argmax\arg\max random (two actions’ estimates overlap with positive probability), because the max\max is non-affine across the kink where the maximizer switches.

  2. (Derive) Write the Double DQN target and explain why it reduces the overestimation of Proposition 6.1.

    Solution

    yDouble=r+γQθ(s,arg maxaQθ(s,a))y^{\text{Double}} = r + \discount Q_{\theta^-}(s', \argmax_{a'}Q_\theta(s',a')). Selection uses QθQ_\theta, evaluation uses QθQ_{\theta^-}; their estimation errors are (largely) independent, so the action chosen as best by one network is not automatically assigned an inflated value by the same network. The E[max]\E[\max] bias becomes the much smaller bias of evaluating a possibly-suboptimal action.

  3. (Compute) A replay buffer of capacity 3 receives transitions t1,,t5t_1,\dots,t_5 in order. Which are stored after t5t_5, and why does a hard target update at step CC leave θ\theta^- momentarily equal to θ\theta?

    Solution

    A circular buffer of capacity 3 keeps the three most recent, {t3,t4,t5}\{t_3,t_4,t_5\} (t1,t2t_1,t_2 overwritten). A hard update copies θθ\theta^- \leftarrow \theta, so immediately after step CC the two networks are identical; they diverge again as θ\theta updates over the next CC steps while θ\theta^- is held fixed.

  4. (Implement) In the companion, verify the replay buffer, target-update, and TD-target components, and that DQN learns CartPole well above the random-return baseline within a fixed step budget.

    Solution

    See experiments/python/week06/test_dqn.py: circular-buffer overwrite and sample shapes; hard/soft target updates; the done-masked TD target and the Double-DQN selection/evaluation split; the empirical overestimation of the plain max; and a seeded CartPole run whose mean return rises far above the ~22 random baseline.

  5. (Extend) Add Double and Dueling variants and measure the overestimation gap (the plain-max target minus the Double target) over training.

    Solution

    The companion exposes double and dueling flags; the plain-max target sits above the Double target early in training (when value estimates are noisiest) and the gap shrinks as the network sharpens — the empirical face of Proposition 6.1.

Companion code

The Week-6 companion lives at experiments/python/week06/ and is the chapter’s first PyTorch code. Its correctness suite follows the repo’s deep-RL convention: fast, deterministic component tests for the pieces, plus a seeded simple-environment convergence check — heavy pixel environments are a deferred @slow showcase, not a graded test, in line with the 8 GB GPU budget.

  • dqn.py — a minimal DQN on CartPole-v1: a circular ReplayBuffer, an MLP QNetwork (and a DuelingQNetwork), an exposed td_target (plain and Double), hard/soft target updates, and the training loop, with double/dueling flags.
  • test_dqn.py — component-correctness tests (replay overwrite + sample shapes; target-network hard/soft updates; the done-masked Bellman target; the Double-DQN selection/evaluation split; the max-overestimation of Prop. 6.1) plus a seeded CartPole run asserting the mean return clears the random baseline by a wide margin.
# component tests + a seeded CartPole learning check (PyTorch; ~1-2 min on CPU)
PYTHONPATH=. pytest experiments/python/week06/test_dqn.py -q

# worked CartPole training run (prints the learning curve summary)
PYTHONPATH=. python experiments/python/week06/dqn.py --double --episodes 400
Part I · Foundations Week 7 Published reinforce.py test_reinforce.py

Policy Gradient Foundations

Optimizing a parameterized stochastic policy directly by gradient ascent on expected return: the policy gradient theorem via the log-derivative trick, REINFORCE, and baselines as variance-reducing control variates. Policy gradients as Monte Carlo sensitivity analysis — and the advantage that bridges to actor-critic.

Policy Gradient Foundations

Where we are. Every method so far learned a value and acted greedily — the arg max\argmax of Chapters 4–6. Policy-gradient methods discard that indirection: they parameterize the policy πθ(as)\policy_\theta(a\mid s) and ascend the gradient of expected return directly. This sidesteps the max\max and its overestimation (Chapter 6), handles stochastic policies and continuous actions natively, and rests on one identity — the log-derivative trick — that turns “differentiate an expectation you can only sample” into “weight samples by the score θlogπθ\nabla_\theta\log\policy_\theta.” The roadmap’s framing is exact: policy gradients are Monte Carlo sensitivity analysis.

The objective and the score

Let the policy be πθ(as)\policy_\theta(a\mid s), differentiable in θ\theta. A trajectory τ=(S0,A0,R1,)\tau = (S_0, A_0, R_1, \dots) has return G(τ)=t0γtRt+1\return(\tau) = \sum_{t\ge 0}\discount^t R_{t+1}, and the objective is its expectation,

J(θ)Eτπθ ⁣[G(τ)].J(\theta) \defeq \E_{\tau\sim\policy_\theta}\!\big[\,\return(\tau)\,\big].

We cannot differentiate JJ by differentiating the reward — the dependence on θ\theta is through the sampling distribution of τ\tau, not the integrand. The score function θlogπθ(as)\nabla_\theta\log\policy_\theta(a\mid s) is what carries that dependence, via one identity.

The policy gradient theorem

The gradient of JJ has a famously clean form, due to Sutton et al. Sutton et al. (2000) : an expectation of the score weighted by the action-value, with no derivative of the unknown dynamics anywhere in it.

Theorem 7.1 (Policy gradient theorem).

The gradient of the expected return is

θJ(θ)=Eτπθ ⁣[G(τ)t0θlogπθ(AtSt)]=Eπθ ⁣[t0θlogπθ(AtSt)qπθ(St,At)].\nabla_\theta J(\theta) = \E_{\tau\sim\policy_\theta}\!\Big[\, \return(\tau)\sum_{t\ge 0}\nabla_\theta\log\policy_\theta(A_t\mid S_t)\,\Big] = \E_{\policy_\theta}\!\Big[\sum_{t\ge 0}\nabla_\theta\log\policy_\theta(A_t\mid S_t)\,\qfn_{\policy_\theta}(S_t,A_t)\Big].

No derivative of the dynamics or the reward appears — only the score of the policy.

Proof.

Write J(θ)=pθ(τ)G(τ)dτJ(\theta) = \int p_\theta(\tau)\,\return(\tau)\,d\tau with trajectory density pθ(τ)=p(s0)tπθ(atst)p(st+1st,at)p_\theta(\tau) = p(s_0)\prod_t \policy_\theta(a_t\mid s_t)\,\transition(s_{t+1}\mid s_t,a_t). Differentiate and apply the log-derivative trick θpθ=pθθlogpθ\nabla_\theta p_\theta = p_\theta\nabla_\theta\log p_\theta:

θJ=θpθ(τ)G(τ)dτ(differentiate under the integral)=pθ(τ)θlogpθ(τ)G(τ)dτ(log-derivative trick)=Eτ ⁣[G(τ)θlogpθ(τ)](definition of expectation).\begin{aligned} \nabla_\theta J &= \int \nabla_\theta p_\theta(\tau)\,\return(\tau)\,d\tau && \text{(differentiate under the integral)} \\ &= \int p_\theta(\tau)\,\nabla_\theta\log p_\theta(\tau)\,\return(\tau)\,d\tau && \text{(log-derivative trick)} \\ &= \E_{\tau}\!\big[\,\return(\tau)\,\nabla_\theta\log p_\theta(\tau)\,\big] && \text{(definition of expectation).} \end{aligned}

In logpθ(τ)=logp(s0)+t[logπθ(atst)+logp(st+1st,at)]\log p_\theta(\tau) = \log p(s_0) + \sum_t\big[\log\policy_\theta(a_t\mid s_t) + \log\transition(s_{t+1}\mid s_t,a_t)\big] the initial-state and dynamics terms do not depend on θ\theta, so θlogpθ(τ)=tθlogπθ(atst)\nabla_\theta\log p_\theta(\tau) = \sum_t\nabla_\theta\log \policy_\theta(a_t\mid s_t)the model need not be known or differentiated. That gives the first form. Using causality (an action cannot influence past rewards, E[θlogπθ(AtSt)Rt]=0\E[\nabla_\theta\log\policy_\theta(A_t\mid S_t)R_{t'}] = 0 for ttt' \le t) replaces G(τ)\return(\tau) by the return-to-go, whose conditional expectation is qπθ(St,At)\qfn_{\policy_\theta}(S_t,A_t), giving the second. \qquad\blacksquare

The estimator is Monte Carlo sensitivity analysis: it estimates θE[]\nabla_\theta\E[ \cdot] from samples without differentiating the sampled function, by reweighting each sample by its score.

It is unbiased and, being Monte Carlo (Chapter 3), high-variance — which the rest of the chapter attacks.

REINFORCE

Sampling the theorem’s expectation gives the REINFORCE algorithm Williams (1992) : roll out episodes under πθ\policy_\theta, form the returns-to-go Gt\return_t, and ascend

θJ^=tθlogπθ(AtSt)Gt,θθ+αθJ^.\widehat{\nabla_\theta J} = \sum_t \nabla_\theta\log\policy_\theta(A_t\mid S_t)\,\return_t, \qquad \theta \leftarrow \theta + \alpha\,\widehat{\nabla_\theta J}.

It is unbiased and model-free, the policy-space counterpart of Monte Carlo value estimation — and inherits Monte Carlo’s variance.

The single most effective fix is a baseline.

Baselines as control variates

Proposition 7.1 (A state baseline is unbiased).

For any function b(s)b(s) that does not depend on the action,

Eπθ ⁣[θlogπθ(AtSt)b(St)]=0,\E_{\policy_\theta}\!\big[\,\nabla_\theta\log\policy_\theta(A_t\mid S_t)\,b(S_t)\,\big] = 0,

so subtracting b(St)b(S_t) from the return weight in the policy gradient leaves it unbiased, while choosing bb to track the typical return reduces its variance.

Proof.

Condition on St=sS_t = s and average the score over actions:

Eaπθ(s) ⁣[θlogπθ(as)]=aπθ(as)θπθ(as)πθ(as)=aθπθ(as)=θaπθ(as)=θ1=0.\E_{a\sim\policy_\theta(\cdot\mid s)}\!\big[\nabla_\theta\log\policy_\theta(a\mid s)\big] = \sum_a \policy_\theta(a\mid s)\,\frac{\nabla_\theta\policy_\theta(a\mid s)}{\policy_\theta(a\mid s)} = \sum_a \nabla_\theta\policy_\theta(a\mid s) = \nabla_\theta\sum_a\policy_\theta(a\mid s) = \nabla_\theta 1 = 0.

Hence E[θlogπθ(AtSt)b(St)]=E[b(St)0]=0\E[\nabla_\theta\log\policy_\theta(A_t\mid S_t)\,b(S_t)] = \E[b(S_t)\cdot 0] = 0: subtracting b(St)b(S_t) changes the gradient’s variance but not its mean. The variance-minimizing choice makes the weight a centered quantity; taking b(s)=vπθ(s)b(s) = \valuefn_{\policy_\theta}(s) turns the weight into the advantage A(s,a)=q(s,a)v(s)\advantage(s,a) = \qfn(s,a) - \valuefn(s). \qquad\blacksquare

A baseline is precisely a control variate: a zero-mean term subtracted to shrink variance without moving the estimate.

With b=vb = \valuefn, the policy gradient becomes E[tθlogπθ(AtSt)A(St,At)]\E[\sum_t \nabla_\theta\log\policy_\theta(A_t\mid S_t)\,\advantage(S_t,A_t)] — the form every actor-critic method (Week 8) estimates. Learning the baseline v\valuefn is what makes it a critic.

The dynamic-programming bridge

Policy gradients invert the value-based pattern. Chapters 4–6 solved a Bellman fixed point and read off a greedy policy; here the policy is the primal object optimized by gradient ascent, and value functions re-enter only as the baseline/critic that tames variance. Three threads carry forward:

  • To actor-critic (Week 8). Learn v\valuefn (or the advantage directly) as the baseline; the policy is the actor, the value the critic. Generalized advantage estimation tunes the bias–variance of A\advantage, and trust regions (PPO/TRPO) control the ascent step size in policy space.
  • To continuous control (Week 9). No arg max\argmax over actions is ever taken, so continuous action spaces are immediate — the setting where DDPG, TD3, and SAC live.
  • To optimal control (Part II). Differentiating an expected cost through the dynamics is the discrete, stochastic cousin of the adjoint/Hamiltonian sensitivity of Pontryagin’s principle — direct policy search is trajectory optimization with the model replaced by samples.

What’s next

  • Week 8 learns the baseline as a critic (actor-critic), tunes the advantage with generalized advantage estimation, and adds trust regions (TRPO/PPO) to bound the policy update — turning REINFORCE’s noisy ascent into a stable, sample-reusing optimizer.

Exercises

  1. (Derive) Derive the policy gradient theorem from J(θ)=Eτ[G(τ)]J(\theta) = \E_{\tau}[\return(\tau)] using the log-derivative trick, and show the dynamics terms drop (Theorem 7.1).

    Solution

    θJ=θpθ(τ)G(τ)dτ=Eτ[G(τ)θlogpθ(τ)]\nabla_\theta J = \int\nabla_\theta p_\theta(\tau)\return(\tau)d\tau = \E_\tau[ \return(\tau)\nabla_\theta\log p_\theta(\tau)]. Since logpθ(τ)\log p_\theta(\tau) splits into initial-state, policy, and dynamics terms and only the policy terms carry θ\theta, θlogpθ(τ)=tθlogπθ(atst)\nabla_\theta\log p_\theta(\tau) = \sum_t\nabla_\theta\log\policy_\theta(a_t \mid s_t) — the model cancels.

  2. (Prove) Show a state-dependent baseline b(s)b(s) leaves the policy gradient unbiased, i.e. E[θlogπθ(AS)b(S)]=0\E[\nabla_\theta\log\policy_\theta(A\mid S)\,b(S)] = 0 (Prop. 7.1).

    Solution

    Eaπθ(s)[θlogπθ(as)]=aθπθ(as)=θaπθ(as)=θ1=0\E_{a\sim\policy_\theta(\cdot\mid s)}[\nabla_\theta\log\policy_\theta(a\mid s)] = \sum_a\nabla_\theta\policy_\theta(a\mid s) = \nabla_\theta\sum_a\policy_\theta(a\mid s) = \nabla_\theta 1 = 0. Multiplying by b(s)b(s) (constant in aa) and taking the outer expectation over SS preserves the zero.

  3. (Compute) For a softmax policy πθ(as)exp(θaϕ(s))\policy_\theta(a\mid s) \propto \exp(\theta_a^\top\phi(s)), compute the score θjlogπθ(as)\nabla_{\theta_j}\log\policy_\theta(a \mid s).

    Solution

    θjlogπθ(as)=(1[j=a]πθ(js))ϕ(s)\nabla_{\theta_j}\log\policy_\theta(a\mid s) = \big(\mathbf{1}[j=a] - \policy_\theta(j\mid s)\big)\phi(s) — the feature, weighted by “taken minus probability.” Summed over aπθa\sim\policy_\theta this is zero (Prop. 7.1), the discrete face of the expected-score identity.

  4. (Implement) In the companion, verify the returns-to-go computation, that the expected score is zero (the baseline mechanism), that a baseline reduces gradient variance without bias, and that REINFORCE learns CartPole above the random baseline.

    Solution

    See experiments/python/week07/test_reinforce.py: a hand-checked discounted return-to-go; aπ(as)θlogπ(as)0\sum_a\policy(a\mid s)\nabla_\theta\log\policy(a\mid s)\approx 0 for a softmax network; a bandit where the baselined estimator has strictly lower variance; and a seeded CartPole run whose mean return clears the ~22 random baseline.

  5. (Extend) Add an entropy bonus βH(πθ(s))\beta\,\mathcal{H}(\policy_\theta(\cdot\mid s)) to the objective and explain its effect on exploration. (The roadmap’s JAX jax.grad variant is deferred to the dedicated JAX track.)

    Solution

    The entropy bonus rewards less-peaked policies, slowing premature convergence to a deterministic policy and sustaining exploration; its gradient adds βθH\beta\nabla_\theta \mathcal{H}, pushing toward higher-entropy distributions. It reappears, promoted from a bonus to the objective, in maximum-entropy RL (Week 9, SAC).

Companion code

The Week-7 companion lives at experiments/python/week07/ and is PyTorch REINFORCE on CartPole-v1, with and without a value baseline.

  • reinforce.py — a softmax PolicyNetwork, the discounted compute_returns (returns-to-go), and a REINFORCE training loop with an optional learned-value baseline and return normalization.
  • test_reinforce.py — component-correctness tests (hand-checked returns-to-go; the expected score aπlogπ=0\sum_a\policy\,\nabla\log\policy = 0 that underlies Prop. 7.1; a bandit where the baselined gradient estimator has strictly lower variance) plus a seeded CartPole run learning well above the random-return baseline.
# component tests + a seeded CartPole learning check (PyTorch; ~1-2 min on CPU)
PYTHONPATH=. pytest experiments/python/week07/test_reinforce.py -q

# worked REINFORCE training run (with the value baseline)
PYTHONPATH=. python experiments/python/week07/reinforce.py --baseline --episodes 800
Part I · Foundations Week 8 Published ppo.py test_ppo.py

Actor-Critic, GAE, PPO, and TRPO

Turning REINFORCE into a stable, sample-reusing optimizer: actor-critic with a learned baseline, generalized advantage estimation as a bias–variance dial, and trust regions (TRPO/PPO) as step-size control in policy space. Why the clipped surrogate works, and why implementation details decide the score.

Actor-Critic, GAE, PPO, and TRPO

Where we are. REINFORCE (Chapter 7) was unbiased but high-variance and strictly on-policy: one noisy gradient step per batch of fresh trajectories, then throw the data away. This chapter turns it into the workhorse of modern on-policy RL by adding three things: a learned critic as the baseline (actor-critic), generalized advantage estimation to tune the advantage’s bias–variance, and a trust region (TRPO/PPO) that bounds how far each update moves the policy — which is what finally lets a batch be reused for several epochs. The roadmap’s framing is the through-line: trust regions are step-size control in policy space.

Actor-critic

Chapter 7 ended with the advantage form of the policy gradient, θJ=E[tθlogπθ(AtSt)A(St,At)]\nabla_\theta J = \E[\sum_t \nabla_\theta\log\policy_\theta(A_t\mid S_t)\,\advantage(S_t,A_t)]. Actor-critic makes the baseline a learned function: an actor πθ\policy_\theta and a critic vϕ\valuefn_\phi trained to predict returns, with the advantage estimated from the critic.

Advantage actor-critic (A2C) updates both together — the critic by regressing vϕ\valuefn_\phi toward the returns, the actor by ascending the score weighted by the critic’s advantage. Konda and Tsitsiklis Konda & Tsitsiklis (2000) established the two-timescale convergence of actor-critic with a linear critic; the critic plays the role of approximate policy evaluation in a sampled, function-approximated generalized policy iteration.

Generalized advantage estimation

How should the advantage be estimated? The one-step TD residual δt=Rt+1+γvϕ(St+1)vϕ(St)\delta_t = R_{t+1} + \discount\valuefn_\phi(S_{t+1}) - \valuefn_\phi(S_t) is itself a low-variance, high-bias advantage estimate; the full Monte Carlo advantage Gtvϕ(St)\return_t - \valuefn_\phi(S_t) is the reverse. Generalized advantage estimation Schulman et al. (2016) interpolates with an exponential weighting.

Proposition 8.1 (The GAE recursion).

The generalized advantage estimator

AtGAE(γ,λ)l0(γλ)lδt+l\advantage^{\text{GAE}(\discount,\lambda)}_t \defeq \sum_{l\ge 0}(\discount\lambda)^l\,\delta_{t+l}

satisfies the backward recursion

AtGAE=δt+γλAt+1GAE.\advantage^{\text{GAE}}_t = \delta_t + \discount\lambda\,\advantage^{\text{GAE}}_{t+1}.

Its endpoints are λ=0\lambda=0, giving At=δt\advantage_t=\delta_t (low variance, high bias), and λ=1\lambda=1, giving At=l0γlδt+l=Gtvϕ(St)\advantage_t=\sum_{l\ge0}\discount^l\delta_{t+l}=\return_t-\valuefn_\phi(S_t) (the Monte Carlo advantage: high variance, low bias).

Proof.

Split the defining sum at its first term:

l0(γλ)lδt+l=δt+l1(γλ)lδt+l=δt+γλl0(γλ)lδt+1+l=δt+γλAt+1GAE.\sum_{l\ge 0}(\discount\lambda)^l\delta_{t+l} = \delta_t + \sum_{l\ge 1}(\discount\lambda)^l\delta_{t+l} = \delta_t + \discount\lambda\sum_{l\ge 0}(\discount\lambda)^l\delta_{t+1+l} = \delta_t + \discount\lambda\,\advantage^{\text{GAE}}_{t+1}.

For λ=1\lambda=1, l0γlδt+l\sum_{l\ge0}\discount^l\delta_{t+l} telescopes: writing δt+l=Rt+l+1+γvϕ(St+l+1)vϕ(St+l)\delta_{t+l} = R_{t+l+1}+\discount\valuefn_\phi(S_{t+l+1})-\valuefn_\phi(S_{t+l}), the value terms cancel in pairs, leaving l0γlRt+l+1vϕ(St)=Gtvϕ(St)\sum_{l\ge0}\discount^l R_{t+l+1} - \valuefn_\phi(S_t) = \return_t-\valuefn_\phi(S_t). \qquad\blacksquare

The recursion is what the companion computes in one backward pass.

Intermediate λ\lambda (commonly 0.95\approx 0.95) usually beats both endpoints — the same lesson as nn-step TD, now applied to the advantage that weights the policy gradient.

Trust regions: TRPO and PPO

REINFORCE and A2C take one gradient step per batch because the advantage estimate is only valid near the policy that generated the data; a large step can collapse performance. The fix is to bound the update — to control the step size in policy space rather than parameter space.

TRPO Schulman et al. (2015) makes this literal: maximize the importance-weighted surrogate E[ρtAt]\E[\rho_t\,\advantage_t], with ratio ρt=πθ(AtSt)/πold(AtSt)\rho_t = \policy_\theta(A_t\mid S_t)/\policy_{\text{old}}(A_t\mid S_t), subject to a trust-region constraint E[KL(πoldπθ)]δ\E[\mathrm{KL}(\policy_{\text{old}}\,\|\,\policy_\theta)] \le \delta. The KL ball is the trust region; within it the linearized objective is reliable.

PPO Schulman et al. (2017) replaces the hard constraint with a cheaper clipped surrogate,

LCLIP(θ)=E[min(ρtAt, clip(ρt,1ϵ,1+ϵ)At)].L^{\text{CLIP}}(\theta) = \E\big[\min\big(\rho_t\,\advantage_t,\ \mathrm{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\,\advantage_t\big)\big].

The min\min makes LCLIPL^{\text{CLIP}} a pessimistic lower bound on the unclipped surrogate: when an advantage is positive, the objective stops rewarding increases in ρt\rho_t once ρt>1+ϵ\rho_t > 1+\epsilon (its gradient there is zero); when negative, it stops once ρt<1ϵ\rho_t < 1-\epsilon.

Either way there is no incentive to push the policy far from πold\policy_{\text{old}} in a single update, so PPO can safely take several epochs of minibatch updates per batch — the sample reuse REINFORCE lacked.

Implementation matters

A sobering empirical fact closes the family. Engstrom et al. Engstrom et al. (2020) showed that much of PPO’s advantage over TRPO comes not from the clipped objective but from code-level details — advantage normalization, value-function clipping, reward scaling, learning-rate annealing, orthogonal initialization — applied alongside it. Together with the broader reproducibility findings of Henderson et al. Henderson et al. (2018) , the lesson is that on-policy deep RL results must be read with the implementation, not just the algorithm, in view. The companion therefore tests the components (the GAE recursion, advantage normalization, the clip) and checks learning on a simple task, rather than chasing a benchmark number.

The dynamic-programming bridge

Actor-critic completes the generalized-policy-iteration picture under approximation. The critic is approximate policy evaluation (Chapter 1’s vπ\valuefn_\policy, learned from samples); the gradient actor is approximate improvement. Where policy iteration took the full greedy jump to arg maxaq\argmax_a \qfn, the actor takes a small step, and the trust region (TRPO’s KL ball, PPO’s clip) is exactly the step-size control that keeps improvement reliable when evaluation is noisy and local. Two bridges:

  • To continuous control (Week 9). Everything here is on-policy and stochastic; Week 9 turns to off-policy continuous control (DDPG, TD3, SAC), trading sample reuse-by-replay for the on-policy stability bought here by trust regions.
  • To optimal control (Part II). A trust region on the update is a regularized step — the policy-space analogue of a line search or a Levenberg–Marquardt damping in optimization, and a cousin of MPC’s receding horizon as a bound on how far ahead a single decision commits.

What’s next

  • Week 9 leaves on-policy methods for off-policy continuous control: the deterministic policy gradient (DDPG), its twin-critic fix (TD3), and maximum-entropy RL (SAC) — where the entropy bonus of Chapter 7 becomes the objective.

Exercises

  1. (Derive) Derive the GAE recursion AtGAE=δt+γλAt+1GAE\advantage^{\text{GAE}}_t = \delta_t + \discount\lambda\,\advantage^{\text{GAE}}_{t+1} from the exponential sum, and show λ=1\lambda=1 gives the Monte Carlo advantage (Prop. 8.1).

    Solution

    Split the sum at l=0l=0: l0(γλ)lδt+l=δt+γλl0(γλ)lδt+1+l=δt+γλAt+1GAE\sum_{l\ge0}(\discount\lambda)^l\delta_{t+l} = \delta_t + \discount\lambda\sum_{l\ge0}(\discount\lambda)^l\delta_{t+1+l} = \delta_t + \discount\lambda\advantage^{\text{GAE}}_{t+1}. At λ=1\lambda=1 the value terms in lγlδt+l\sum_l\discount^l\delta_{t+l} telescope to Gtvϕ(St)\return_t-\valuefn_\phi(S_t).

  2. (Prove) Show LCLIPL^{\text{CLIP}} is a lower bound on the unclipped surrogate E[ρtAt]\E[\rho_t\advantage_t], and identify where its gradient with respect to ρt\rho_t is zero.

    Solution

    min(ρA,clip(ρ)A)ρA\min(\rho\advantage, \mathrm{clip}(\rho)\advantage) \le \rho\advantage pointwise, so the expectation is a lower bound. For A>0\advantage>0 the clip caps the term at (1+ϵ)A(1+\epsilon)\advantage once ρ>1+ϵ\rho>1+\epsilon, where /ρ=0\partial/\partial\rho = 0; for A<0\advantage<0 it floors at (1ϵ)A(1-\epsilon)\advantage once ρ<1ϵ\rho<1-\epsilon, again with zero gradient. Inside [1ϵ,1+ϵ][1-\epsilon,1+\epsilon] the gradient is A\advantage.

  3. (Compute) With γ=0.99\discount=0.99, λ=0.95\lambda=0.95, and TD residuals δ=(1.0,0.5,0.2)\delta = (1.0, 0.5, -0.2) at the end of an episode (bootstrap zero), compute AGAE\advantage^{\text{GAE}}.

    Solution

    Backward: A2=0.2\advantage_2 = -0.2; A1=0.5+0.990.95(0.2)=0.3119\advantage_1 = 0.5 + 0.99\cdot0.95\cdot(-0.2) = 0.3119; A0=1.0+0.990.950.3119=1.2933\advantage_0 = 1.0 + 0.99\cdot0.95\cdot0.3119 = 1.2933. (The companion’s compute_gae reproduces these.)

  4. (Implement) In the companion, verify the GAE recursion matches the exponential sum and that λ=1\lambda=1 equals returns minus values; that advantage normalization produces mean-0/unit-variance advantages; that the PPO clip flattens the objective outside [1ϵ,1+ϵ][1-\epsilon,1+\epsilon]; and that PPO learns CartPole above the random baseline.

    Solution

    See experiments/python/week08/test_ppo.py: compute_gae vs the brute-force (γλ)lδ\sum(\discount\lambda)^l\delta; the λ=1\lambda=1 telescoping identity; the normalization statistics; the clip’s value/gradient outside the range; and a seeded PPO CartPole run clearing the ~22 random baseline.

  5. (Extend) Sweep the PPO clip range ϵ\epsilon and add KL early stopping; compare stability across seeds.

    Solution

    Smaller ϵ\epsilon tightens the trust region (more stable, slower); larger loosens it (faster, riskier). KL early stopping halts the epoch loop once the policy has moved a target KL from πold\policy_{\text{old}}, a direct enforcement of the trust region the clip only approximates — the companion’s --clip/--target-kl flags expose both.

Companion code

The Week-8 companion lives at experiments/python/week08/ and is PyTorch A2C and PPO on CartPole-v1, sharing one actor-critic network.

  • ppo.pycompute_gae (the backward recursion), an ActorCritic network, the ppo_clip_objective, and a PPO/A2C training loop with advantage normalization and a configurable clip range and epoch count.
  • test_ppo.py — component-correctness tests (compute_gae vs the closed-form exponential sum; the λ=1\lambda=1 = returns − values identity; advantage normalization; the PPO clip’s value and zero gradient outside [1ϵ,1+ϵ][1-\epsilon,1+\epsilon]) plus a seeded PPO CartPole run learning above the random baseline.
# component tests + a seeded CartPole learning check (PyTorch; ~1-2 min on CPU)
PYTHONPATH=. pytest experiments/python/week08/test_ppo.py -q

# worked PPO training run on CartPole
PYTHONPATH=. python experiments/python/week08/ppo.py --updates 150
Part I · Foundations Week 9 Published td3.py test_td3.py

Off-Policy Continuous Control: DDPG, TD3, and SAC

Off-policy actor-critic for continuous actions: the deterministic policy gradient (DDPG), the twin-critic overestimation fix (TD3), and maximum-entropy RL (SAC). Why the soft-optimal policy is Boltzmann in the action-value, and how maximum-entropy RL is KL-regularized optimal control — the bridge from learning to Part II.

Off-Policy Continuous Control: DDPG, TD3, and SAC

Where we are. Weeks 7–8 optimized stochastic policies on-policy. This chapter — the last of the RL foundations — turns to off-policy continuous control, where two ideas dominate the model-free baselines. First, the deterministic policy gradient (DDPG) makes continuous actions tractable by pushing the gradient through a differentiable critic instead of sampling them. Second, maximum-entropy RL (SAC) augments the reward with policy entropy, which both stabilizes learning and reveals a deep identity: maximum-entropy RL is KL-regularized optimal control — the bridge out of learning and into Part II. Between them, TD3 fixes the overestimation that Chapter 6 first diagnosed, now arising from the critic’s own bootstrap.

The deterministic policy gradient

With a continuous action space the stochastic policy gradient (Chapter 7) must average the score over actions — expensive and high-variance. A deterministic policy μθ:SA\mu_\theta:\statespace\to\actionspace avoids the action integral, and its gradient flows through the critic by the chain rule.

Theorem 9.1 (Deterministic policy gradient).

For a deterministic policy μθ\mu_\theta and its action-value qμθ\qfn_{\mu_\theta}, under mild regularity the gradient of the off-policy objective is

θJ(θ)=Esρ ⁣[θμθ(s)aqμθ(s,a)a=μθ(s)],\nabla_\theta J(\theta) = \E_{s\sim\rho}\!\Big[\,\nabla_\theta\mu_\theta(s)\,\nabla_a \qfn_{\mu_\theta}(s,a)\big|_{a=\mu_\theta(s)}\Big],

where ρ\rho is the (off-policy) state distribution. The policy is improved by pushing its output up the critic’s action-gradient.

This is the continuous-action analogue of greedy improvement: where a discrete agent takes arg maxaq\argmax_a \qfn, the deterministic actor takes a gradient step in aa toward larger q\qfn.

Silver et al. Silver et al. (2014) proved the theorem (as the zero-variance limit of the stochastic gradient); Lillicrap et al. Lillicrap et al. (2016) turned it into DDPG — the deterministic actor and critic trained off-policy with the replay buffer and target networks of Chapter 6, and exploration noise added to the actor’s output.

Overestimation returns: TD3

DDPG inherits the deadly-triad fragility and the overestimation bias of Chapter 6 — now produced by the critic bootstrapping on its own optimistic estimates. TD3 Fujimoto et al. (2018) applies three fixes, the first echoing Double DQN:

  1. Twin critics, take the minimum. Train two critics qϕ1,qϕ2\qfn_{\phi_1},\qfn_{\phi_2} and form the target with min(qϕ1,qϕ2)\min(\qfn_{\phi_1},\qfn_{\phi_2})clipped double Q-learning, a continuous cousin of Double DQN that caps the upward bias of Proposition 6.1.
  2. Delayed policy updates. Update the actor (and targets) less often than the critics, so the policy chases a more settled value.
  3. Target policy smoothing. Add clipped noise to the target action, regularizing the critic against sharp peaks it could otherwise exploit.

The first is the load-bearing one: taking the minimum of two independent critics systematically underestimates, which is far safer for a bootstrapped target than the overestimate the single-critic max produces.

Maximum-entropy RL and SAC

A different idea reshapes the objective itself. Maximum-entropy RL adds the policy’s entropy H(π(s))\mathcal{H}(\policy(\cdot\mid s)) to the reward, scaled by a temperature α\alpha:

J(π)=Eπ ⁣[tr(St,At)+αH(π(St))].J(\policy) = \E_\policy\!\Big[\sum_t \reward(S_t,A_t) + \alpha\,\mathcal{H}\big(\policy(\cdot\mid S_t)\big)\Big].

The agent is rewarded for acting well and for staying stochastic — sustaining exploration and robustness. Soft actor-critic Haarnoja et al. (2018) is the off-policy actor-critic for this objective, with an automatically tuned temperature Haarnoja et al. (2019) . The one-step soft problem has a clean optimum.

Proposition 9.1 (The soft-optimal policy is Boltzmann).

Maximizing aπ(as)q(s,a)+αH(π(s))\sum_a \policy(a\mid s)\,\qfn(s,a) + \alpha\,\mathcal{H}(\policy(\cdot\mid s)) over distributions π(s)\policy(\cdot\mid s) gives the Boltzmann policy

π(as)=exp ⁣(q(s,a)/α)aexp ⁣(q(s,a)/α),\policy^*(a\mid s) = \frac{\exp\!\big(\qfn(s,a)/\alpha\big)}{\sum_{a'}\exp\!\big(\qfn(s,a')/\alpha\big)},

with optimal soft value vsoft(s)=αlogaexp(q(s,a)/α)\valuefn^*_{\text{soft}}(s) = \alpha\log\sum_a\exp(\qfn(s,a)/\alpha). As α0\alpha\to0 this recovers the greedy arg max\argmax and the ordinary value.

Proof.

Write H(π)=aπ(as)logπ(as)\mathcal{H}(\policy)=-\sum_a\policy(a\mid s)\log\policy(a\mid s) and maximize aπ(as)[q(s,a)αlogπ(as)]\sum_a\policy(a\mid s)\big[\qfn(s,a)-\alpha\log\policy(a\mid s)\big] subject to aπ(as)=1\sum_a\policy(a\mid s)=1. The Lagrangian’s stationarity in π(as)\policy(a\mid s) gives q(s,a)αlogπ(as)αλ=0\qfn(s,a)-\alpha\log\policy(a\mid s)-\alpha-\lambda=0, so logπ(as)=q(s,a)/α+const\log\policy(a\mid s) = \qfn(s,a)/\alpha + \text{const}, i.e. π(as)exp(q(s,a)/α)\policy(a\mid s)\propto\exp(\qfn(s,a)/\alpha). Normalizing gives the Boltzmann form; substituting back gives vsoft(s)=αlogaexp(q(s,a)/α)\valuefn^*_{\text{soft}}(s)=\alpha\log\sum_a\exp(\qfn(s,a)/\alpha), the log-sum-exp “soft maximum.” As α0\alpha\to0 the soft max maxaq\to\max_a\qfn and π\policy^*\to greedy. \qquad\blacksquare

The entropy bonus of Chapter 7 has been promoted from a heuristic to the objective, and its optimum is a temperature-controlled softmax over the action-value.

The dynamic-programming bridge

Maximum-entropy RL is where learning rejoins control. The soft value αlogaexp(q/α)\alpha\log\sum_a\exp(\qfn/\alpha) is the optimal cost-to-go of a KL-regularized control problem: maximizing reward minus αKL(ππ0)\alpha\,\mathrm{KL}(\policy\,\|\,\policy_0) against a reference π0\policy_0 yields exactly the Boltzmann policy of Proposition 9.1, and Todorov’s linearly-solvable MDPs exploit precisely this — under the exponential transform z=exp(vsoft/α)z = \exp(\valuefn_{\text{soft}}/\alpha) the soft Bellman equation becomes linear. Three threads close Part I:

  • Continuous-action improvement is the deterministic actor (DPG) or the soft Boltzmann policy (SAC), replacing the discrete arg max\argmax — approximate policy iteration (Chapter 1) in a continuum.
  • Overestimation control (TD3’s twin-min) is the same bias management as Double DQN (Chapter 6), now load-bearing because the bootstrap runs through a critic.
  • To Part II. KL-regularized control, the maximum-entropy LQR with its closed form, and the deterministic limit (α0\alpha\to0) that is classical optimal control are the entry points to LQR (Week 13) and MPC (Week 15) — the model-based, deterministic side of the same fixed point.

What’s next

  • Week 10 steps back to ask where RL has actually worked in the real world — a survey of robotics successes (locomotion, manipulation, flight, plasma control) and the conditions that made them possible.
  • Part II (Week 11+) then changes register entirely, to control theory: stability, LQR, and model predictive control — met now from the RL side, and rejoined with it in Part III.

Exercises

  1. (Derive) Starting from J(θ)=Esρ[qμθ(s,μθ(s))]J(\theta)=\E_{s\sim\rho}[\qfn_{\mu_\theta}(s,\mu_\theta(s))], derive the deterministic policy gradient by the chain rule (Theorem 9.1).

    Solution

    θq(s,μθ(s))=θμθ(s)aq(s,a)a=μθ(s)\nabla_\theta\qfn(s,\mu_\theta(s)) = \nabla_\theta\mu_\theta(s)\,\nabla_a\qfn(s,a)|_{a=\mu_\theta(s)} by the chain rule (the explicit ss-dependence of q\qfn is held fixed); taking the expectation over sρs\sim\rho gives Theorem 9.1. The state distribution ρ\rho may be off-policy, which is why DDPG can learn from a replay buffer.

  2. (Prove) Show the maximum-entropy one-step optimal policy is π(as)exp(q(s,a)/α)\policy^*(a\mid s)\propto\exp(\qfn(s,a)/\alpha) with soft value αlogaexp(q(s,a)/α)\alpha\log\sum_a\exp(\qfn(s,a)/\alpha) (Prop. 9.1).

    Solution

    Maximize aπ(qαlogπ)\sum_a\policy(\qfn-\alpha\log\policy) under aπ=1\sum_a\policy=1; stationarity gives qαlogπαλ=0\qfn-\alpha\log\policy-\alpha-\lambda=0, so πexp(q/α)\policy\propto\exp(\qfn/\alpha). Substituting the normalized policy back yields the log-sum-exp soft value. The α0\alpha\to0 limit recovers the hard max\max.

  3. (Compute) A target state has twin-critic estimates qϕ1=2.0\qfn_{\phi_1}=2.0, qϕ2=1.4\qfn_{\phi_2}=1.4 for the target action. What value does TD3 use, and why is the minimum the safer choice for a bootstrap target?

    Solution

    TD3 uses min(2.0,1.4)=1.4\min(2.0,1.4)=1.4. The single-critic max/overestimate (Prop. 6.1) compounds through bootstrapping; taking the minimum of two independent estimates biases downward, and a slight underestimate does not amplify across the backup the way an overestimate does.

  4. (Implement) In the companion, verify the twin-critic minimum lowers the target versus a single critic; that target-policy-smoothing noise is clipped to range; the Polyak soft target update; and that minimal TD3 learns Pendulum above the random return.

    Solution

    See experiments/python/week09/test_td3.py: the clipped-double-Q target equals the per-sample minimum of the twin critics (≤ either); the smoothing noise and target action respect their clip bounds; the Polyak update matches its closed form; and a seeded TD3 run on Pendulum-v1 clears the random-return baseline by a wide margin.

  5. (Extend) Sweep the SAC temperature α\alpha and relate the limit α0\alpha\to0 (greedy) and large α\alpha (uniform). (The roadmap’s JAX/Brax SAC baseline is deferred to the dedicated JAX track.)

    Solution

    Small α\alpha concentrates the Boltzmann policy on the arg max\argmax (exploitation, recovering ordinary RL); large α\alpha flattens it toward uniform (maximal exploration). Automatic temperature tuning Haarnoja et al. (2019) adjusts α\alpha to hold a target entropy rather than fixing it by hand.

Companion code

The Week-9 companion lives at experiments/python/week09/ and is a minimal TD3 on Pendulum-v1 (the chapter’s testable centerpiece), with Stable-Baselines3 named as the reference baseline.

  • td3.py — a continuous-action ReplayBuffer, a deterministic Actor and twin Critics, the exposed td3_target (clipped double-Q with target-policy smoothing), Polyak soft_update, and the training loop. Pure PyTorch.
  • test_td3.py — component-correctness tests (the twin-critic minimum lowers the target; smoothing-noise and target-action clipping; the Polyak update’s closed form) plus a seeded Pendulum-v1 run learning well above the random return.
# component tests + a seeded Pendulum learning check (PyTorch; ~1-2 min on CPU)
PYTHONPATH=. pytest experiments/python/week09/test_td3.py -q

# worked minimal-TD3 training run on Pendulum
PYTHONPATH=. python experiments/python/week09/td3.py --steps 40000
Part I · Foundations Week 10 Published

Reinforcement Learning in the Real World: A Robotics Survey

Where reinforcement learning has actually worked on real hardware — dexterous manipulation, legged locomotion, champion drone racing, and tokamak plasma control — and the recipe behind those successes: abundant simulation, domain randomization, and RL reserved for where models are hard or adaptation is required. The empirical case for RL, and why it sets up Part II.

Reinforcement Learning in the Real World: A Robotics Survey

Where we are. Part I built the algorithms — dynamic programming, Monte Carlo and temporal difference, deep value learning, and policy-gradient and actor-critic methods through continuous control. This capstone asks the empirical question those chapters deferred: where has reinforcement learning actually worked on real hardware, and what conditions made it work? The honest answer is both encouraging and narrowing — a handful of genuine, high-profile successes that share a remarkably consistent recipe, and which together explain when to reach for RL and when classical control still wins. This is a reading-and-synthesis week: no new algorithm, a small evidence table, and the argument it supports.

The cases

Dexterous manipulation. OpenAI’s Rubik’s-Cube hand Akkaya et al. (2019) trained a Shadow Hand entirely in simulation and transferred to physical hardware via automatic domain randomization (ADR) — progressively widening the distribution of simulated physics (friction, masses, latencies) until the real robot looked like just another sample. The policy solved a Rubik’s cube one-handed despite never touching a real cube during training.

Legged locomotion. The most mature success story. Lee et al. Lee et al. (2020) trained a blind quadruped to traverse challenging terrain by teacher–student distillation: a privileged teacher with full state trains a proprioception-only student deployable on hardware. Miki et al. Miki et al. (2022) added exteroception, producing robust perceptive locomotion over stairs, gaps, and obstacles outdoors. Radosavovic et al. Radosavovic et al. (2024) carried the same sim-to-real recipe to a full-size humanoid, walking zero-shot in outdoor environments.

Agile flight. Kaufmann et al. Kaufmann et al. (2023) trained an RL policy that beat human world champions at physical first-person-view drone racing — a regime where split-second control at the edge of the dynamics envelope defeats hand-tuned controllers.

Scientific and industrial control. The standout non-locomotion case: Degrave et al. Degrave et al. (2022) used deep RL to magnetically control the plasma shape in a real tokamak, coordinating dozens of control coils against many simultaneous constraints — a problem where accurate first-principles control is genuinely hard and the payoff of learned control is large.

Human-in-the-loop methods (collecting corrective feedback during training) are an active frontier for precise manipulation, pushing RL toward sub-millimeter industrial tasks; the cross-cutting survey of deployed systems by Tang et al. Tang et al. (2025) catalogues these and the recurring sim-to-real patterns.

An evidence table

The roadmap’s deliverable for this week is an evidence table, not code — laying the cases side by side exposes the shared recipe.

| System | Task | Trained in | Real hardware | Sim-to-real / adaptation | Control baseline beaten | |---|---|---|---|---|---| | Rubik’s hand Akkaya et al. (2019) | dexterous manipulation | MuJoCo sim | Shadow Hand | automatic domain randomization | no prior model-based in-hand dexterity controller | | Quadruped Lee et al. (2020) | blind rough-terrain locomotion | rigid-body sim | ANYmal | teacher–student distillation | model-based gait controllers | | Quadruped (wild) Miki et al. (2022) | perceptive locomotion | rigid-body sim | ANYmal | proprio + exteroception, randomization | model-based perceptive control | | Humanoid Radosavovic et al. (2024) | bipedal walking | sim | full-size humanoid | zero-shot sim-to-real | model-based whole-body control | | Drone racing Kaufmann et al. (2023) | agile FPV flight | sim + identified dynamics | racing quadrotor | system identification + randomization | human world champions (and prior autonomous baselines) | | Tokamak Degrave et al. (2022) | plasma magnetic control | physics simulator | TCV tokamak | sim-to-real on a calibrated model | conventional multi-loop controllers |

The empirical case for RL

Read down the table and the same three conditions recur — the conditions under which RL is the right tool Tang et al. (2025) :

  1. Model accuracy is hard. Contact-rich locomotion, dexterous manipulation, and plasma dynamics resist accurate first-principles models. RL learns from interaction what is painful to derive.
  2. Adaptation is required. Terrain, disturbances, and hardware variation demand a policy robust across conditions; domain randomization turns that need into a training distribution the policy generalizes over.
  3. Simulation is abundant. Every case generates its training data in a fast, parallel simulator — the millions of samples RL needs are free there and ruinously expensive on hardware.

Where these fail to hold — no faithful simulator, safety-critical systems with no simulation budget, or problems with good analytic models — classical and optimal control remain the better choice. The successes are real but conditional, and naming the conditions is the point of the survey.

The simulation paradox, and the bridge to Part II

The deepest lesson is a paradox. These are triumphs of model-free policy learning — the deployed policy plans through no learned dynamics model — yet not one of them learned on hardware from scratch. Each trained inside a simulator (a model) and randomized it to cross the reality gap. Model-free RL conquers the physical world precisely by leaning on an abundant, deliberately imperfect model. Stated that way, the question Part II answers comes into focus: when you already have a good model, why learn around it? Control theory — Lyapunov stability, the LQR’s closed-form optimal feedback, model predictive control’s online re-optimization — exploits the model directly, with guarantees RL cannot offer. Part III then fuses the two: MPC that learns its model or cost, and RL warm-started or constrained by a controller. The robotics successes are where the model-free spine of Part I meets reality; the model-based spine of Part II is the other half of the same fixed point.

What’s next

  • Part II (Week 11+) changes register from learning to control theory: dynamical systems and Lyapunov stability, the linear-quadratic regulator as the exactly-solvable optimal control problem (the Bellman fixed point in closed form), and model predictive control as online approximate dynamic programming. The discount-contraction spine of Part I reappears as the Riccati equation and the receding horizon.

Exercises

  1. (Compute) Add one more deployed RL system to the evidence table (e.g. a warehouse, autonomous-driving, or data-center cooling deployment) and fill all six columns from its paper. Which of the three conditions does it satisfy?

    Solution

    Most deployed cases satisfy all three (hard model, adaptation, abundant sim); data-center cooling is the interesting partial case — a learned model substitutes for a hard-to-build simulator, stretching condition 3. The exercise is to make the classification explicit and defend it from the paper’s methods section.

  2. (Derive) From the six cases, state the three enabling conditions precisely, and give a concrete robotics task that fails at least one — predicting that classical control should win there.

    Solution

    A safety-critical task with no faithful simulator and a tight real-data budget (e.g. a one-off surgical manipulator) fails conditions 1 and 3: RL’s sample appetite cannot be met and the reality gap cannot be closed, so model-based / optimal control with formal guarantees is the appropriate tool.

  3. (Extend) Implement a tiny domain-randomization toy: train a policy (e.g. the Week-7 REINFORCE or Week-9 TD3 companion) on a Gymnasium env whose dynamics are randomized each episode (mass, gravity, or action scale), and measure its robustness to a held-out dynamics setting versus a policy trained without randomization.

    Solution

    Randomizing a physics parameter each reset trains a policy over a distribution of dynamics; on a held-out setting it should degrade far less than the non-randomized policy, reproducing in miniature the sim-to-real mechanism every case above relies on. This is the one optional code task of an otherwise reading-only week.

  4. (Extend) Pick one case and argue where a Part-II model-based controller (LQR or MPC) could replace or augment the learned policy. What would each approach require?

    Solution

    The tokamak is the natural candidate: a calibrated plasma model already exists, so an MPC could in principle re-optimize the coil currents online — at the cost of solving a constrained optimization every control step, which is exactly what RL amortizes into a fast policy. The trade is online compute and model fidelity (MPC) versus training cost and the reality gap (RL) — the Part III hybrid uses both.

Part: control Week 11 Published state_space.py test_state_space.py

State-Space Models and Transfer Functions

The entry to control theory: the linear state-space model, the transfer function, and their equivalence. Two views of one dynamical system, the state-similarity invariance that makes the transfer function the coordinate-free object, and the discrete-time model that is exactly the MDP dynamics control assumes known.

State-Space Models and Transfer Functions

Where we are. Part I learned policies and values from interaction, with the dynamics unknown or only sampled. Part II turns to control theory, which starts from the opposite end — an explicit model of the dynamics — and asks what optimal feedback that model permits. The foundational object is the linear state-space model h˙=Ah+Bu\dot{\statevec} = \statemat\statevec + \inputmat u, $y = \outputmat\statevec

  • \feedmat u,anditsfrequencydomaintwin,thetransferfunction, and its frequency-domain twin, the **transfer function** H(s) = \outputmat(s I - \statemat)^-1\inputmat + \feedmat$. They are two views of one system. This chapter sets up that object — and shows its discrete-time form is exactly the known MDP dynamics that LQR (Week 13) will solve in closed form.

The linear state-space model

Definition 11.1 (Linear time-invariant state-space model).

A continuous-time LTI system with state hRn\statevec\in\R^n, input uRmu\in\R^m, and output yRpy\in\R^p is

h˙=Ah+Bu,y=Ch+Du,\dot{\statevec} = \statemat\statevec + \inputmat u, \qquad y = \outputmat\statevec + \feedmat u,

with ARn×n\statemat\in\R^{n\times n}, BRn×m\inputmat\in\R^{n\times m}, CRp×n\outputmat\in\R^{p\times n}, DRp×m\feedmat\in\R^{p\times m}. Its discrete-time counterpart, sampled at step Δ\stepsize, is hk+1=Aˉhk+Bˉuk\statevec_{k+1} = \discA\statevec_k + \discB u_k, yk=Chk+Duky_k = \outputmat\statevec_k + \feedmat u_k.

The running example is the mass–spring–damper mq¨+cq˙+kq=um\ddot{q} + c\dot{q} + kq = u: taking the state h=(q,q˙)\statevec = (q, \dot q) gives A=(01k/mc/m)\statemat = \begin{pmatrix} 0 & 1 \\ -k/m & -c/m \end{pmatrix}, B=(01/m)\inputmat = \begin{pmatrix} 0 \\ 1/m \end{pmatrix}, and C=(10)\outputmat = \begin{pmatrix} 1 & 0 \end{pmatrix} if we measure position. A second-order mechanical law has become a first-order vector ODE — the form every method in Part II acts on.

The transfer function

Laplace-transforming the state equation at zero initial condition and eliminating the state gives the input–output map directly.

Definition 11.2 (Transfer function).

The transfer function of (A,B,C,D)(\statemat,\inputmat,\outputmat,\feedmat) is the rational matrix

H(s)=C(sIA)1B+D,H(s) = \outputmat\,(s I - \statemat)^{-1}\inputmat + \feedmat,

mapping input to output in the Laplace domain, Y(s)=H(s)U(s)Y(s) = H(s)\,U(s). Its poles are the eigenvalues of A\statemat (those not cancelled by zeros), and its zeros are the frequencies it blocks.

For the mass–spring–damper the single-input single-output transfer function is H(s)=1ms2+cs+kH(s) = \tfrac{1}{m s^2 + c s + k} — the poles are the roots of the characteristic polynomial, i.e. the eigenvalues of A\statemat, exactly the natural frequencies of the oscillator. Åström and Murray \AAström & Murray (2021) develop both views and their interplay; the state-space form carries the internal dynamics, the transfer function the external behaviour.

Two views, one system

The two representations are equivalent, with one asymmetry worth making precise: the transfer function is unique, but the state-space realization is not — any invertible change of state coordinates leaves the input–output behaviour untouched.

Proposition 11.1 (State-similarity invariance).

For any invertible TRn×nT\in\R^{n\times n}, the coordinate change h=Tz\statevec = T z produces the system (T1AT,  T1B,  CT,  D)(\,T^{-1}\statemat T,\; T^{-1}\inputmat,\; \outputmat T,\; \feedmat\,) with the same transfer function H(s)H(s).

Proof.

Substitute h=Tz\statevec = Tz into Definition 11.1: z˙=T1ATz+T1Bu\dot z = T^{-1}\statemat T z + T^{-1}\inputmat u and y=CTz+Duy = \outputmat T z + \feedmat u, so the transformed matrices are as stated. Its transfer function is

H~(s)=(CT)(sIT1AT)1(T1B)+D(Def. 11.2 on the transformed system)=CT[T1(sIA)T]1T1B+D(sI=T1(sI)T)=CTT1(sIA)1TT1B+D=C(sIA)1B+D=H(s).(TT1=I)\begin{aligned} \tilde H(s) &= (\outputmat T)\big(sI - T^{-1}\statemat T\big)^{-1}(T^{-1}\inputmat) + \feedmat && \text{(Def. 11.2 on the transformed system)} \\ &= \outputmat T\,\big[\,T^{-1}(sI - \statemat)T\,\big]^{-1}T^{-1}\inputmat + \feedmat && \text{($sI = T^{-1}(sI)T$)} \\ &= \outputmat T\,T^{-1}(sI - \statemat)^{-1}T\,T^{-1}\inputmat + \feedmat = \outputmat(sI - \statemat)^{-1}\inputmat + \feedmat = H(s). && \text{($T T^{-1} = I$)} \end{aligned}

The similarity transform cancels, leaving HH unchanged. \qquad\blacksquare

So the state-space model is a coordinate choice on the same dynamics, and the transfer function is the coordinate-free invariant.

The reverse direction — building a state-space model from a transfer function — is realization, and its non-uniqueness (canonical forms) is the companion’s round-trip check.

Discretization: the bridge to a step map

Control plants are continuous, but computation and the MDP view are discrete. Exact zero-order-hold (ZOH) discretization at step Δ\stepsize — holding uu constant across the interval — gives

Aˉ=eAΔ,Bˉ=(0ΔeAτdτ)B,\discA = e^{\statemat\stepsize}, \qquad \discB = \Big(\textstyle\int_0^{\stepsize} e^{\statemat\tau}\,d\tau\Big)\inputmat,

the matrix exponential of AΔ\statemat\stepsize. The cheap alternative, forward Euler, takes AˉI+AΔ\discA \approx I + \statemat\stepsize — the first two terms of that exponential, accurate only for small Δ\stepsize. Either way the result hk+1=Aˉhk+Bˉuk\statevec_{k+1} = \discA\statevec_k + \discB u_k is a deterministic step map.

The dynamic-programming bridge

That step map closes the loop with Part I. hk+1=Aˉhk+Bˉuk\statevec_{k+1} = \discA\statevec_k + \discB u_k is precisely a Markov decision process’s transition — deterministic, linear, and known — where Chapters 1–10 had a stochastic kernel learned from samples. Control theory’s premise is that the model is in hand, so the Bellman recursion can be solved exactly rather than approximated from experience. Three threads run forward from here:

  • To stability and structure (Week 12). Before designing any controller, ask what the model permits: stability from the eigenvalues of A\statemat, and whether the input can steer and the output can observe the full state.
  • To LQR (Week 13). Put a quadratic cost on this linear model and the Bellman equation has a closed-form solution — the Riccati equation — the exact dynamic-programming fixed point Chapter 1’s HJB aside promised.
  • To SSMs (Week 24). The very same (A,B,C,D)(\statemat,\inputmat,\outputmat,\feedmat) and its ZOH discretization are a structured-state-space neural layer (S4, Mamba); the control plant and the sequence model share one skeleton.

What’s next

  • Week 12 asks what can be known about a state-space model before designing a controller: stability (eigenvalue locations, Lyapunov), and the structural properties of controllability and observability that decide whether control and estimation are even possible.

Exercises

  1. (Derive) From mq¨+cq˙+kq=um\ddot{q} + c\dot{q} + kq = u with state h=(q,q˙)\statevec=(q,\dot q), derive the matrices A,B,C\statemat,\inputmat,\outputmat and the transfer function H(s)=1/(ms2+cs+k)H(s) = 1/(m s^2 + c s + k).

    Solution

    q˙1=q2\dot q_1 = q_2 and q˙2=(ucq2kq1)/m\dot q_2 = (u - c q_2 - k q_1)/m give the stated A,B\statemat, \inputmat; measuring position gives C=(1 0)\outputmat = (1\ 0). Then H(s)=C(sIA)1B=1det(sIA)1m=1ms2+cs+kH(s) = \outputmat(sI-\statemat)^{-1}\inputmat = \tfrac{1}{\det(sI-\statemat)}\cdot\tfrac1m = \tfrac{1}{m s^2 + c s + k}, since det(sIA)=s2+(c/m)s+k/m\det(sI-\statemat) = s^2 + (c/m)s + k/m.

  2. (Prove) Show the transfer function is invariant under a state-similarity transform h=Tz\statevec = Tz (Prop. 11.1), identifying where TT cancels.

    Solution

    The transformed system is (T1AT,T1B,CT,D)(T^{-1}\statemat T, T^{-1}\inputmat, \outputmat T, \feedmat); using sIT1AT=T1(sIA)TsI - T^{-1}\statemat T = T^{-1}(sI-\statemat)T and inverting, the TT and T1T^{-1} on either side of (sIA)1(sI-\statemat)^{-1} cancel against CT\outputmat T and T1BT^{-1}\inputmat, returning H(s)H(s).

  3. (Compute) For A=(0123)\statemat=\begin{pmatrix}0&1\\-2&-3\end{pmatrix}, B=(01)\inputmat=\begin{pmatrix}0\\1\end{pmatrix}, C=(1 0)\outputmat=(1\ 0), D=0\feedmat=0, compute H(s)H(s) and its poles.

    Solution

    det(sIA)=s2+3s+2=(s+1)(s+2)\det(sI-\statemat) = s^2+3s+2 = (s+1)(s+2), and H(s)=1/(s2+3s+2)H(s) = 1/(s^2+3s+2). Poles at s=1,2s=-1,-2 — the eigenvalues of A\statemat, both in the left half-plane (stable, as Week 12 will formalize).

  4. (Implement) In the companion, verify the SS→TF→SS round-trip preserves the transfer function, the mass–spring–damper TF matches 1/(ms2+cs+k)1/(ms^2+cs+k), and the step response settles to the DC gain H(0)=1/kH(0)=1/k.

    Solution

    See experiments/python/week11/test_state_space.py: python-control converts SS↔TF (round-trip preserves the rational function up to a state similarity); the mass–spring–damper TF coefficients match the analytic form; and the unit-step final value equals the final-value-theorem prediction H(0)=1/kH(0)=1/k.

  5. (Extend) Compare exact ZOH (Aˉ=eAΔ\discA = e^{\statemat\stepsize}) against forward Euler (AˉI+AΔ\discA \approx I + \statemat\stepsize): how does the discretization error scale with Δ\stepsize?

    Solution

    ZOH is exact for piecewise-constant input; Euler is its first-order truncation, with error O(Δ2)O(\stepsize^2) per step (it drops the 12A2Δ2\tfrac12\statemat^2\stepsize^2 and higher terms of the exponential). The companion shows the gap vanishing as Δ0\stepsize\to0 and growing — eventually destabilizing Euler — for large Δ\stepsize.

Companion code

The Week-11 companion lives at experiments/python/week11/ and is the first control companion (Python, on scipy + the python-control library).

  • state_space.py — builds the mass–spring–damper state-space model, converts SS↔TF (python-control), simulates impulse/step/forced responses, and discretizes via exact ZOH (scipy.linalg.expm) and forward Euler.
  • test_state_space.py — mathematical-correctness tests: the SS→TF→SS round-trip preserves the transfer function; the mass–spring–damper TF equals 1/(ms2+cs+k)1/(ms^2+cs+k); the step response settles to the DC gain H(0)=1/kH(0)=1/k; ZOH equals eAΔe^{\statemat\stepsize} while Euler is its first-order truncation; and the transfer function is invariant under a random state-similarity transform (Prop. 11.1).
# core algorithms + correctness tests (scipy + python-control)
PYTHONPATH=. pytest experiments/python/week11/test_state_space.py -q

# worked mass-spring-damper SS/TF + response plots (saved locally, not committed)
PYTHONPATH=. python experiments/python/week11/state_space.py --plot
Part: control Week 12 Published structural.py test_structural.py

Stability, Controllability, and Observability

The structural properties of a linear model knowable before any controller exists: internal stability via eigenvalues and the Lyapunov equation, controllability and observability as reachability and state-identifiability, and the duality that makes them one theory — and makes the LQR regulator and the Kalman estimator one computation.

Stability, Controllability, and Observability

Where we are. Week 11 built the linear state-space model and showed the transfer function is its coordinate-free invariant. Before designing any controller, three structural questions decide what is achievable on that model: is it internally stable; can the input steer the whole state (controllability); can the output reconstruct the whole state (observability)? These are the linear-systems analogues of reachability and identifiability in reinforcement learning — they bound what any policy or estimator can do, before optimality is even on the table. LQR and the Kalman filter (Week 13) will need exactly the mild weakenings of these properties.

Internal stability

Definition 12.1 (Asymptotic stability).

The autonomous system h˙=Ah\dot{\statevec} = \statemat\statevec (zero input) is asymptotically stable if h(t)0\statevec(t) \to 0 as tt \to \infty from every initial state. Its discrete-time counterpart hk+1=Ahk\statevec_{k+1} = \statemat\statevec_k is asymptotically stable if hk0\statevec_k \to 0 for every h0\statevec_0.

For a linear system, asymptotic stability is decided entirely by the spectrum of A\statemat.

Theorem 12.1 (Spectral stability condition).

The continuous-time system h˙=Ah\dot{\statevec} = \statemat\statevec is asymptotically stable iff every eigenvalue of A\statemat has strictly negative real part — A\statemat is Hurwitz, spec(A){λ:Reλ<0}\spec(\statemat) \subset \{\lambda : \operatorname{Re}\lambda < 0\}. The discrete-time system is asymptotically stable iff every eigenvalue lies strictly inside the unit disk — A\statemat is Schur, the spectral radius ρ(A)<1\spectralradius(\statemat) < 1.

The reason is the modal decomposition: solutions of h˙=Ah\dot{\statevec} = \statemat\statevec are combinations of eλte^{\lambda t} over the eigenvalues λ\lambda (with polynomial factors at repeated eigenvalues), and eλt0e^{\lambda t} \to 0 iff Reλ<0\operatorname{Re}\lambda < 0; in discrete time the modes are λk\lambda^k, which vanish iff λ<1|\lambda| < 1.

Eigenvalues answer the question but require an eigendecomposition. The Lyapunov equation gives an equivalent algebraic certificate — and it is the template for the Riccati equation of Week 13.

Theorem 12.2 (Lyapunov stability theorem).

A\statemat is Hurwitz iff for every symmetric Q0Q \succ 0 the Lyapunov equation

AP+PA=Q\statemat^\top P + P\statemat = -Q

has a unique solution PP, and that PP is symmetric positive definite. Then V(h)=hPhV(\statevec) = \statevec^\top P \statevec is a Lyapunov function: V0V \succ 0 and V˙=hQh<0\dot V = -\statevec^\top Q \statevec < 0 along every nonzero trajectory.

Proof.

(\Leftarrow, the construction.) Suppose A\statemat is Hurwitz and fix Q0Q \succ 0. Define

P0eAtQeAtdt.P \defeq \int_0^\infty e^{\statemat^\top t}\, Q\, e^{\statemat t}\,dt .

Hurwitzness gives eAtMeαt\lVert e^{\statemat t} \rVert \le M e^{-\alpha t} for some M,α>0M,\alpha > 0, so the integrand decays exponentially and the integral converges; PP is symmetric because QQ is. It is positive definite: for h0\statevec \neq 0, hPh=0(eAth)Q(eAth)dt>0\statevec^\top P \statevec = \int_0^\infty (e^{\statemat t}\statevec)^\top Q\,(e^{\statemat t}\statevec)\,dt > 0 since Q0Q \succ 0 and eAth0e^{\statemat t}\statevec \neq 0. Finally it solves the equation:

AP+PA=0(AeAtQeAt+eAtQeAtA)dt(linearity)=0ddt ⁣(eAtQeAt)dt(ddteAt=AeAt=eAtA)=[eAtQeAt]0=0Q=Q.(Hurwitz: upper limit vanishes)\begin{aligned} \statemat^\top P + P\statemat &= \int_0^\infty \big(\statemat^\top e^{\statemat^\top t} Q\, e^{\statemat t} + e^{\statemat^\top t} Q\, e^{\statemat t}\statemat\big)\,dt && \text{(linearity)} \\ &= \int_0^\infty \frac{d}{dt}\!\left( e^{\statemat^\top t} Q\, e^{\statemat t}\right) dt && \big(\tfrac{d}{dt}e^{\statemat t} = \statemat e^{\statemat t} = e^{\statemat t}\statemat\big) \\ &= \Big[\, e^{\statemat^\top t} Q\, e^{\statemat t}\,\Big]_0^\infty = 0 - Q = -Q . && \text{(Hurwitz: upper limit vanishes)} \end{aligned}

The Lyapunov-function claim follows by differentiating V(h)=hPhV(\statevec) = \statevec^\top P \statevec along h˙=Ah\dot{\statevec} = \statemat\statevec: V˙=h(AP+PA)h=hQh<0\dot V = \statevec^\top(\statemat^\top P + P\statemat)\statevec = -\statevec^\top Q \statevec < 0. The converse — a positive-definite PP forcing the spectrum into the open left half-plane — is the standard direction in Sontag (1998) . \qquad\blacksquare

So stability has two faces: a spectral test (eigenvalues) and an algebraic certificate (a quadratic energy VV that strictly decreases). The Lyapunov view generalizes to nonlinear systems (Week 14) and, with a control term added, becomes the Riccati equation that solves LQR (Week 13).

Controllability

Definition 12.2 (Controllability).

The pair (A,B)(\statemat,\inputmat) is controllable if for any initial state h0\statevec_0 and any target hf\statevec_f there is an input u()u(\cdot) on some finite interval that drives the state from h0\statevec_0 to hf\statevec_f. Equivalently, every state is reachable from the origin.

Theorem 12.3 (Kalman controllability rank condition).

(A,B)(\statemat,\inputmat) with ARn×n\statemat \in \R^{n\times n} is controllable iff the controllability matrix

C=(BABA2BAn1B)Rn×nm\ctrbmat = \begin{pmatrix} \inputmat & \statemat\inputmat & \statemat^2\inputmat & \cdots & \statemat^{\,n-1}\inputmat \end{pmatrix} \in \R^{n\times nm}

has full row rank, rankC=n\rank\,\ctrbmat = n.

Only powers up to An1\statemat^{n-1} appear: by the Cayley–Hamilton theorem An\statemat^n is a linear combination of I,A,,An1I, \statemat, \dots, \statemat^{n-1}, so higher powers add no new directions.

An equivalent test uses the controllability Gramian Wc=0TeAτBBeAτdτW_c = \int_0^T e^{\statemat\tau}\inputmat\inputmat^\top e^{\statemat^\top\tau}\,d\tau, which is positive definite iff (A,B)(\statemat,\inputmat) is controllable; for Hurwitz A\statemat the infinite-horizon Gramian solves a Lyapunov equation AWc+WcA=BB\statemat W_c + W_c\statemat^\top = -\inputmat\inputmat^\top, tying controllability back to the previous section.

Observability

Observability is the same question asked of the output map: can the measurements pin down the state?

Definition 12.3 (Observability).

The pair (A,C)(\statemat,\outputmat) is observable if the initial state h0\statevec_0 can be uniquely determined from the output y()y(\cdot) over any finite interval (the input being known).

Theorem 12.4 (Observability rank condition).

(A,C)(\statemat,\outputmat) with ARn×n\statemat \in \R^{n\times n}, CRp×n\outputmat \in \R^{p\times n} is observable iff the observability matrix

O=(CCACA2CAn1)Rnp×n\obsvmat = \begin{pmatrix} \outputmat \\ \outputmat\statemat \\ \outputmat\statemat^2 \\ \vdots \\ \outputmat\statemat^{\,n-1} \end{pmatrix} \in \R^{np\times n}

has full column rank, rankO=n\rank\,\obsvmat = n.

The two rank conditions are visibly mirror images — C\ctrbmat stacks AkB\statemat^k\inputmat horizontally, O\obsvmat stacks CAk\outputmat\statemat^k vertically. That mirror is not a coincidence.

Duality

Proposition 12.1 (Controllability–observability duality).

(A,C)(\statemat,\outputmat) is observable iff (A,C)(\statemat^\top,\outputmat^\top) is controllable; dually, (A,B)(\statemat,\inputmat) is controllable iff (A,B)(\statemat^\top,\inputmat^\top) is observable. Concretely, the observability matrix of (A,C)(\statemat,\outputmat) is the transpose of the controllability matrix of the dual pair,

O(A,C)=C(A,C).\obsvmat(\statemat,\outputmat) = \ctrbmat(\statemat^\top,\outputmat^\top)^\top .
Proof.

Write the dual controllability matrix, and transpose its blocks one at a time:

C(A,C)=(CAC(A)n1C),[(A)kC]=CAk.\ctrbmat(\statemat^\top,\outputmat^\top) = \begin{pmatrix} \outputmat^\top & \statemat^\top\outputmat^\top & \cdots & (\statemat^\top)^{n-1}\outputmat^\top \end{pmatrix}, \qquad \big[(\statemat^\top)^k\outputmat^\top\big]^\top = \outputmat\statemat^k .

Transposing turns the horizontal blocks into vertical ones, [CAC]=(C; CA; ; CAn1)=O(A,C)\big[\,\outputmat^\top \mid \statemat^\top\outputmat^\top \mid \cdots\,\big]^\top = (\outputmat;\ \outputmat\statemat;\ \dots;\ \outputmat\statemat^{\,n-1}) = \obsvmat(\statemat,\outputmat). Since rank is invariant under transposition, rankO(A,C)=rankC(A,C)\rank\,\obsvmat(\statemat,\outputmat) = \rank\,\ctrbmat(\statemat^\top,\outputmat^\top), and the rank conditions of Theorems 12.3–12.4 give the equivalence. \qquad\blacksquare

Duality halves the theory: every controllability result has a free observability twin. It is also why LQR (state feedback) and the Kalman filter (state estimation) are the same Riccati computation run on dual systems — the deepest structural fact Week 13 will exploit.

The structural bridge to learning

These properties are the control-theoretic statements of limits that also bound reinforcement learning:

  • Controllability ↔ reachability. An uncontrollable mode is a direction of state space no input — and therefore no policy — can affect. It is the linear, exact version of the reachable-set question that determines what any RL agent can possibly accomplish on a system.
  • Observability ↔ identifiability. An unobservable mode is a latent direction the outputs never reveal, so no estimator can recover it from data. This is precisely the gap between a fully observed MDP and a POMDP, where the agent must act on a belief over hidden state.
  • The weakenings LQR needs. Optimal control does not require full controllability/observability, only stabilizability (every uncontrollable mode is already stable) and detectability (every unobservable mode is already stable). These are the exact conditions under which the dynamic-programming value function of Week 13 exists and yields a stabilizing policy — a structural prerequisite, checked before any optimization, for the Bellman recursion to have a well-behaved fixed point.

What’s next

  • Week 13 (LQR/LQG). Put a quadratic cost on the controllable linear model and the Bellman equation collapses to the algebraic Riccati equation — the Lyapunov equation of this chapter with a control term subtracted. Stabilizability and detectability are exactly what make its stabilizing solution exist; duality makes the optimal regulator and the optimal estimator one calculation. This is where dynamic programming and control theory become the same equation.

Exercises

  1. (Compute) For A=(1203)\statemat = \begin{pmatrix} -1 & 2 \\ 0 & -3 \end{pmatrix}, decide continuous-time (Hurwitz) and discrete-time (Schur) asymptotic stability.

    Solution

    The eigenvalues are 1,3-1, -3 (triangular matrix). Both have Re<0\operatorname{Re} < 0, so A\statemat is Hurwitz and the flow h˙=Ah\dot{\statevec} = \statemat\statevec is asymptotically stable. As a discrete map, 1=1|-1| = 1 and 3=3|-3| = 3 are not inside the unit disk, so A\statemat is not Schur and hk+1=Ahk\statevec_{k+1} = \statemat\statevec_k is unstable. The same matrix is stable as a flow but unstable as a map — stability is a property of the system type, not of A\statemat alone.

  2. (Prove) With V(h)=hPhV(\statevec) = \statevec^\top P \statevec, P0P \succ 0 solving AP+PA=Q\statemat^\top P + P\statemat = -Q for Q0Q \succ 0, show V˙<0\dot V < 0 along h˙=Ah\dot{\statevec} = \statemat\statevec and conclude asymptotic stability.

    Solution

    V˙=h˙Ph+hPh˙=h(AP+PA)h=hQh<0\dot V = \dot{\statevec}^\top P \statevec + \statevec^\top P \dot{\statevec} = \statevec^\top(\statemat^\top P + P\statemat)\statevec = -\statevec^\top Q \statevec < 0 for h0\statevec \neq 0. A positive-definite function strictly decreasing along every trajectory forces h0\statevec \to 0 (Lyapunov’s direct method), so the origin is asymptotically stable — without computing a single eigenvalue.

  3. (Compute) For A=(1002)\statemat = \begin{pmatrix} -1 & 0 \\ 0 & -2 \end{pmatrix}, B=(10)\inputmat = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, form C\ctrbmat and decide controllability.

    Solution

    C=(B  AB)=(1100)\ctrbmat = (\inputmat\ \ \statemat\inputmat) = \begin{pmatrix} 1 & -1 \\ 0 & 0 \end{pmatrix}, which has rank 1<21 < 2, so (A,B)(\statemat,\inputmat) is uncontrollable. The second mode (the 2-2 eigendirection) has no input coupling, so no uu can move it. Because that mode is already stable, the system is still stabilizable — the weaker property LQR needs.

  4. (Prove) Show O(A,C)=C(A,C)\obsvmat(\statemat,\outputmat) = \ctrbmat(\statemat^\top,\outputmat^\top)^\top, and hence observability of (A,C)(\statemat,\outputmat) is equivalent to controllability of (A,C)(\statemat^\top,\outputmat^\top).

    Solution

    As in Proposition 12.1: [(A)kC]=CAk[(\statemat^\top)^k\outputmat^\top]^\top = \outputmat\statemat^k, so transposing the horizontal blocks of C(A,C)\ctrbmat(\statemat^\top,\outputmat^\top) produces the vertical observability stack O(A,C)\obsvmat(\statemat,\outputmat). Rank is invariant under transposition, so the two full-rank conditions coincide.

  5. (Implement) In the companion, verify that the controllable/uncontrollable and observable/unobservable examples have the predicted ranks, that the Lyapunov solution PP is positive definite iff A\statemat is Hurwitz, and that duality holds as a rank equality.

    Solution

    See experiments/python/week12/test_structural.py: is_controllable/is_observable match the Kalman rank conditions; lyapunov_solve returns a positive-definite PP for the Hurwitz oscillator and an indefinite PP for an unstable A\statemat (the certificate failing exactly when stability does); and observability_matrix(A.T, C.T) equals controllability_matrix(A, C).T.

  6. (Extend) The rank tests are exact in arithmetic but decided numerically by a singular-value threshold. Show that the verdict of numpy.linalg.matrix_rank on a controllable system depends on the tolerance convention: scale the input by ε\varepsilon and compare numpy’s default (relative) tolerance against a fixed absolute one.

    Solution

    Scaling the input column by ε\varepsilon shrinks every singular value of C\ctrbmat in proportion, so C\ctrbmat stays full rank for every ε>0\varepsilon > 0. numpy’s default tolerance is relativeσmaxmax(m,n)εmach\sigma_{\max}\cdot\max(m,n)\cdot\varepsilon_{\text{mach}} — hence scale-invariant: it reports rank nn at every ε\varepsilon. A fixed absolute threshold tells a different story: once the smallest singular value drops below it, the same controllable system reads as rank-deficient. The companion’s --scaling-demo prints both verdicts side by side (the default stays 22; the absolute one flips to 11, then 00). The lesson: controllability is binary in exact arithmetic but graded in finite precision, and “numerically uncontrollable” names a chosen tolerance, not the system — the Gramian’s smallest eigenvalue is the tolerance-free measure of how controllable a system is.

Companion code

The Week-12 companion lives at experiments/python/week12/ (Python, on scipy + numpy, cross-checked against python-control).

  • structural.py — Hurwitz / Schur stability tests; the Lyapunov solver for AP+PA=Q\statemat^\top P + P\statemat = -Q (scipy.linalg.solve_continuous_lyapunov); controllability and observability matrices with their rank tests; the controllability Gramian; and a built (un)controllable / (un)observable dual pair.
  • test_structural.py — mathematical-correctness tests: the stability tests agree with the eigenvalues; the Lyapunov PP is positive definite iff A\statemat is Hurwitz, with residual AP+PA+Q0\statemat^\top P + P\statemat + Q \approx 0; the rank conditions classify the controllable/observable examples and their rank-deficient duals; the Gramian solves AWc+WcA=BB\statemat W_c + W_c\statemat^\top = -\inputmat\inputmat^\top and is positive definite iff controllable; and duality holds as O(A,C)=C(A,C)\obsvmat(\statemat,\outputmat) = \ctrbmat(\statemat^\top,\outputmat^\top)^\top.
# structural algorithms + correctness tests (scipy + python-control cross-check)
PYTHONPATH=. pytest experiments/python/week12/test_structural.py -q

# worked summary table for several small systems + the rank-under-scaling demo
PYTHONPATH=. python experiments/python/week12/structural.py --scaling-demo
Part: control Week 13 Published lqr.py test_lqr.py

Linear-Quadratic Regulation: The Exact Dynamic Program

The linear-quadratic regulator as exact dynamic programming: a quadratic value function, the Riccati recursion as Chapter 1's Bellman optimality equation in coordinates, the linear-feedback optimal policy, the infinite-horizon algebraic Riccati equation, and the LQG separation principle — with Doyle's warning that optimal output feedback carries no guaranteed stability margins.

Linear-Quadratic Regulation: The Exact Dynamic Program

Where we are. Weeks 11–12 built the linear model and the structural properties — stability, controllability, observability — that say what control is possible. Now we put a cost on the model and ask for the best control. The linear-quadratic regulator (LQR) is the one optimal-control problem dynamic programming solves in closed form, and it is exactly Chapter 1’s Bellman optimality equation specialized to linear dynamics and quadratic cost. The “DP bridge” promised since Chapter 1 is paid here: the value function is a quadratic, value iteration becomes the Riccati recursion, and the optimal policy is linear state feedback u=Khu = -\lqrgain\statevec. Bellman, Riccati, and feedback control turn out to be one calculation.

The LQR problem

Definition 13.1 (Linear-quadratic regulator).

For the discrete-time linear system hk+1=Ahk+Buk\statevec_{k+1} = \statemat\statevec_k + \inputmat u_k, the finite-horizon LQR problem is to choose u0,,uN1u_0,\dots,u_{\horizon-1} minimizing the quadratic cost

J=k=0N1(hkQhk+ukRuk)+hNQNhN,\costtogo = \sum_{k=0}^{\horizon-1}\big(\statevec_k^\top Q\,\statevec_k + u_k^\top R\,u_k\big) + \statevec_\horizon^\top Q_\horizon\,\statevec_\horizon,

with state-cost Q0Q \succeq 0, control-cost R0R \succ 0, and terminal cost QN0Q_\horizon \succeq 0. The infinite-horizon problem takes N\horizon\to\infty with no terminal term, minimizing k=0(hkQhk+ukRuk)\sum_{k=0}^{\infty}(\statevec_k^\top Q\,\statevec_k + u_k^\top R\,u_k).

R0R \succ 0 makes every control expensive, so the minimizer is unique and finite; Q0Q \succeq 0 penalizes leaving the origin. This is a Markov decision process (Chapter 1) with a known deterministic linear kernel and a quadratic cost in place of a reward.

LQR is exact dynamic programming

Theorem 13.1 (LQR via dynamic programming).

The optimal cost-to-go of Definition 13.1 is the quadratic vk(h)=hPkh\valuefn_k^*(\statevec) = \statevec^\top \riccati_k\statevec, where Pk\riccati_k runs the backward Riccati recursion

PN=QN,Pk=Q+APk+1AAPk+1B(R+BPk+1B)1BPk+1A,\riccati_\horizon = Q_\horizon, \qquad \riccati_k = Q + \statemat^\top \riccati_{k+1}\statemat - \statemat^\top \riccati_{k+1}\inputmat\,(R + \inputmat^\top \riccati_{k+1}\inputmat)^{-1}\inputmat^\top \riccati_{k+1}\statemat,

and the optimal policy is the linear state feedback uk=Kkhku_k = -\lqrgain_k\statevec_k with time-varying gain

Kk=(R+BPk+1B)1BPk+1A.\lqrgain_k = (R + \inputmat^\top \riccati_{k+1}\inputmat)^{-1}\inputmat^\top \riccati_{k+1}\statemat.
Proof.

Backward induction on the Bellman optimality equation vk(h)=minu[hQh+uRu+vk+1(Ah+Bu)]\valuefn_k^*(\statevec) = \min_u\big[\statevec^\top Q\statevec + u^\top R u + \valuefn_{k+1}^*(\statemat\statevec + \inputmat u)\big].

Base. vN(h)=hQNh\valuefn_\horizon^*(\statevec) = \statevec^\top Q_\horizon\statevec, so PN=QN\riccati_\horizon = Q_\horizon.

Step. Assume vk+1(h)=hPk+1h\valuefn_{k+1}^*(\statevec) = \statevec^\top \riccati_{k+1}\statevec. Substituting and expanding the quadratic in uu,

hQh+uRu+(Ah+Bu)Pk+1(Ah+Bu)=u ⁣(R+BPk+1B)0u+2uBPk+1Ah+h(Q+APk+1A)h.\begin{aligned} \statevec^\top Q\statevec + u^\top R u + (\statemat\statevec + \inputmat u)^\top \riccati_{k+1}(\statemat\statevec + \inputmat u) &= u^\top\!\underbrace{(R + \inputmat^\top \riccati_{k+1}\inputmat)}_{\textstyle \succ 0}\,u + 2\,u^\top \inputmat^\top \riccati_{k+1}\statemat\,\statevec \\ &\quad + \statevec^\top(Q + \statemat^\top \riccati_{k+1}\statemat)\statevec . \end{aligned}

The Hessian in uu is R+BPk+1B0R + \inputmat^\top \riccati_{k+1}\inputmat \succ 0 (since R0R\succ0, Pk+10\riccati_{k+1}\succeq0), so the minimizer is the stationary point

u=(R+BPk+1B)1BPk+1Ah=Kkh.u^* = -(R + \inputmat^\top \riccati_{k+1}\inputmat)^{-1}\inputmat^\top \riccati_{k+1}\statemat\,\statevec = -\lqrgain_k\statevec .

Substituting uu^* back (completing the square) leaves a pure quadratic in h\statevec, vk(h)=hPkh\valuefn_k^*(\statevec) = \statevec^\top \riccati_k\statevec, whose matrix is exactly the Riccati recursion above. The quadratic form is preserved, closing the induction. \qquad\blacksquare

The proof is Chapter 1’s value iteration run on a quadratic ansatz.

Every step is one application of the Bellman optimality operator; the minimization is exact (a quadratic in uu), not sampled or approximated, because the model is known and the cost is quadratic. The optimal value, the optimal policy, and the optimal cost v0(h0)=h0P0h0\valuefn_0^*(\statevec_0) = \statevec_0^\top \riccati_0\statevec_0 all fall out of one backward sweep.

Infinite horizon: the algebraic Riccati equation

Run the recursion backward from a far horizon and Pk\riccati_k settles to a constant.

Theorem 13.2 (Infinite-horizon LQR).

If (A,B)(\statemat,\inputmat) is stabilizable and (A,Q1/2)(\statemat, Q^{1/2}) is detectable, then as N\horizon\to\infty the Riccati iterates converge, PkP\riccati_k\to\riccati, to the unique symmetric positive-semidefinite solution of the discrete algebraic Riccati equation (DARE)

P=Q+APAAPB(R+BPB)1BPA.\riccati = Q + \statemat^\top \riccati\,\statemat - \statemat^\top \riccati\inputmat\,(R + \inputmat^\top \riccati\inputmat)^{-1}\inputmat^\top \riccati\,\statemat .

The optimal policy is the stationary feedback u=Khu = -\lqrgain\statevec with K=(R+BPB)1BPA\lqrgain = (R + \inputmat^\top \riccati\inputmat)^{-1}\inputmat^\top \riccati\statemat, and the closed loop ABK\statemat - \inputmat\lqrgain is asymptotically stable (Schur).

The hypotheses are exactly Week 12’s structural properties: stabilizability (every uncontrollable mode already stable) guarantees a finite-cost policy exists, and detectability (every unobservable-through-QQ mode already stable) guarantees the stabilizing solution is unique.

The standard proof shows the Riccati map is a monotone contraction on the positive-semidefinite cone; see Lewis et al. (2012) and Bertsekas (2017) .

Continuous time. The same logic in continuous time replaces the recursion by the Hamilton–Jacobi–Bellman equation minuH=0\min_u \hamiltonian = 0, whose Hamiltonian H=hQh+uRu+(hv)(Ah+Bu)\hamiltonian = \statevec^\top Q\statevec + u^\top R u + (\nabla_{\statevec}\valuefn^*)^\top(\statemat\statevec + \inputmat u) adds the stage cost to the cost-to-go’s rate of change along the dynamics. Its quadratic solution gives the continuous algebraic Riccati equation AP+PAPBR1BP+Q=0\statemat^\top \riccati + \riccati\statemat - \riccati\inputmat R^{-1}\inputmat^\top \riccati + Q = 0, with gain K=R1BP\lqrgain = R^{-1}\inputmat^\top \riccati and Hurwitz closed loop ABK\statemat - \inputmat\lqrgain. The HJB equation is the continuous-time Bellman equation; Chapter 1’s HJB aside lands exactly here. Note the kinship with Week 12: drop the control term PBR1BP\riccati\inputmat R^{-1}\inputmat^\top \riccati and the ARE is the Lyapunov equation — optimal control is stability analysis with a cost-shaping term.

The bridge to Chapter 1

This is the chapter the curriculum has been pointing at. Lay the two equations side by side:

  • Chapter 1 (general MDP). v(s)=maxa[r(s,a)+γsp(ss,a)v(s)]\optvaluefn(s) = \max_a\big[\reward(s,a) + \discount\sum_{s'}\transition(s'\mid s,a)\,\optvaluefn(s')\big], solved by value iteration, which converges because T\bellmanopt is a γ\discount-contraction.
  • Chapter 13 (linear-quadratic). v(h)=minu[hQh+uRu+v(Ah+Bu)]\valuefn^*(\statevec) = \min_u\big[\statevec^\top Q\statevec + u^\top R u + \valuefn^*(\statemat\statevec + \inputmat u)\big], solved by the Riccati recursion, which converges because the Riccati map contracts on the PSD cone.

They are the same equation — minimize immediate cost plus optimal cost-to-go — under one specialization: linear dynamics and quadratic cost keep v\valuefn^* quadratic, collapsing the infinite-dimensional functional fixed point to the finite matrix P\riccati. Value iteration \leftrightarrow Riccati recursion; the Bellman operator \leftrightarrow the Riccati map; the γ\discount-contraction that gave Chapter 1 its convergence \leftrightarrow the stabilizability/detectability that gives the ARE its unique stabilizing solution. Control theory reached the Riccati equation from the calculus of variations and Pontryagin’s principle Kalman (1960) ; reinforcement learning reached value iteration from the Bellman equation; LQR is where the two derivations meet on one object.

LQG: optimal output feedback, and its fragility

Real systems are noisy and only partially measured: hk+1=Ahk+Buk+wk\statevec_{k+1} = \statemat\statevec_k + \inputmat u_k + w_k, yk=Chk+vky_k = \outputmat\statevec_k + v_k with Gaussian w,vw,v. The linear-quadratic-Gaussian (LQG) problem minimizes the expected quadratic cost. Its solution is the separation principle: run a Kalman filter Kalman (1960) to produce the optimal state estimate h^\hat{\statevec}, then apply the LQR gain to the estimate, u=Kh^u = -\lqrgain\hat{\statevec} — estimator and regulator designed independently and combined.

By W12 duality the filter gain solves the dual Riccati equation, so LQG is two Riccati solves on dual systems.

The catch is robustness. Doyle (1978) — a one-page paper — shows LQG has no guaranteed stability margins: there are LQG designs an arbitrarily small gain perturbation destabilizes.

Proposition 13.1 (LQR margins vs. LQG fragility).

The full-state LQR loop has a guaranteed gain margin [12,)[\tfrac12,\infty) and at least 6060^\circ phase margin: scaling the optimal gain by any β12\beta\geq\tfrac12 leaves AβBK\statemat - \beta\inputmat\lqrgain stable. The LQG (output-feedback) loop has no such guarantee — its gain margin can be made arbitrarily small.

The companion reproduces this: the LQR loop stays stable across a wide gain scaling (β12\beta\geq\tfrac12), while the LQG loop on the same plant is stable only at the nominal β=1\beta = 1 — an arbitrarily small gain error on either side destabilizes it. The lesson is foundational for the rest of the curriculum: optimality on the nominal model does not imply robustness. The separation principle is optimal and fragile at once — the gap that robust control, and later robust/risk-aware RL, exist to close.

What’s next

  • Week 14 (nonlinear control). Lyapunov design and feedback linearization for systems where the linear model is only a local picture. The Lyapunov function of Week 12 becomes a design tool rather than an analysis certificate, and the quadratic value function of this chapter becomes a local approximation to a nonlinear cost-to-go — the entry to nonlinear optimal control and, eventually, model predictive control (Week 15).

Exercises

  1. (Derive) Starting from the Bellman optimality equation with vk+1(h)=hPk+1h\valuefn_{k+1}^*(\statevec) = \statevec^\top \riccati_{k+1}\statevec, complete the square in uu to derive the gain Kk\lqrgain_k and the Riccati recursion for Pk\riccati_k.

    Solution

    Expanding gives u(R+BPk+1B)u+2uBPk+1Ah+h(Q+APk+1A)hu^\top(R + \inputmat^\top \riccati_{k+1}\inputmat)u + 2u^\top \inputmat^\top \riccati_{k+1}\statemat\statevec + \statevec^\top(Q + \statemat^\top \riccati_{k+1}\statemat)\statevec. The uu-Hessian R+BPk+1B0R + \inputmat^\top \riccati_{k+1}\inputmat \succ 0, so the minimizer is u=(R+BPk+1B)1BPk+1Ah=Kkhu^* = -(R + \inputmat^\top \riccati_{k+1}\inputmat)^{-1}\inputmat^\top \riccati_{k+1}\statemat\statevec = -\lqrgain_k\statevec. Back-substitution yields vk=hPkh\valuefn_k^* = \statevec^\top \riccati_k\statevec with Pk=Q+APk+1AAPk+1B(R+BPk+1B)1BPk+1A\riccati_k = Q + \statemat^\top \riccati_{k+1}\statemat - \statemat^\top \riccati_{k+1}\inputmat(R + \inputmat^\top \riccati_{k+1}\inputmat)^{-1}\inputmat^\top \riccati_{k+1}\statemat.

  2. (Compute) Solve the scalar infinite-horizon LQR: A=a\statemat = a, B=b\inputmat = b, Q=qQ = q, R=rR = r (all scalars). Write the DARE for P=p\riccati = p and solve it.

    Solution

    The DARE is p=q+a2pa2b2p2r+b2pp = q + a^2 p - \dfrac{a^2 b^2 p^2}{r + b^2 p}, a quadratic in pp. Clearing the denominator gives b2p2(b2q+(a21)r)pqr=0b^2 p^2 - (b^2 q + (a^2-1)r)\,p - qr = 0; the positive root is the stabilizing p>0p > 0, and K=abpr+b2p\lqrgain = \dfrac{ab\,p}{r + b^2 p}. As a check, with a=0a=0 the state already dies in one step and p=qp = q, K=0\lqrgain = 0.

  3. (Prove) Show the optimal LQR cost from h0\statevec_0 is exactly h0P0h0\statevec_0^\top \riccati_0\statevec_0, and that under the stationary gain the cost-to-go hkPhk\statevec_k^\top \riccati\statevec_k is non-increasing along the closed-loop trajectory.

    Solution

    By Theorem 13.1, v0(h0)=h0P0h0\valuefn_0^*(\statevec_0) = \statevec_0^\top \riccati_0\statevec_0 is the minimum cost. Along the closed loop hk+1=(ABK)hk\statevec_{k+1} = (\statemat - \inputmat\lqrgain)\statevec_k, the DARE rearranges to hkPhkhk+1Phk+1=hk(Q+KRK)hk0\statevec_k^\top \riccati\statevec_k - \statevec_{k+1}^\top \riccati\statevec_{k+1} = \statevec_k^\top(Q + \lqrgain^\top R\lqrgain)\statevec_k \geq 0, so hkPhk\statevec_k^\top \riccati\statevec_k decreases by exactly the stage cost each step — hPh\statevec^\top \riccati\statevec is a Lyapunov function for the optimal closed loop, tying back to Week 12.

  4. (Implement) In the companion, verify the DARE residual is zero, the gain matches control.dlqr, the finite-horizon Riccati converges to the ARE solution as N\horizon grows, and the simulated closed-loop cost equals h0Ph0\statevec_0^\top \riccati\statevec_0.

    Solution

    See experiments/python/week13/test_lqr.py: dare_gain solves the DARE (scipy.linalg.solve_discrete_are) with residual 0\approx 0 and gain agreeing with control.dlqr; the backward recursion’s P0\riccati_0 converges to that P\riccati as N\horizon\to\infty; and the summed stage cost under Kh-\lqrgain\statevec matches h0Ph0\statevec_0^\top \riccati\statevec_0.

  5. (Extend) Reproduce the LQG margin failure. Build the LQR loop and the LQG loop (LQR gain on a Kalman estimate) for the same plant, scale the loop gain by β\beta, and compare the stable range of β\beta.

    Solution

    See the companion’s Doyle example: the full-state loop AβBK\statemat - \beta\inputmat\lqrgain stays stable for all β12\beta\geq\tfrac12 (Prop. 13.1), but the output-feedback closed loop [AβBKLCABKLC]\big[\begin{smallmatrix}\statemat & -\beta\inputmat\lqrgain \\ L\outputmat & \statemat - \inputmat\lqrgain - L\outputmat\end{smallmatrix}\big] is stable only at β=1\beta = 1 — an arbitrarily small perturbation either way destabilizes it (the gain error β\beta multiplies only the control reaching the plant, not the commanded control the filter propagates). Optimality of the estimator-plus-regulator does not transfer to a robustness guarantee.

  6. (Extend) Using Week-12 duality, show the steady-state Kalman filter gain solves the same algebraic Riccati equation as LQR on the dual pair (A,C)(\statemat^\top,\outputmat^\top). When does the separation principle stop being optimal (as opposed to merely stable)?

    Solution

    The filter error covariance solves Σ=AΣAAΣC(CΣC+V)1CΣA+W\Sigma = \statemat\Sigma\statemat^\top - \statemat\Sigma\outputmat^\top(\outputmat\Sigma\outputmat^\top + V)^{-1}\outputmat\Sigma\statemat^\top + W, which is the DARE of Theorem 13.2 with (A,B,Q,R)(A,C,W,V)(\statemat,\inputmat,Q,R)\mapsto(\statemat^\top,\outputmat^\top,W,V). Separation is optimal precisely for linear dynamics, Gaussian noise, and quadratic cost; it loses optimality once the dynamics are nonlinear, the noise non-Gaussian, or the cost non-quadratic — where estimation and control no longer decouple (a recurring theme in model-based RL).

Companion code

The Week-13 companion lives at experiments/python/week13/ (Python, on scipy.linalg + numpy, cross-checked against python-control).

  • lqr.py — the finite-horizon backward Riccati recursion; the infinite-horizon gain via the discrete and continuous algebraic Riccati equations (solve_discrete_are, solve_continuous_are); a closed-loop cost simulator; and the Doyle LQG margin-failure example (LQR gain margin vs. the output-feedback loop’s vanishing margin).
  • test_lqr.py — mathematical-correctness tests: the DARE/CARE residuals vanish; the gain equals control.dlqr / control.lqr; the finite-horizon P0\riccati_0 converges to the ARE solution as N\horizon\to\infty; the simulated optimal cost equals h0Ph0\statevec_0^\top \riccati\statevec_0; the closed loop is Schur/Hurwitz; and the LQR gain margin contains [12,)[\tfrac12,\infty) while the LQG loop is destabilized near β=1\beta = 1.
# LQR/LQG algorithms + correctness tests (scipy + python-control cross-check)
PYTHONPATH=. pytest experiments/python/week13/test_lqr.py -q

# worked finite/infinite-horizon LQR + the Doyle LQG margin demonstration
PYTHONPATH=. python experiments/python/week13/lqr.py --doyle
Part: control Week 14 Published nonlinear.py test_nonlinear.py

Nonlinear Control: Lyapunov Design, Feedback Linearization, and Sliding Modes

When the plant is nonlinear, eigenvalues and Riccati equations describe only a local picture. This chapter builds the tools that replace them: Lyapunov's direct method and LaSalle's invariance principle as global stability certificates, feedback linearization that cancels the nonlinearity by coordinate change, sliding-mode control that enforces a surface in finite time and is robust to matched uncertainty, and input-to-state stability and backstepping as the constructive bridge to robust and adaptive design — with the Lyapunov function read as the control-theoretic cousin of the reinforcement-learning value function.

Nonlinear Control: Lyapunov Design, Feedback Linearization, and Sliding Modes

Where we are. Weeks 11–13 lived in the linear world: a state-space model, the structural tests of controllability and observability, and the linear-quadratic regulator that solved the optimal-control problem in closed form. Every one of those tools is, at bottom, an eigenvalue statement — stability is the spectrum of A\statemat, the LQR closed loop is Schur because the Riccati matrix makes it so. But the linear model is only the tangent picture at one operating point. A pendulum, a quadrotor, a robot arm, a power grid: linearize and you get a local certificate that says nothing about the basin of attraction, the swing-up, or behavior far from equilibrium. This chapter asks the Week-14 question: what replaces eigenvalues and Riccati equations when the plant is genuinely nonlinear? The answer is a shift from spectra to energy — from “where are the poles” to “does a scalar certificate decrease along every trajectory.”

Why the linear tools run out

A nonlinear system h˙=f(h,u)\dot{\statevec} = f(\statevec, u), f(0,0)=0f(0,0)=0, linearized at the origin gives h˙Ah+Bu\dot{\statevec} \approx \statemat\statevec + \inputmat u with A=f/h\statemat = \partial f/\partial\statevec. Lyapunov’s indirect method says: if A\statemat (closed-loop) is Hurwitz, the origin is locally asymptotically stable — and that is all it says. It is silent on how large the basin is, blind to multiple equilibria and limit cycles, and useless when the linearization is marginal (eigenvalues on the imaginary axis), which is exactly when nonlinear terms decide stability. The pendulum makes this concrete: about the hanging equilibrium the linearization is a lightly damped oscillator, but the global behavior — every release angle spiraling to rest — is a nonlinear, energy statement the eigenvalues cannot certify. We need a tool that sees the whole state space at once.

Lyapunov’s direct method

The idea is mechanical-energy made abstract. Find a scalar V(h)0\lyap(\statevec) \geq 0 that is zero only at the equilibrium and decreases along trajectories; then trajectories cannot escape its sublevel sets, and if the decrease is strict they slide down to the equilibrium. No solution of the differential equation is required — the certificate is checked by differentiation alone.

Definition 14.1 (Stability of an equilibrium).

The origin of h˙=f(h)\dot{\statevec} = f(\statevec), f(0)=0f(0)=0, is stable (in the sense of Lyapunov) if for every ε>0\varepsilon > 0 there is a δ>0\delta > 0 such that h(0)<δ\norm{\statevec(0)} < \delta implies h(t)<ε\norm{\statevec(t)} < \varepsilon for all t0t \geq 0; asymptotically stable if it is stable and h(t)0\statevec(t) \to 0 for all h(0)\statevec(0) near the origin; and globally asymptotically stable if that holds for every h(0)\statevec(0).

Theorem 14.1 (Lyapunov's direct method).

Let V:DR\lyap : \mathcal{D} \to \mathbb{R} be continuously differentiable on a neighborhood D\mathcal{D} of the origin, positive definite (V(0)=0\lyap(0)=0 and V(h)>0\lyap(\statevec)>0 for h0\statevec \neq 0), with derivative along the flow V˙(h)=V(h)f(h)\dot{\lyap}(\statevec) = \nabla\lyap(\statevec)^\top f(\statevec). If V˙(h)0\dot{\lyap}(\statevec) \leq 0 on D\mathcal{D} the origin is stable; if V˙(h)<0\dot{\lyap}(\statevec) < 0 for h0\statevec \neq 0 it is asymptotically stable; and if additionally V\lyap is radially unbounded (V(h)\lyap(\statevec)\to\infty as h\norm{\statevec}\to\infty) it is globally asymptotically stable.

Proof.

Fix ε>0\varepsilon>0 and a ball BεDB_\varepsilon \subset \mathcal{D}. Let m=minh=εV(h)>0m = \min_{\norm{\statevec}=\varepsilon}\lyap(\statevec) > 0 by positive definiteness, and choose δ\delta so that V(h)<m\lyap(\statevec) < m on h<δ\norm{\statevec}<\delta. Since V˙0\dot{\lyap}\leq 0, V(h(t))\lyap(\statevec(t)) is non-increasing, so a trajectory starting inside h<δ\norm{\statevec}<\delta has V(h(t))<m\lyap(\statevec(t)) < m for all tt and can never reach the shell h=ε\norm{\statevec}=\varepsilon — stability. If V˙<0\dot{\lyap}<0 strictly, V(h(t))\lyap(\statevec(t)) is a decreasing function bounded below by 00, hence convergent; its limit must be a point where V˙=0\dot{\lyap}=0, which is only the origin — asymptotic stability. Radial unboundedness makes every sublevel set compact, so the argument is global. \qquad\blacksquare

The pendulum is the worked example.

Take the energy V(ϕ,ϕ˙)=12ϕ˙2+(g/)(1cosϕ)\lyap(\phi,\dot\phi) = \tfrac12\dot\phi^2 + (g/\ell)(1-\cos\phi), positive definite about the hanging equilibrium. Along the damped dynamics its rate is V˙=dϕ˙20\dot{\lyap} = -d\,\dot\phi^2 \leq 0negative semidefinite, not definite, because it vanishes on the whole line ϕ˙=0\dot\phi = 0, not just at the origin. Theorem 14.1 then gives only stability, not asymptotic stability, even though we know every trajectory decays. The gap is closed by an invariance argument Khalil (2002) .

Theorem 14.2 (LaSalle's invariance principle).

Let Ω\Omega be a compact set, positively invariant under h˙=f(h)\dot{\statevec}=f(\statevec), and let V\lyap be continuously differentiable with V˙0\dot{\lyap}\leq 0 on Ω\Omega. Then every trajectory starting in Ω\Omega converges to the largest invariant set contained in {hΩ:V˙(h)=0}\{\statevec \in \Omega : \dot{\lyap}(\statevec)=0\}.

For the damped pendulum {V˙=0}\{\dot{\lyap}=0\} is {ϕ˙=0}\{\dot\phi=0\}; the only complete trajectory that stays there is the equilibrium (if ϕ˙0\dot\phi\equiv 0 then ϕ¨0\ddot\phi\equiv 0 forces sinϕ=0\sin\phi=0), so LaSalle upgrades stability to asymptotic stability with a semidefinite V˙\dot{\lyap}. The companion confirms it: energy decreases monotonically and the state converges to the origin.

The bridge to Chapters 1 and 13. A Lyapunov function is a value function read backwards. Chapter 1’s vπ\valuefn_\policy satisfies the Bellman equation and decreases in expectation under an improving policy; a Lyapunov V\lyap decreases along every deterministic trajectory, V˙<0\dot{\lyap}<0 — the autonomous, worst-case analog of “the value goes down.” Chapter 13 fused the two: there we proved hPh\statevec^\top\riccati\statevec drops by exactly the stage cost each step, so the optimal cost-to-go is a Lyapunov function. The difference is where the certificate comes from. Optimal control solves the whole Hamilton–Jacobi–Bellman problem and gets V\lyap as a byproduct; Lyapunov design guesses V\lyap (often a physical energy) and only checks a derivative — cheaper, structure-exploiting, and not tied to optimality.

Feedback linearization

Lyapunov’s method certifies; feedback linearization constructs a controller by canceling the nonlinearity outright. For an input-affine system h˙=f(h)+g(h)u\dot{\statevec} = f(\statevec) + g(\statevec)u with output y=h(h)y = h(\statevec), differentiate yy until the input appears.

Definition 14.2 (Relative degree).

The system has relative degree rr at h0\statevec_0 if LgLfkh(h)=0\liederiv_g \liederiv_f^{k}h(\statevec)=0 for k<r1k < r-1 near h0\statevec_0 and LgLfr1h(h0)0\liederiv_g \liederiv_f^{r-1}h(\statevec_0)\neq 0, where the Lie derivative Lfh=hf\liederiv_f h = \nabla h^\top f is the rate of change of hh along ff. Equivalently, rr is the number of differentiations of yy before the input uu appears explicitly.

When the relative degree equals the state dimension, the change of coordinates (h,Lfh,,Lfr1h)(\,h, \liederiv_f h, \dots, \liederiv_f^{r-1}h\,) turns the system into a chain of integrators, and the control

u=1LgLfr1h(h)(Lfrh(h)+v)u = \frac{1}{\liederiv_g \liederiv_f^{r-1}h(\statevec)}\big(-\liederiv_f^{r}h(\statevec) + v\big)

makes the input–output map exactly linear, y(r)=vy^{(r)} = v — then place poles with a linear vv Sastry (1999) . The pendulum is the textbook case of computed torque: with θ¨=asinθdθ˙+bu\ddot\theta = a\sin\theta - d\dot\theta + b\,u, choosing u=1b(asinθ+dθ˙+v)u = \tfrac1b(-a\sin\theta + d\dot\theta + v) cancels gravity and damping, leaving θ¨=v\ddot\theta = v; then v=k1θk2θ˙v = -k_1\theta - k_2\dot\theta gives a chosen second-order linear closed loop Slotine & Li (1991) .

Proposition 14.1 (Computed-torque exactness).

Under the computed-torque law above, the nonlinear closed loop equals the linear system h˙=[01k1k2]h\dot{\statevec} = \big[\begin{smallmatrix}0 & 1\\ -k_1 & -k_2\end{smallmatrix}\big]\statevec exactly, with closed-loop poles the roots of λ2+k2λ+k1\lambda^2 + k_2\lambda + k_1. Stability is by design (choose k1,k2>0k_1,k_2>0), not by linearization.

The companion integrates the true nonlinear plant under this law and the target linear system from the same initial state and finds the trajectories identical to numerical precision — the nonlinearity is gone, not merely small. The cost is honesty about its price: feedback linearization needs an accurate model (it cancels exact terms), it can demand large control authority, and it is only valid where the relative degree is well-defined and the internal dynamics (the unobserved part when r<nr < n) are stable. Cancel a nonlinearity you do not know precisely and the cancellation leaves a residual — which motivates a method that does not depend on exact cancellation.

Sliding-mode control

Sliding-mode control trades smoothness for robustness. Instead of canceling the dynamics, it forces the state onto a designer-chosen surface and keeps it there despite model error, as long as the uncertainty enters through the same channel as the control (matched uncertainty).

Proposition 14.2 (Finite-time reaching).

Let s(h)s(\statevec) define a sliding surface s=0s=0 on which the reduced dynamics are stable, and suppose the control enforces the reaching law s˙=ηsign(s)\dot s = -\eta\,\mathrm{sign}(s) with η>0\eta>0. Then ddt12s2=ηsη2(12s2)1/2\tfrac{d}{dt}\tfrac12 s^2 = -\eta\,\abs{s} \leq -\eta\sqrt{2}\,\big(\tfrac12 s^2\big)^{1/2}, so s\abs{s} reaches 00 in finite time bounded by s(0)/η\abs{s(0)}/\eta, after which the trajectory stays on s=0s=0 and the reduced dynamics carry it to the origin.

For the pendulum, take s=θ˙+λθs = \dot\theta + \lambda\theta (λ>0\lambda>0); on s=0s=0 the reduced dynamics are θ˙=λθ\dot\theta = -\lambda\theta, which decays. The control u=1b(asinθ+(dλ)θ˙ηsat(s/Φ))u = \tfrac1b\big(-a\sin\theta + (d-\lambda)\dot\theta - \eta\,\mathrm{sat}(s/\Phi)\big) drives ss into a boundary layer of width Φ\Phi.

The robustness is the selling point: a wrong gravity estimate still drives s0s\to 0, because the gravity term enters through the same input channel as uu and is dominated by a large enough η\eta Slotine & Li (1991) . The companion verifies all three claims — finite-time reaching within the s(0)/η\abs{s(0)}/\eta bound, the surface maintained inside the boundary layer thereafter, and convergence under a deliberately mismatched model.

Lyapunov is doing the work both times. Computed torque imposes a stable linear Lyapunov function; sliding mode uses 12s2\tfrac12 s^2 as a Lyapunov function for the surface. Each design is a recipe for a certificate — which is exactly what the next two ideas systematize.

Input-to-state stability and backstepping

Real systems have disturbances. Input-to-state stability (ISS) is the right generalization of asymptotic stability to forced systems, and it is stated with comparison functions: a class-K\classK function is continuous, strictly increasing, and zero at zero.

Definition 14.3 (Input-to-state stability).

The system h˙=f(h,u)\dot{\statevec} = f(\statevec, u) is input-to-state stable if there exist a class-KL\classK\mathcal{L} function β\beta and a class-K\classK function γ\gamma such that for every initial state and every bounded input,

h(t)    β(h(0),t)  +  γ(sup0τtu(τ)).\norm{\statevec(t)} \;\leq\; \beta\big(\norm{\statevec(0)},\,t\big) \;+\; \gamma\Big(\sup_{0\leq\tau\leq t}\norm{u(\tau)}\Big).

The state is eventually bounded by a gain γ\gamma of the input size, and decays to zero when the input does.

ISS, due to Sontag, makes “small disturbance, small deviation” precise and composable: a cascade of ISS subsystems is ISS, which is what lets large nonlinear designs be built and certified piece by piece Sontag (1998) . Backstepping is the constructive engine that exploits this. For systems in strict-feedback (cascade) form, it builds the controller and a Lyapunov function recursively: stabilize the first subsystem treating the next state as a virtual control, define the error between that state and its desired value, augment the Lyapunov function with a quadratic in the error, and step inward until the real input appears Kellett & Braun (2023) . The output is a controller and a Lyapunov certificate delivered together — Lyapunov design turned into an algorithm, and the classical counterpart of the “value function as a learnable object” that model-based RL will pursue in Part III.

Nonlinear control versus deep RL

Lay the chapter beside the reinforcement-learning half of the curriculum. Both seek a feedback policy and a scalar certificate of good behavior; they differ in what they assume and what they pay.

  • Nonlinear control assumes a model with structure — input-affine form, a known relative degree, matched uncertainty, a physical energy. Given that structure, it returns a controller with a stability proof and no sampling: computed torque, a sliding surface, a backstepping Lyapunov function. The certificate V\lyap is designed, not learned.
  • Deep RL assumes samples, not structure. It learns v\optvaluefn and π\policy from interaction, paying in data and variance, and buys the ability to handle dynamics no one can write down — contact, friction, pixels — where relative degree and clean cancellations do not exist.

The dividing line is whether the structure is available and trustworthy. When it is, control wins on guarantees and sample cost; when the structure is absent or too hard to obtain, sampling wins on reach. Part III is the synthesis: model predictive control (Week 15) turns a model into a policy by online optimization, and the convergence weeks graft learned value functions onto controllers with Lyapunov-style guarantees — RL’s reach with control’s certificates.

What’s next

  • Week 15 (model predictive control). Rather than design one feedback law, re-solve a finite-horizon optimal control problem at every step and apply the first move. MPC is online approximate dynamic programming: Chapter 13’s Riccati value function becomes a horizon-NN optimization, the Lyapunov certificate of this chapter reappears as a terminal cost, and constraints — which neither LQR nor feedback linearization handle — become first-class.

Exercises

  1. (Derive) For the hanging pendulum with energy V=12ϕ˙2+(g/)(1cosϕ)\lyap = \tfrac12\dot\phi^2 + (g/\ell)(1-\cos\phi) and dynamics ϕ¨=(g/)sinϕdϕ˙\ddot\phi = -(g/\ell)\sin\phi - d\dot\phi, compute V˙=Vf\dot{\lyap} = \nabla\lyap^\top f and show the gravity terms cancel, leaving V˙=dϕ˙2\dot{\lyap} = -d\dot\phi^2.

    Solution

    V=((g/)sinϕ, ϕ˙)\nabla\lyap = \big((g/\ell)\sin\phi,\ \dot\phi\big) and f=(ϕ˙, (g/)sinϕdϕ˙)f = \big(\dot\phi,\ -(g/\ell)\sin\phi - d\dot\phi\big). Their inner product is (g/)sinϕϕ˙+ϕ˙((g/)sinϕdϕ˙)=dϕ˙2(g/\ell)\sin\phi\,\dot\phi + \dot\phi(-(g/\ell)\sin\phi - d\dot\phi) = -d\dot\phi^2. The cross terms cancel; only dissipation remains. For d>0d>0 this is negative semidefinite, so Theorem 14.1 gives stability and LaSalle (Theorem 14.2) upgrades it to asymptotic stability.

  2. (Prove) Show that V˙<0\dot{\lyap}<0 for h0\statevec\neq 0 implies the limit of V(h(t))\lyap(\statevec(t)) is a value at which V˙=0\dot{\lyap}=0, and conclude the origin is asymptotically stable.

    Solution

    V(h(t))\lyap(\statevec(t)) is monotonically decreasing and bounded below by 00, hence convergent to some V0\lyap_\infty\geq 0. If V>0\lyap_\infty>0 the trajectory stays in the compact shell {c1VV(h(0))}\{c_1\leq\lyap\leq\lyap(\statevec(0))\} where V˙μ<0\dot{\lyap}\leq -\mu<0, forcing V\lyap\to-\infty — a contradiction. So V=0\lyap_\infty=0, and by positive definiteness h(t)0\statevec(t)\to0. (This is the strict-decrease half of Theorem 14.1; LaSalle handles the semidefinite case.)

  3. (Compute) A system has relative degree r=2r=2: x˙1=x2\dot{x}_1 = x_2, x˙2=sinx1+u\dot{x}_2 = \sin x_1 + u, y=x1y=x_1. Find the feedback-linearizing control that imposes y¨=k1yk2y˙\ddot y = -k_1 y - k_2\dot y and give the closed-loop poles.

    Solution

    y¨=x˙2=sinx1+u\ddot y = \dot x_2 = \sin x_1 + u, so u=sinx1+vu = -\sin x_1 + v with v=k1x1k2x2v = -k_1 x_1 - k_2 x_2 yields y¨=k1yk2y˙\ddot y = -k_1 y - k_2\dot y. The Lie-derivative bookkeeping: Lfh=x2\liederiv_f h = x_2, Lf2h=sinx1\liederiv_f^2 h = \sin x_1, LgLfh=10\liederiv_g\liederiv_f h = 1\neq0 (relative degree 22). Poles are the roots of λ2+k2λ+k1\lambda^2 + k_2\lambda + k_1; pick k1,k2>0k_1,k_2>0 for a Hurwitz pair.

  4. (Prove) For the reaching law s˙=ηsign(s)\dot s = -\eta\,\mathrm{sign}(s), show s(t)\abs{s(t)} hits zero by time s(0)/η\abs{s(0)}/\eta. Why does matched uncertainty not change this bound?

    Solution

    With W=12s2W=\tfrac12 s^2, W˙=ss˙=ηs=η2W\dot W = s\dot s = -\eta\abs{s} = -\eta\sqrt{2W}. Separating, 2W\sqrt{2W} decreases at constant rate η\eta, so s=2W\abs{s}=\sqrt{2W} reaches 00 in time s(0)/η\abs{s(0)}/\eta. A matched disturbance δ\delta enters as s˙=ηsign(s)+δ\dot s = -\eta\,\mathrm{sign}(s) + \delta; choosing η>supδ\eta > \sup\abs{\delta} keeps ss˙(ηsupδ)s<0s\dot s \leq -(\eta-\sup\abs{\delta})\abs{s} < 0, so the surface is still reached — the gain η\eta dominates the uncertainty rather than canceling it.

  5. (Implement) In the companion, verify the three sliding-mode claims: finite-time reaching within s(0)/η\abs{s(0)}/\eta, the surface maintained in the boundary layer afterward, and convergence under a mismatched model.

    Solution

    See experiments/python/week14/test_nonlinear.py: test_sliding_mode_reaches_surface_in_bounded_time checks the reaching bound and boundary-layer maintenance; test_sliding_mode_robust_to_matched_parameter_error builds the controller with a wrong length (so a wrong gravity coefficient) yet still reaches the surface and converges — matched-uncertainty robustness.

  6. (Extend) Argue why the quadratic LQR cost-to-go hPh\statevec^\top\riccati\statevec of Chapter 13 is a Lyapunov function for the optimal closed loop, and use this to connect Lyapunov design to value functions.

    Solution

    From Chapter 13, along the optimal closed loop hkPhkhk+1Phk+1=hk(Q+KRK)hk0\statevec_k^\top\riccati\statevec_k - \statevec_{k+1}^\top\riccati\statevec_{k+1} = \statevec_k^\top(Q + \lqrgain^\top R\lqrgain)\statevec_k \geq 0, so V=hPh\lyap=\statevec^\top\riccati\statevec is positive definite with ΔV0\Delta\lyap\leq 0 — a Lyapunov function that also equals the optimal value. Lyapunov design chooses such a certificate directly (energy, 12s2\tfrac12 s^2, a backstepping sum) instead of solving the optimal-control problem for it — the same object, reached without the full Hamilton–Jacobi–Bellman computation. This is the seam model-based RL works along in Part III.

Companion code

The Week-14 companion lives at experiments/python/week14/ (pure numpy, RK4 integration with the controller evaluated at each stage for a faithful continuous-time closed loop).

  • nonlinear.py — the pendulum model (hanging and upright conventions); the energy Lyapunov function with its analytic and numeric V˙\dot{\lyap}; a certificate check that passes for the damped pendulum and fails for the anti-damped one; the computed-torque feedback-linearizing law and its target linear system; and the boundary-layer sliding-mode law with the sliding surface and reaching time.
  • test_nonlinear.py — mathematical-correctness tests: V˙=dϕ˙2\dot{\lyap}=-d\dot\phi^2 matches the numeric Vf\nabla\lyap^\top f; the certificate discriminates damped from anti-damped; the damped pendulum’s energy is monotone and the state converges (LaSalle); feedback linearization reproduces the target linear system exactly with the prescribed poles; and sliding mode reaches the surface in bounded time, stays in the boundary layer, and is robust to matched parameter error.
# nonlinear-control algorithms + correctness tests
PYTHONPATH=. pytest experiments/python/week14/test_nonlinear.py -q

# worked Lyapunov / feedback-linearization / sliding-mode demonstrations on the pendulum
PYTHONPATH=. python experiments/python/week14/nonlinear.py