Book A scaffold-astro book

Chapters

Every chapter, grouped by Part. Use each card's metadata (the week number and the status tag, such as implemented or chapter_only) to calibrate how much trust to place in a chapter's specific claims.

Part I · Foundations

Week 1 implemented

Markov Decision Processes and Dynamic Programming

The finite MDP, the Bellman expectation and optimality equations, and the gamma-contraction that makes value iteration and policy iteration converge — dynamic programming as the spine the rest of the curriculum returns to.
Week 2 implemented

Asynchronous and Prioritized Dynamic Programming

Keeping the gamma-contraction but varying the schedule: asynchronous and Gauss–Seidel value iteration, prioritized sweeping, and real-time DP. Why the order of backups does not change the fixed point (monotonicity plus a constant-shift identity), and how update order and residual size set the practical convergence rate.
Week 3 implemented

Monte Carlo Methods

Estimating value from sampled returns when the model is unknown: first-visit Monte Carlo prediction, Monte Carlo control by generalized policy iteration, and off-policy learning by importance sampling. Monte Carlo estimation as quadrature: unbiased, model-free, and high-variance, and fragile once off-policy correction is added.
Week 4 implemented

Temporal-Difference Learning: TD(0), SARSA, and Q-Learning

Bootstrapping from sampled transitions: the TD(0) prediction update as a stochastic Euler step toward the Bellman fixed point, SARSA as on-policy control, and Q-learning as off-policy control. Where temporal-difference learning sits between Monte Carlo and dynamic programming, and what the SARSA/Q-learning split reveals on the cliff-walking task.
Week 5 implemented

Function Approximation and the Deadly Triad

Replacing the value table with a parametric approximator: linear value functions, semi-gradient TD, and the projected Bellman operator. Why on-policy semi-gradient TD converges to a bounded-error fixed point, and why function approximation, bootstrapping, and off-policy training together can diverge — the deadly triad, witnessed by Baird's counterexample.
Week 6 implemented

The DQN Family

Making approximate Q-learning stable enough for pixels: experience replay and target networks as direct countermeasures to the deadly triad, the overestimation bias that motivates Double DQN, and the dueling, prioritized, and Rainbow refinements. The engineering that scaled value-based RL to Atari.
Week 7 implemented

Policy Gradient Foundations

Optimizing a parameterized stochastic policy directly by gradient ascent on expected return: the policy gradient theorem via the log-derivative trick, REINFORCE, and baselines as variance-reducing control variates. Policy gradients as Monte Carlo sensitivity analysis — and the advantage that bridges to actor-critic.
Week 8 implemented

Actor-Critic, GAE, PPO, and TRPO

Turning REINFORCE into a stable, sample-reusing optimizer: actor-critic with a learned baseline, generalized advantage estimation as a bias–variance dial, and trust regions (TRPO/PPO) as step-size control in policy space. Why the clipped surrogate works, and why implementation details decide the score.
Week 9 implemented

Off-Policy Continuous Control: DDPG, TD3, and SAC

Off-policy actor-critic for continuous actions: the deterministic policy gradient (DDPG), the twin-critic overestimation fix (TD3), and maximum-entropy RL (SAC). Why the soft-optimal policy is Boltzmann in the action-value, and how maximum-entropy RL is KL-regularized optimal control — the bridge from learning to Part II.
Week 10 chapter_only

Reinforcement Learning in the Real World: A Robotics Survey

Where reinforcement learning has actually worked on real hardware — dexterous manipulation, legged locomotion, champion drone racing, and tokamak plasma control — and the recipe behind those successes: abundant simulation, domain randomization, and RL reserved for where models are hard or adaptation is required. The empirical case for RL, and why it sets up Part II.

Part II · Control

Week 11 implemented

State-Space Models and Transfer Functions

The entry to control theory: the linear state-space model, the transfer function, and their equivalence. Two views of one dynamical system, the state-similarity invariance that makes the transfer function the coordinate-free object, and the discrete-time model, which is exactly the MDP transition dynamics that control theory assumes known.
Week 12 implemented

Stability, Controllability, and Observability

The structural properties of a linear model knowable before any controller exists: internal stability via eigenvalues and the Lyapunov equation, controllability and observability as reachability and state-identifiability, and the duality that makes them one theory and makes the linear-quadratic regulator and the Kalman estimator one computation.
Week 13 implemented

Linear-Quadratic Regulation: The Exact Dynamic Program

The linear-quadratic regulator as exact dynamic programming: a quadratic value function, the Riccati recursion as Chapter 1's Bellman optimality equation in coordinates, the linear-feedback optimal policy, the infinite-horizon algebraic Riccati equation, and the LQG separation principle — with Doyle's warning that optimal output feedback carries no guaranteed stability margins.
Week 14 implemented

Nonlinear Control: Lyapunov Design, Feedback Linearization, and Sliding Modes

When the plant is nonlinear, eigenvalues and Riccati equations describe only a local picture. This chapter builds the tools that replace them: Lyapunov's direct method and LaSalle's invariance principle as global stability certificates; feedback linearization, which cancels the nonlinearity through state feedback and a change of coordinates; sliding-mode control, which drives the state onto a sliding surface in finite time and is robust to matched uncertainty; and input-to-state stability and backstepping as the constructive bridge to robust and adaptive design. The Lyapunov function is read as the control-theoretic cousin of the reinforcement-learning value function.