Chapters

Every chapter, grouped by Part. Use the card metadata (the week number and status tag, such as implemented or chapter_only) to calibrate how much trust to place in a chapter's specific claims.

Foundations

  1. Week 1 implemented

    Markov Decision Processes and Dynamic Programming

    The finite MDP, the Bellman expectation and optimality equations, and the gamma-contraction that makes value iteration and policy iteration converge — dynamic programming as the spine the rest of the curriculum returns to.

  2. Week 2 implemented

    Asynchronous and Prioritized Dynamic Programming

    Keeping the gamma-contraction but varying the schedule: asynchronous and Gauss–Seidel value iteration, prioritized sweeping, and real-time DP. Why the backups can be applied in any order without losing convergence (monotonicity plus a constant-shift identity), and how update order and residual size set the practical convergence rate.

  3. Week 3 implemented

    Monte Carlo Methods

    Estimating value from sampled returns when the model is unknown: first-visit Monte Carlo prediction, Monte Carlo control by generalized policy iteration, and off-policy learning by importance sampling. Monte Carlo estimation as quadrature: unbiased and model-free but high-variance, and fragile once off-policy correction is involved.

  4. Week 4 implemented

    Temporal-Difference Learning: TD(0), SARSA, and Q-Learning

    Bootstrapping from sampled transitions: the TD(0) prediction update as a stochastic Euler step toward the Bellman fixed point, SARSA as on-policy control, and Q-learning as off-policy control. Where temporal-difference learning sits between Monte Carlo and dynamic programming, and what the SARSA/Q-learning split reveals on the cliff-walking task.

  5. Week 5 implemented

    Function Approximation and the Deadly Triad

    Replacing the value table with a parametric approximator: linear value functions, semi-gradient TD, and the projected Bellman operator. Why on-policy semi-gradient TD with linear features converges to a bounded-error fixed point, and why function approximation, bootstrapping, and off-policy training together can diverge: the deadly triad, witnessed by Baird's counterexample.

  6. Week 6 implemented

    The DQN Family

    Making approximate Q-learning stable enough for pixels: experience replay and target networks as direct countermeasures to the deadly triad, the overestimation bias that motivates Double DQN, and the dueling, prioritized, and Rainbow refinements. The engineering that scaled value-based RL to Atari.

  7. Week 7 implemented

    Policy Gradient Foundations

    Optimizing a parameterized stochastic policy directly by gradient ascent on expected return: the policy gradient theorem via the log-derivative trick, REINFORCE, and baselines as variance-reducing control variates. Policy gradients as Monte Carlo sensitivity analysis, and the advantage function that bridges to actor-critic.

  8. Week 8 implemented

    Actor-Critic, GAE, PPO, and TRPO

    Turning REINFORCE into a stable, sample-reusing optimizer: actor-critic with a learned baseline, generalized advantage estimation as a bias–variance dial, and trust regions (TRPO/PPO) as step-size control in policy space. Why the clipped surrogate works, and why implementation details decide the score.

  9. Week 9 implemented

    Off-Policy Continuous Control: DDPG, TD3, and SAC

    Off-policy actor-critic for continuous actions: the deterministic policy gradient (DDPG), the twin-critic overestimation fix (TD3), and maximum-entropy RL (SAC). Why the soft-optimal policy is Boltzmann in the action-value, and how maximum-entropy RL is KL-regularized optimal control — the bridge from learning to Part II.

  10. Week 10 chapter_only

    Reinforcement Learning in the Real World: A Robotics Survey

    Where reinforcement learning has actually worked on real hardware (dexterous manipulation, legged locomotion, champion-level drone racing, and tokamak plasma control) and the recipe behind those successes: abundant simulation, domain randomization, and RL reserved for where models are hard or adaptation is required. The empirical case for RL, and why it sets up Part II.

Control

  1. Week 11 implemented

    State-Space Models and Transfer Functions

    The entry to control theory: the linear state-space model, the transfer function, and their equivalence. Two views of one dynamical system, the state-similarity invariance that makes the transfer function the coordinate-free object, and the discrete-time model, which is exactly the MDP dynamics that control theory assumes to be known.

  2. Week 12 implemented

    Stability, Controllability, and Observability

    The structural properties of a linear model knowable before any controller exists: internal stability via eigenvalues and the Lyapunov equation, controllability and observability as reachability and state-identifiability, and the duality that makes them one theory and makes the linear-quadratic regulator and the Kalman estimator one computation.

  3. Week 13 implemented

    Linear-Quadratic Regulation: The Exact Dynamic Program

    The linear-quadratic regulator as exact dynamic programming: a quadratic value function, the Riccati recursion as Chapter 1's Bellman optimality equation in coordinates, the linear-feedback optimal policy, the infinite-horizon algebraic Riccati equation, and the LQG separation principle — with Doyle's warning that optimal output feedback carries no guaranteed stability margins.

  4. Week 14 implemented

    Nonlinear Control: Lyapunov Design, Feedback Linearization, and Sliding Modes

    When the plant is nonlinear, eigenvalues and Riccati equations describe only a local picture. This chapter builds the tools that replace them: Lyapunov's direct method and LaSalle's invariance principle as global stability certificates; feedback linearization, which cancels the nonlinearity by a change of coordinates and state feedback; sliding-mode control, which drives the state onto a sliding surface in finite time and is robust to matched uncertainty; and input-to-state stability and backstepping as the constructive bridge to robust and adaptive design. Throughout, the Lyapunov function is read as the control-theoretic cousin of the reinforcement-learning value function.