Chapters
Every chapter, grouped by Part. Use the card metadata to calibrate how much trust to place in a chapter's specific claims.
Foundations
-
Markov Decision Processes and Dynamic Programming
The finite MDP, the Bellman expectation and optimality equations, and the gamma-contraction that makes value iteration and policy iteration converge — dynamic programming as the spine the rest of the curriculum returns to.
-
Asynchronous and Prioritized Dynamic Programming
Keeping the gamma-contraction but varying the schedule: asynchronous and Gauss–Seidel value iteration, prioritized sweeping, and real-time DP. Why the order of backups can be chosen freely (monotonicity plus a constant-shift identity), and how update order and residual size set the practical convergence rate.
-
Monte Carlo Methods
Estimating value from sampled returns when the model is unknown: first-visit Monte Carlo prediction, Monte Carlo control by generalized policy iteration, and off-policy learning by importance sampling. Monte Carlo estimation as quadrature: unbiased, model-free, and high-variance, with an off-policy correction that is fragile in practice.
-
Temporal-Difference Learning: TD(0), SARSA, and Q-Learning
Bootstrapping from sampled transitions: the TD(0) prediction update as a stochastic Euler step toward the Bellman fixed point, SARSA as on-policy control, and Q-learning as off-policy control. Where temporal-difference learning sits between Monte Carlo and dynamic programming, and what the SARSA/Q-learning split reveals on the cliff-walking task.
-
Function Approximation and the Deadly Triad
Replacing the value table with a parametric approximator: linear value functions, semi-gradient TD, and the projected Bellman operator. Why on-policy semi-gradient TD converges to a bounded-error fixed point, and why function approximation, bootstrapping, and off-policy training together can diverge — the deadly triad, witnessed by Baird's counterexample.
-
The DQN Family
Making approximate Q-learning stable enough for pixels: experience replay and target networks as direct countermeasures to the deadly triad, the overestimation bias that motivates Double DQN, and the dueling, prioritized, and Rainbow refinements. The engineering that scaled value-based RL to Atari.
-
Policy Gradient Foundations
Optimizing a parameterized stochastic policy directly by gradient ascent on expected return: the policy gradient theorem via the log-derivative trick, REINFORCE, and baselines as variance-reducing control variates. Policy gradients as Monte Carlo sensitivity analysis — and the advantage that bridges to actor-critic.
-
Actor-Critic, GAE, PPO, and TRPO
Turning REINFORCE into a stable, sample-reusing optimizer: actor-critic with a learned baseline, generalized advantage estimation as a bias–variance dial, and trust regions (TRPO/PPO) as step-size control in policy space. Why the clipped surrogate works, and why implementation details decide the score.
-
Off-Policy Continuous Control: DDPG, TD3, and SAC
Off-policy actor-critic for continuous actions: the deterministic policy gradient (DDPG), the twin-critic overestimation fix (TD3), and maximum-entropy RL (SAC). Why the soft-optimal policy is Boltzmann in the action-value, and how maximum-entropy RL is KL-regularized optimal control — the bridge from learning to Part II.
-
Reinforcement Learning in the Real World: A Robotics Survey
Where reinforcement learning has actually worked on real hardware (dexterous manipulation, legged locomotion, champion-level drone racing, and tokamak plasma control) and the recipe behind those successes: abundant simulation, domain randomization, and RL reserved for where models are hard or adaptation is required. The empirical case for RL, and why it sets up Part II.
Control
-
State-Space Models and Transfer Functions
The entry to control theory: the linear state-space model, the transfer function, and their equivalence. Two views of one dynamical system, the state-similarity invariance that makes the transfer function the coordinate-free object, and the discrete-time model, which is exactly the MDP dynamics that control theory assumes known.
-
Stability, Controllability, and Observability
The structural properties of a linear model knowable before any controller exists: internal stability via eigenvalues and the Lyapunov equation, controllability and observability as reachability and state-identifiability, and the duality that makes them one theory, and makes the linear-quadratic regulator and the Kalman estimator one computation.
-
Linear-Quadratic Regulation: The Exact Dynamic Program
The linear-quadratic regulator as exact dynamic programming: a quadratic value function, the Riccati recursion as Chapter 1's Bellman optimality equation in coordinates, the linear-feedback optimal policy, the infinite-horizon algebraic Riccati equation, and the LQG separation principle — with Doyle's warning that optimal output feedback carries no guaranteed stability margins.
-
Nonlinear Control: Lyapunov Design, Feedback Linearization, and Sliding Modes
When the plant is nonlinear, eigenvalues and Riccati equations describe only a local picture. This chapter builds the tools that replace them: Lyapunov's direct method and LaSalle's invariance principle as global stability certificates, feedback linearization that cancels the nonlinearity by coordinate change, sliding-mode control that enforces a surface in finite time and is robust to matched uncertainty, and input-to-state stability and backstepping as the constructive bridge to robust and adaptive design — with the Lyapunov function read as the control-theoretic cousin of the reinforcement-learning value function.