Part I · Foundations · Week 10

Reinforcement Learning in the Real World: A Robotics Survey

Where reinforcement learning has actually worked on real hardware — dexterous manipulation, legged locomotion, champion drone racing, and tokamak plasma control — and the recipe behind those successes: abundant simulation, domain randomization, and RL reserved for where models are hard or adaptation is required. The empirical case for RL, and why it sets up Part II.

Where we are. Part I built the algorithms — dynamic programming, Monte Carlo and temporal difference, deep value learning, and policy-gradient and actor-critic methods through continuous control. This capstone asks the empirical question those chapters deferred: where has reinforcement learning actually worked on real hardware, and what conditions made it work? The honest answer is both encouraging and narrowing — a handful of genuine, high-profile successes that share a remarkably consistent recipe, and which together explain when to reach for RL and when classical control still wins. This is a reading-and-synthesis week: no new algorithm, a small evidence table, and the argument it supports.

The cases

Dexterous manipulation. For OpenAI’s Rubik’s-Cube hand, Akkaya et al. (2019) trained a Shadow Hand policy entirely in simulation and transferred it to physical hardware via automatic domain randomization (ADR), which progressively widens the distribution of simulated physics (friction, masses, latencies) until the real robot looks like just another sample. The policy solved a Rubik’s cube one-handed despite never touching a real cube during training.
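The range-widening loop behind ADR can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure from Akkaya et al. (2019): the single friction parameter, the thresholds, the step size, and the stand-in `toy_score` function (a fixed policy that copes only within 0.5 of nominal friction) are all hypothetical. In this sketch the policy never improves, so the range stops growing once its edges reach the point where the policy fails; with a policy that keeps training, the edges would keep moving outward.

```python
import random

NOMINAL = 1.0  # hypothetical nominal friction coefficient


def toy_score(friction):
    """Stand-in for a policy evaluation: succeeds only near nominal friction."""
    return 1.0 if abs(friction - NOMINAL) < 0.5 else 0.0


def adr_step(lo, hi, score_fn, rng, step=0.05, widen_at=0.8, narrow_at=0.4):
    """One ADR-style update of the sampling range [lo, hi] for one parameter."""
    probe_low = rng.random() < 0.5            # test one boundary, chosen at random
    score = score_fn(lo if probe_low else hi)
    if score >= widen_at:
        delta = step                           # policy copes at the edge: widen
    elif score <= narrow_at:
        delta = -step                          # policy fails at the edge: narrow
    else:
        delta = 0.0                            # in between: leave the range alone
    if probe_low:
        lo = min(NOMINAL, lo - delta)          # never cross the nominal value
    else:
        hi = max(NOMINAL, hi + delta)
    return lo, hi


def run_adr(steps=400, seed=0):
    """Start with no randomization and let the range grow."""
    rng = random.Random(seed)
    lo = hi = NOMINAL
    for _ in range(steps):
        lo, hi = adr_step(lo, hi, toy_score, rng)
    return lo, hi


if __name__ == "__main__":
    print(run_adr())
```

Probing only the boundaries is the design choice worth noticing: the range is a curriculum, and it expands exactly as fast as performance at its hardest edge allows.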

Legged locomotion. The most mature success story. Lee et al. (2020) trained a blind quadruped to traverse challenging terrain by teacher–student distillation: a teacher policy with privileged access to the full simulator state supervises a proprioception-only student that can be deployed on hardware. Miki et al. (2022) added exteroception, producing robust perceptive locomotion over stairs, gaps, and obstacles outdoors. Radosavovic et al. (2024) carried the same sim-to-real recipe to a full-size humanoid, walking zero-shot in outdoor environments.

Agile flight. Kaufmann et al. (2023) trained an RL policy that beat human world champions at physical first-person-view drone racing, a regime where split-second control at the edge of the dynamics envelope defeats hand-tuned controllers.

Scientific and industrial control. The standout non-locomotion case: Degrave et al. (2022) used deep RL to magnetically control the plasma shape in a real tokamak, coordinating the machine’s 19 control coils against many simultaneous constraints. It is a problem where accurate first-principles control is genuinely hard and the payoff of learned control is large.

Human-in-the-loop methods (collecting corrective feedback during training) are an active frontier for precise manipulation, pushing RL toward sub-millimeter industrial tasks; the cross-cutting survey of deployed systems by Tang et al. (2025) catalogues these and the recurring sim-to-real patterns.

An evidence table

The roadmap’s deliverable for this week is an evidence table, not code — laying the cases side by side exposes the shared recipe.

| System | Task | Trained in | Real hardware | Sim-to-real / adaptation | Control baseline beaten |
|---|---|---|---|---|---|
| Rubik’s hand (Akkaya et al., 2019) | dexterous manipulation | MuJoCo sim | Shadow Hand | automatic domain randomization | no prior model-based in-hand dexterity controller |
| Quadruped (Lee et al., 2020) | blind rough-terrain locomotion | rigid-body sim | ANYmal | teacher–student distillation | model-based gait controllers |
| Quadruped, wild (Miki et al., 2022) | perceptive locomotion | rigid-body sim | ANYmal | proprio + exteroception, randomization | model-based perceptive control |
| Humanoid (Radosavovic et al., 2024) | bipedal walking | sim | full-size humanoid | zero-shot sim-to-real | model-based whole-body control |
| Drone racing (Kaufmann et al., 2023) | agile FPV flight | sim + identified dynamics | racing quadrotor | system identification + randomization | human world champions (and prior autonomous baselines) |
| Tokamak (Degrave et al., 2022) | plasma magnetic control | physics simulator | TCV tokamak | sim-to-real on a calibrated model | conventional multi-loop controllers |

The empirical case for RL

Read down the table and the same three conditions recur, the conditions under which RL is the right tool (Tang et al., 2025):

  1. Model accuracy is hard. Contact-rich locomotion, dexterous manipulation, and plasma dynamics resist accurate first-principles models. RL learns from interaction what is painful to derive.
  2. Adaptation is required. Terrain, disturbances, and hardware variation demand a policy robust across conditions; domain randomization turns that need into a training distribution the policy generalizes over.
  3. Simulation is abundant. Every case generates its training data in a fast, parallel simulator — the millions of samples RL needs are free there and ruinously expensive on hardware.
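Conditions 2 and 3 can be stated together as one objective. The following is a common way to write domain randomization, offered as a sketch rather than the formulation of any one cited paper; the symbols are notation introduced here: $\xi$ for the simulator's physical parameters (friction, mass, latency), $p(\xi)$ for the randomization distribution, $P_\xi$ for the simulated dynamics under those parameters, and $\gamma$ for the discount factor from Part I.

```latex
J_{\mathrm{DR}}(\theta)
  = \mathbb{E}_{\xi \sim p(\xi)}\,
    \mathbb{E}_{\tau \sim (\pi_\theta,\, P_\xi)}
    \left[ \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right]
```

The outer expectation is what makes abundant simulation necessary: every draw of $\xi$ is a different world to collect rollouts in. Sim-to-real transfer then rests on the assumption that the real system behaves like a plausible sample from $p(\xi)$.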

Where these fail to hold — no faithful simulator, safety-critical systems with no simulation budget, or problems with good analytic models — classical and optimal control remain the better choice. The successes are real but conditional, and naming the conditions is the point of the survey.

The simulation paradox, and the bridge to Part II

The deepest lesson is a paradox. These are triumphs of model-free policy learning — the deployed policy plans through no learned dynamics model — yet not one of them learned on hardware from scratch. Each trained inside a simulator (a model) and randomized it to cross the reality gap. Model-free RL conquers the physical world precisely by leaning on an abundant, deliberately imperfect model. Stated that way, the question Part II answers comes into focus: when you already have a good model, why learn around it? Control theory — Lyapunov stability, the LQR’s closed-form optimal feedback, model predictive control’s online re-optimization — exploits the model directly, with guarantees RL cannot offer. Part III then fuses the two: MPC that learns its model or cost, and RL warm-started or constrained by a controller. The robotics successes are where the model-free spine of Part I meets reality; the model-based spine of Part II is the other half of the same fixed point.

What’s next

  • Part II (Week 11+) changes register from learning to control theory: dynamical systems and Lyapunov stability, the linear-quadratic regulator as the exactly-solvable optimal control problem (the Bellman fixed point in closed form), and model predictive control as online approximate dynamic programming. The discount-contraction spine of Part I reappears as the Riccati equation and the receding horizon.
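As a preview of that closed form, here is a minimal sketch for a scalar system $x_{t+1} = a x_t + b u_t$ with stage cost $q x_t^2 + r u_t^2$. It assumes the undiscounted case and a quadratic value guess $V_k(x) = p_k x^2$; the function names and the numbers in the usage example are illustrative only. Minimizing the Bellman backup $q x^2 + r u^2 + p_k (a x + b u)^2$ over $u$ gives $u = -\frac{a b p_k}{r + b^2 p_k} x$, and substituting it back yields the Riccati recursion iterated below.

```python
def riccati_value_iteration(a, b, q, r, iters=200):
    """Value iteration on V_k(x) = p_k * x**2 for scalar LQR, starting at p_0 = 0."""
    p = 0.0
    for _ in range(iters):
        # Bellman backup with the minimizing control substituted in.
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    gain = a * b * p / (r + b * b * p)  # optimal feedback is u = -gain * x
    return p, gain


def riccati_residual(p, a, b, q, r):
    """How far p is from being a fixed point of the Riccati recursion."""
    return p - (q + a * a * p - (a * b * p) ** 2 / (r + b * b * p))


if __name__ == "__main__":
    a, b, q, r = 1.2, 1.0, 1.0, 1.0   # open loop is unstable because |a| > 1
    p, gain = riccati_value_iteration(a, b, q, r)
    print("p =", p, "gain =", gain, "closed-loop pole =", a - b * gain)
    print("fixed-point residual =", riccati_residual(p, a, b, q, r))
```

The loop is ordinary value iteration; what changes from Part I is that the value function is a single number `p`, so the fixed point can be checked exactly instead of approximated by a network.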

Exercises

  1. (Compute) Add one more deployed RL system to the evidence table (e.g. a warehouse, autonomous-driving, or data-center cooling deployment) and fill all six columns from its paper. Which of the three conditions does it satisfy?

    Solution

    Most deployed cases satisfy all three (hard model, adaptation, abundant sim); data-center cooling is the interesting partial case — a learned model substitutes for a hard-to-build simulator, stretching condition 3. The exercise is to make the classification explicit and defend it from the paper’s methods section.

  2. (Derive) From the six cases, state the three enabling conditions precisely, and give a concrete robotics task that fails at least one — predicting that classical control should win there.

    Solution

    A safety-critical task with no faithful simulator and a tight real-data budget (e.g. a one-off surgical manipulator) fails condition 3: without abundant simulation, RL’s sample appetite cannot be met and the reality gap cannot be closed, so model-based / optimal control with formal guarantees is the appropriate tool.

  3. (Extend) Implement a tiny domain-randomization toy: train a policy (e.g. the Week-7 REINFORCE or Week-9 TD3 companion) on a Gymnasium env whose dynamics are randomized each episode (mass, gravity, or action scale), and measure its robustness to a held-out dynamics setting versus a policy trained without randomization.

    Solution

    Randomizing a physics parameter each reset trains a policy over a distribution of dynamics; on a held-out setting it should degrade far less than the non-randomized policy, reproducing in miniature the sim-to-real mechanism every case above relies on. This is the one optional code task of an otherwise reading-only week.

  4. (Extend) Pick one case and argue where a Part-II model-based controller (LQR or MPC) could replace or augment the learned policy. What would each approach require?

    Solution

    The tokamak is the natural candidate: a calibrated plasma model already exists, so an MPC could in principle re-optimize the coil currents online — at the cost of solving a constrained optimization every control step, which is exactly what RL amortizes into a fast policy. The trade is online compute and model fidelity (MPC) versus training cost and the reality gap (RL) — the Part III hybrid uses both.
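For exercise 3, the mechanism can be prototyped before reaching for Gymnasium. The sketch below deliberately swaps in simpler pieces, all of them assumptions made for illustration: a hand-written scalar system $x_{t+1} = x_t + s\,u_t$ stands in for the Gymnasium env, the action scale $s$ is the randomized dynamics parameter, and a one-parameter linear policy $u = -g x$ tuned by grid search stands in for REINFORCE or TD3. The held-out scale of 2.2 lies inside the randomized range but far from the nominal value of 1.0.

```python
import random


def episode_cost(gain, action_scale, x0=1.0, horizon=30):
    """Sum of squared states for the policy u = -gain * x on x' = x + action_scale * u."""
    x, cost = x0, 0.0
    for _ in range(horizon):
        u = -gain * x
        x = x + action_scale * u
        cost += x * x
    return cost


def train_gain(scales, candidates):
    """Grid search: pick the gain with the lowest average cost over the training scales."""
    def avg_cost(gain):
        return sum(episode_cost(gain, s) for s in scales) / len(scales)
    return min(candidates, key=avg_cost)


def run_experiment(seed=0, episodes=100, held_out_scale=2.2):
    rng = random.Random(seed)
    candidates = [i / 100 for i in range(1, 201)]        # gains 0.01 .. 2.00
    nominal_scales = [1.0] * episodes                     # no randomization
    random_scales = [rng.uniform(0.5, 2.5) for _ in range(episodes)]  # one draw per episode
    g_nominal = train_gain(nominal_scales, candidates)
    g_random = train_gain(random_scales, candidates)
    return {
        "gain_nominal": g_nominal,
        "gain_randomized": g_random,
        "nominal_cost_nominal": episode_cost(g_nominal, 1.0),
        "nominal_cost_randomized": episode_cost(g_random, 1.0),
        "held_out_cost_nominal": episode_cost(g_nominal, held_out_scale),
        "held_out_cost_randomized": episode_cost(g_random, held_out_scale),
    }


if __name__ == "__main__":
    for name, value in run_experiment().items():
        print(f"{name}: {value:.4g}")
```

The nominal policy picks the aggressive gain that is perfect at scale 1.0 and goes unstable at 2.2, while the randomized policy settles on a more cautious gain that stays stable across the range. It pays for that robustness with a slightly higher cost at the nominal setting, which is the trade to look for when repeating the experiment with a learned policy.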