The DQN Family
Making approximate Q-learning stable enough for pixels: experience replay and target networks as direct countermeasures to the deadly triad, the overestimation bias that motivates Double DQN, and the dueling, prioritized, and Rainbow refinements. The engineering that scaled value-based RL to Atari.
Where we are. Chapter 4 gave us Q-learning; Chapter 5 warned that combining bootstrapping, off-policy data, and function approximation (the deadly triad) can diverge. The deep Q-network is exactly that combination (Q-learning, off-policy, with a neural-network approximator), and trained naively it does diverge. What made it work on Atari from pixels, first in the workshop paper of Mnih et al. (2013) and then at human level in Mnih et al. (2015), was not new theory but two pieces of engineering, each taming one of the triad's instabilities: experience replay and a target network. This chapter is about that engineering, and about the overestimation bias that the next refinement, Double DQN, corrects.
From Q-learning to a deep Q-network
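Replace the Q-table with a network $Q(s, a; \theta)$ and fit it to the sampled Bellman optimality target. DQN minimizes, over transitions $(s, a, r, s')$ drawn from a replay buffer $\mathcal{D}$,

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\big)^2\Big],$$

where $\theta^-$ are the target-network weights. As in Chapter 5 this is a semi-gradient method: we differentiate only $Q(s, a; \theta)$, treating the target as fixed. Two structural problems would sink naive online training, and DQN answers each: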
- Correlated data. Consecutive transitions in a trajectory are highly correlated, violating the i.i.d. assumption gradient descent leans on.
- A moving target. The regression target uses the same network being updated, so every gradient step shifts the target it was chasing — the bootstrap instability of the triad.
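To make the regression target concrete, here is a minimal NumPy sketch of the sampled Bellman target and the squared TD error on a toy minibatch. The function names, the array layout, and the `dones` mask (which drops the bootstrap term at terminal states) are assumptions made for illustration, not the companion's API; in an autodiff framework the target would additionally be detached so that only the online estimate is differentiated.

```python
import numpy as np


def td_target(rewards, next_q_target, dones, gamma=0.99):
    """Sampled target r + gamma * max_a' Q(s', a'; theta^-).

    next_q_target has shape (batch, n_actions) and comes from the target
    network. dones is 1.0 where s' is terminal, so nothing is bootstrapped.
    """
    max_next = next_q_target.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next


def dqn_loss(q_online, actions, targets):
    """Mean squared TD error on the actions that were actually taken.

    targets is treated as a constant here (the semi-gradient convention).
    """
    q_taken = q_online[np.arange(len(actions)), actions]
    return np.mean((targets - q_taken) ** 2)


# Toy minibatch of two transitions with two actions each (made-up numbers).
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])                      # second transition is terminal
next_q = np.array([[0.5, 2.0], [3.0, 1.0]])       # Q(s', . ; theta^-)
q_online = np.array([[1.0, 2.0], [0.5, 0.0]])     # Q(s, . ; theta)
actions = np.array([1, 0])

y = td_target(rewards, next_q, dones, gamma=0.9)  # [2.8, 0.0]
loss = dqn_loss(q_online, actions, y)
print(y, loss)
```

The second transition shows why the mask matters: without it the target would bootstrap from a state that has no future.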
Experience replay
The first fix stores each transition in a fixed-capacity buffer and trains on uniformly sampled minibatches rather than the latest transition. Three benefits follow: sampling across many past episodes decorrelates the minibatch toward the i.i.d. regime; each transition is reused many times, improving sample efficiency; and the update distribution becomes the buffer’s mixture rather than the current policy’s trajectory, re-weighting away from the pathological off-policy distributions that drive divergence. The idea is old — it is the model-free, sampled descendant of the prioritized sweeping of Chapter 2 — and its prioritized variant returns shortly.
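A fixed-capacity buffer with uniform sampling can be sketched with the standard library alone. The class name `UniformReplayBuffer` and the tuple layout `(s, a, r, s_next, done)` are hypothetical choices for this example and are not taken from the companion code.

```python
import random


class UniformReplayBuffer:
    """Fixed-capacity circular buffer with uniform minibatch sampling."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.storage = []
        self.position = 0                 # index of the next slot to write
        self.rng = random.Random(seed)

    def push(self, transition):
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            # Buffer is full: overwrite the oldest stored transition.
            self.storage[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        # Uniform over stored transitions, without replacement in a batch.
        return self.rng.sample(self.storage, batch_size)

    def __len__(self):
        return len(self.storage)


buffer = UniformReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.push((f"s{t}", 0, 1.0, f"s{t + 1}", False))

batch = buffer.sample(2)
stored_states = sorted(tr[0] for tr in buffer.storage)
print(stored_states)
```

After five pushes into three slots only the three most recent transitions remain, which is the overwrite behaviour the text describes.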
Target networks
The second fix freezes the bootstrap. DQN keeps a separate copy $\theta^-$ of the network weights, holds it fixed while training $\theta$, and refreshes it only every $C$ steps (a hard update, $\theta^- \leftarrow \theta$) or by a slow Polyak average $\theta^- \leftarrow (1 - \tau)\,\theta^- + \tau\,\theta$ with small $\tau$ (a soft update). Between refreshes the target is a fixed function, so each interval is an ordinary supervised regression toward a stationary target, which is exactly the stability the moving target destroyed. The slower $\theta^-$ moves, the more stable and the slower the learning: this is the central DQN tuning trade-off.
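The two refresh rules are short enough to write out. The sketch below represents each network's weights as a dictionary of NumPy arrays, which is an assumption made to keep the example framework-free; the function names and the value of `tau` are illustrative only.

```python
import numpy as np


def hard_update(target_params, online_params):
    """Hard update: copy every online parameter into the target network."""
    for name, value in online_params.items():
        target_params[name] = value.copy()


def soft_update(target_params, online_params, tau=0.005):
    """Polyak average: target <- (1 - tau) * target + tau * online."""
    for name, value in online_params.items():
        target_params[name] = (1.0 - tau) * target_params[name] + tau * value


online = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
target = {"w": np.zeros(2), "b": np.zeros(1)}

soft_update(target, online, tau=0.1)
after_soft = {k: v.copy() for k, v in target.items()}   # w = [0.1, 0.2]

hard_update(target, online)                              # target == online
online["w"][0] = 9.0                                     # target must not follow
print(after_soft["w"], target["w"])
```

The explicit `copy()` in the hard update is the design point: if the target merely aliased the online arrays, it would move with every gradient step and the freeze would be lost.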
Overestimation and Double DQN
One bias survives even with replay and a target network: the $\max$ in the target overestimates. Because the network's action-values are noisy estimates, their maximum is systematically too high in expectation.
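Proposition 6.1. Let $\hat{Q}(a)$ be unbiased estimates of the true values $Q(a)$, i.e. $\mathbb{E}[\hat{Q}(a)] = Q(a)$ for each $a$. Then

$$\mathbb{E}\Big[\max_a \hat{Q}(a)\Big] \ge \max_a Q(a),$$

with strict inequality whenever the maximizing action is random, that is, whenever the estimate of some other action exceeds that of a truly best action with positive probability. The bootstrap target therefore inherits a positive bias.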
Proof. The function $f(x) = \max_a x_a$ is convex (a pointwise maximum of linear maps). Jensen's inequality for a convex function gives $\mathbb{E}[f(\hat{Q})] \ge f(\mathbb{E}[\hat{Q}])$, and the right side is $\max_a Q(a)$ by unbiasedness. Jensen is an equality only where the convex function is affine along the distribution's support; the $\max$ has a kink exactly where the maximizing action changes, so any noise that makes the $\arg\max$ random makes the inequality strict.
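The inequality is easy to see numerically. The following Monte Carlo sketch assumes four actions with identical true value 1 and independent unit-variance Gaussian noise on each estimate; these numbers are illustrative choices, not taken from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.array([1.0, 1.0, 1.0, 1.0])   # every action is equally good
noise_std = 1.0
n_trials = 100_000

# Each row is one noisy but unbiased estimate of the four action-values.
estimates = true_q + noise_std * rng.standard_normal((n_trials, true_q.size))

mean_of_max = estimates.max(axis=1).mean()   # estimates E[max_a Q_hat(a)]
max_of_mean = estimates.mean(axis=0).max()   # estimates max_a E[Q_hat(a)]

print(f"E[max] ~ {mean_of_max:.3f}   max E ~ {max_of_mean:.3f}")
```

Each individual estimate averages to its true value, yet the average of the row-wise maximum sits well above 1: the bias comes entirely from taking the maximum over noise.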
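Double DQN, introduced by van Hasselt et al. (2016), removes most of this bias by decoupling action selection from evaluation: pick the next action with the online network but evaluate it with the target network,

$$y^{\text{Double}} = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta);\ \theta^-\big).$$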
The selecting and evaluating estimates now have largely independent errors, so they no longer conspire to inflate the maximum. It is a one-line change that measurably improves scores.
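The difference between the plain and Double targets is one indexing step. The sketch below uses hand-picked toy values in which the two networks disagree about the best next action; the function names and the `dones` mask are assumptions for illustration.

```python
import numpy as np


def plain_target(rewards, next_q_target, dones, gamma=0.99):
    """Plain DQN: the target network both selects and evaluates."""
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)


def double_target(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Double DQN: online network selects, target network evaluates."""
    best = next_q_online.argmax(axis=1)                     # selection
    evaluated = next_q_target[np.arange(len(best)), best]   # evaluation
    return rewards + gamma * (1.0 - dones) * evaluated


rewards = np.array([0.0, 1.0])
dones = np.array([0.0, 0.0])
next_q_online = np.array([[1.0, 3.0], [2.0, 0.0]])   # picks actions 1 and 0
next_q_target = np.array([[4.0, 2.0], [1.0, 5.0]])   # its own max is 4 and 5

y_plain = plain_target(rewards, next_q_target, dones, gamma=0.5)
y_double = double_target(rewards, next_q_online, next_q_target, dones, gamma=0.5)
print(y_plain, y_double)   # [2.0, 3.5] and [1.0, 1.5]
```

Because the evaluated value is one entry of the target network's row, the Double target can never exceed the plain-max target computed from the same target network.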
Dueling, prioritized replay, and Rainbow
Three further refinements round out the family. The dueling architecture of Wang et al. (2016) splits the network into a state-value stream $V(s)$ and an advantage stream $A(s, a)$, recombined as $Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')$, so the agent can learn that a state is good without estimating every action precisely. Prioritized experience replay, from Schaul et al. (2016), samples transitions in proportion to their Bellman-residual magnitude (the absolute TD error); it is Chapter 2's prioritized sweeping, now over replayed transitions, with importance weights to correct the sampling bias. Rainbow, from Hessel et al. (2018), combines six such improvements (double, dueling, prioritized replay, multi-step returns, distributional values, and noisy exploration) and ablates each, showing they are largely complementary.
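Two of these refinements reduce to a few lines of arithmetic. The sketch below assumes the mean-subtracted dueling recombination and the proportional form of prioritized sampling; the exponents `alpha` and `beta`, the small constant `eps`, and the normalisation of weights by the largest weight in the sampled batch are illustrative choices rather than settings taken from the chapter.

```python
import numpy as np


def dueling_q(value, advantage):
    """Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    return value[:, None] + advantage - advantage.mean(axis=1, keepdims=True)


def prioritized_probs(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities proportional to (|TD error| + eps) ** alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()


def importance_weights(probs, indices, beta=0.4):
    """Weights (N * P(i)) ** (-beta), scaled so the largest one equals 1."""
    n = len(probs)
    weights = (n * probs[indices]) ** (-beta)
    return weights / weights.max()


# Dueling: one state, three actions.
value = np.array([2.0])
advantage = np.array([[1.0, -1.0, 0.0]])
q = dueling_q(value, advantage)            # [[3.0, 1.0, 2.0]]

# Prioritized replay: four stored transitions with made-up TD errors.
td_errors = np.array([0.1, 1.0, 2.0, 0.5])
probs = prioritized_probs(td_errors)
rng = np.random.default_rng(0)
idx = rng.choice(len(probs), size=2, p=probs)
w = importance_weights(probs, idx)
print(q, probs, idx, w)
```

Transitions that are sampled more often receive smaller weights in the loss, which is how the importance correction offsets the non-uniform sampling.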
From states to pixels
DQN's headline result was learning from raw Atari frames on the Arcade Learning Environment of Bellemare et al. (2013). The pixel pipeline adds its own engineering: grayscale and downsample each frame, stack four frames so velocity is observable (a single frame is not Markov), skip frames, clip rewards to $[-1, 1]$, and read the stack with a convolutional network. None of this changes the algorithm; it changes the features the same loss is regressed on. Reliable comparison across all this machinery is itself hard: reproducibility studies and standardized implementations such as that of Raffin et al. (2021) exist precisely because small details swing results, a theme Week 8 returns to.
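A rough NumPy sketch of the observation side of that pipeline follows. It assumes a 210x160 RGB input frame and an 84x84 output, uses nearest-neighbour index selection where real pipelines typically use proper image interpolation, and omits frame skipping; `preprocess`, `FrameStack`, and `clip_reward` are hypothetical names for this example.

```python
from collections import deque

import numpy as np

LUMA = np.array([0.299, 0.587, 0.114], dtype=np.float32)


def preprocess(frame_rgb, out_size=84):
    """Grayscale an (H, W, 3) frame and downsample it to out_size x out_size."""
    gray = frame_rgb.astype(np.float32) @ LUMA          # shape (H, W)
    h, w = gray.shape
    rows = np.arange(out_size) * h // out_size          # nearest-neighbour rows
    cols = np.arange(out_size) * w // out_size          # nearest-neighbour cols
    return gray[np.ix_(rows, cols)] / 255.0


class FrameStack:
    """Keeps the k most recent processed frames as one (k, H, W) observation."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        for _ in range(self.k):                         # pad with the first frame
            self.frames.append(frame)
        return np.stack(self.frames)

    def step(self, frame):
        self.frames.append(frame)                       # oldest frame drops out
        return np.stack(self.frames)


def clip_reward(reward):
    """Clip a raw game reward into [-1, 1]."""
    return float(np.clip(reward, -1.0, 1.0))


rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)  # fake frame
stack = FrameStack(k=4)
obs = stack.reset(preprocess(raw))
obs = stack.step(preprocess(raw))
print(obs.shape, obs.dtype, clip_reward(7.0))
```

The stacked array is what a convolutional network would read; the loss and the targets are unchanged.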
The dynamic-programming bridge
DQN is the deadly triad (Chapter 5) survived by engineering rather than dissolved by theory. The regression target is still the sampled Bellman optimality backup of Chapter 1; the two additions each blunt one edge of the triad. Replay re-weights and decorrelates the update distribution (the off-policy edge) and is the model-free heir to prioritized sweeping (Chapter 2). The target network freezes the bootstrap operator for $C$ steps (the moving-target edge), converting divergent fixed-point chasing into a sequence of stationary regressions. The discount $\gamma < 1$ that guaranteed convergence with a table now only bounds the per-interval target; stability is bought, not proved.
What’s next
- Week 7 changes the object of optimization entirely: instead of learning values and acting greedily, policy-gradient methods parameterize and optimize the policy directly, sidestepping the $\max$ and its overestimation, and extending naturally to continuous actions.
Exercises
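- (Prove) Show $\mathbb{E}[\max_a \hat{Q}(a)] \ge \max_a Q(a)$ for unbiased $\hat{Q}$, and state when the inequality is strict (Prop. 6.1).
  Solution
  $f(x) = \max_a x_a$ is convex, so Jensen gives $\mathbb{E}[\max_a \hat{Q}(a)] \ge \max_a \mathbb{E}[\hat{Q}(a)] = \max_a Q(a)$. It is strict whenever the noise makes the $\arg\max$ random (two actions' estimates overlap with positive probability), because the $\max$ is non-affine across the kink where the maximizer switches.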
- (Derive) Write the Double DQN target and explain why it reduces the overestimation of Proposition 6.1.
  Solution
  $y = r + \gamma\, Q\big(s', \arg\max_{a'} Q(s', a'; \theta);\ \theta^-\big)$. Selection uses $\theta$, evaluation uses $\theta^-$; their estimation errors are (largely) independent, so the action chosen as best by one network is not automatically assigned an inflated value by the same network. The bias becomes the much smaller bias of evaluating a possibly-suboptimal action.
- (Compute) A replay buffer of capacity 3 receives transitions $e_1, e_2, e_3, e_4$ in order. Which are stored after $e_4$, and why does a hard target update at step $C$ leave $\theta^-$ momentarily equal to $\theta$?
  Solution
  A circular buffer of capacity 3 keeps the three most recent, $e_2, e_3, e_4$ ($e_1$ overwritten). A hard update copies $\theta^- \leftarrow \theta$, so immediately after step $C$ the two networks are identical; they diverge again as $\theta$ updates over the next $C$ steps while $\theta^-$ is held fixed.
- (Implement) In the companion, verify the replay buffer, target-update, and TD-target components, and that DQN learns CartPole well above the random-return baseline within a fixed step budget.
  Solution
  See `experiments/python/week06/test_dqn.py`: circular-buffer overwrite and sample shapes; hard/soft target updates; the done-masked TD target and the Double-DQN selection/evaluation split; the empirical overestimation of the plain max; and a seeded CartPole run whose mean return rises far above the ~22 random baseline.
- (Extend) Add Double and Dueling variants and measure the overestimation gap (the plain-max target minus the Double target) over training.
  Solution
  The companion exposes `double` and `dueling` flags; the plain-max target sits above the Double target early in training (when value estimates are noisiest) and the gap shrinks as the network sharpens. This is the empirical face of Proposition 6.1.
Companion code
The Week-6 companion lives at experiments/python/week06/ and is the chapter’s first
PyTorch code. Its correctness suite follows the repo’s deep-RL convention: fast,
deterministic component tests for the pieces, plus a seeded simple-environment
convergence check — heavy pixel environments are a deferred @slow showcase, not a
graded test, in line with the 8 GB GPU budget.
- `dqn.py`: a minimal DQN on `CartPole-v1`: a circular `ReplayBuffer`, an MLP `QNetwork` (and a `DuelingQNetwork`), an exposed `td_target` (plain and Double), hard/soft target updates, and the training loop, with `double`/`dueling` flags.
- `test_dqn.py`: component-correctness tests (replay overwrite + sample shapes; target-network hard/soft updates; the done-masked Bellman target; the Double-DQN selection/evaluation split; the max-overestimation of Prop. 6.1) plus a seeded CartPole run asserting the mean return clears the random baseline by a wide margin.
# component tests + a seeded CartPole learning check (PyTorch; ~1-2 min on CPU)
PYTHONPATH=. pytest experiments/python/week06/test_dqn.py -q
# worked CartPole training run (prints the learning curve summary)
PYTHONPATH=. python experiments/python/week06/dqn.py --double --episodes 400