Part I · Foundations Week 6 Published dqn.py test_dqn.py

The DQN Family

Making approximate Q-learning stable enough for pixels: experience replay and target networks as direct countermeasures to the deadly triad, the overestimation bias that motivates Double DQN, and the dueling, prioritized, and Rainbow refinements. The engineering that scaled value-based RL to Atari.

On this page
  1. From Q-learning to a deep Q-network
  2. Experience replay
  3. Target networks
  4. Overestimation and Double DQN
  5. Dueling, prioritized replay, and Rainbow
  6. From states to pixels
  7. The dynamic-programming bridge
  8. What’s next
  9. Exercises
  10. Companion code

The DQN Family

Where we are. Chapter 4 gave us Q-learning; Chapter 5 warned that combining bootstrapping, off-policy data, and function approximation — the deadly triad — can diverge. The deep Q-network is that exact combination (Q-learning, off-policy, with a neural-network approximator), and naively it does diverge. What made it work on Atari from pixels — first as a 2013 workshop result Mnih et al. (2013) , then at human level Mnih et al. (2015) — was not new theory but two pieces of engineering that tame the triad’s two instabilities: experience replay and a target network. This chapter is about that engineering — and the overestimation bias that the next refinement, Double DQN, corrects.

From Q-learning to a deep Q-network

Replace the Q-table with a network Qθ(s,a)Q_\theta(s,a) and fit it to the sampled Bellman optimality target. DQN minimizes, over transitions (s,a,r,s)(s,a,r,s') drawn from a replay buffer D\mathcal{D},

L(θ)=E(s,a,r,s)D[(r+γmaxaQθ(s,a)Qθ(s,a))2],L(\theta) = \E_{(s,a,r,s')\sim\mathcal{D}}\Big[\big(\, r + \discount \max_{a'} Q_{\theta^-}(s',a') - Q_\theta(s,a)\,\big)^2\Big],

where θ\theta^- are the target-network weights. As in Chapter 5 this is a semi-gradient method: we differentiate only Qθ(s,a)Q_\theta(s,a), treating the target as fixed. Two structural problems would sink naive online training, and DQN answers each:

  • Correlated data. Consecutive transitions in a trajectory are highly correlated, violating the i.i.d. assumption gradient descent leans on.
  • A moving target. The regression target uses the same network being updated, so every gradient step shifts the target it was chasing — the bootstrap instability of the triad.

Experience replay

The first fix stores each transition (s,a,r,s)(s,a,r,s') in a fixed-capacity buffer D\mathcal{D} and trains on uniformly sampled minibatches rather than the latest transition. Three benefits follow: sampling across many past episodes decorrelates the minibatch toward the i.i.d. regime; each transition is reused many times, improving sample efficiency; and the update distribution becomes the buffer’s mixture rather than the current policy’s trajectory, re-weighting away from the pathological off-policy distributions that drive divergence.

The idea is old — it is the model-free, sampled descendant of the prioritized sweeping of Chapter 2 — and its prioritized variant returns shortly.

Target networks

The second fix freezes the bootstrap. DQN keeps a separate copy QθQ_{\theta^-} of the network, holds it fixed while training QθQ_\theta, and refreshes θθ\theta^- \leftarrow \theta only every CC steps (a hard update) or by a slow Polyak average θτθ+(1τ)θ\theta^- \leftarrow \tau\theta + (1-\tau)\theta^- (a soft update). Between refreshes the target r+γmaxaQθ(s,a)r + \discount\max_{a'}Q_{\theta^-}(s',a') is a fixed function, so each interval is an ordinary supervised regression toward a stationary target — exactly the stability the moving-target triad destroyed.

The slower θ\theta^- moves, the more stable and the slower the learning — the central DQN tuning trade-off.

Overestimation and Double DQN

One bias survives even with replay and a target network: the max\max in the target overestimates. Because the network’s action-values are noisy estimates, taking their maximum is systematically too high.

Proposition 6.1 (The max overestimates).

Let Q^(s,)\widehat{Q}(s',\cdot) be unbiased estimates of the true values q(s,)q(s',\cdot), i.e. E[Q^(s,a)]=q(s,a)\E[\widehat{Q}(s',a)] = q(s',a) for each aa. Then

E[maxaQ^(s,a)]    maxaE[Q^(s,a)]=maxaq(s,a),\E\big[\max_{a}\widehat{Q}(s',a)\big] \;\ge\; \max_{a}\E\big[\widehat{Q}(s',a)\big] = \max_a q(s',a),

with strict inequality whenever two or more actions’ estimates have positive-variance overlap. The bootstrap target therefore inherits a positive bias.

Proof.

The function xmaxaxax \mapsto \max_a x_a is convex (a pointwise maximum of linear maps). Jensen’s inequality for a convex function gives E[maxaQ^(s,a)]maxaE[Q^(s,a)]\E[\max_a \widehat{Q}(s',a)] \ge \max_a \E[\widehat{Q}(s',a)], and the right side is maxaq(s,a)\max_a q(s',a) by unbiasedness. Jensen is an equality only where the convex function is affine along the distribution’s support; the max\max has a kink exactly where the maximizing action changes, so any noise that makes the argmax\arg\max random makes the inequality strict. \qquad\blacksquare

Double DQN Hasselt et al. (2016) removes most of this bias by decoupling action selection from evaluation: pick the next action with the online network but evaluate it with the target network,

yDouble=r+γQθ ⁣(s,arg maxaQθ(s,a)).y^{\text{Double}} = r + \discount\, Q_{\theta^-}\!\big(s',\, \argmax_{a'} Q_\theta(s',a')\big).

The selecting and evaluating estimates now have independent errors, so they no longer conspire to inflate the maximum — a one-line change that measurably improves scores.

Dueling, prioritized replay, and Rainbow

Three further refinements round out the family. The dueling architecture Wang et al. (2016) splits the network into a state-value stream V(s)V(s) and an advantage stream A(s,a)A(s,a), recombined as Q(s,a)=V(s)+(A(s,a)1AaA(s,a))Q(s,a) = V(s) + \big(A(s,a) - \tfrac{1}{\lvert\actionspace\rvert}\sum_{a'}A(s,a')\big), so the agent can learn a state is good without estimating every action precisely. Prioritized experience replay Schaul et al. (2016) samples transitions in proportion to their Bellman-residual magnitude — Chapter 2’s prioritized sweeping, now over replayed transitions — with importance weights to correct the sampling bias. Rainbow Hessel et al. (2018) combines six such improvements (double, dueling, prioritized replay, multi-step returns, distributional values, and noisy exploration) and ablates each, showing they are largely complementary.

From states to pixels

DQN’s headline result was learning from raw Atari frames on the Arcade Learning Environment Bellemare et al. (2013) . The pixel pipeline adds its own engineering: grayscale and downsample each frame, stack four frames so velocity is observable (a single frame is not Markov), skip frames, clip rewards to {1,0,+1}\{-1,0,+1\}, and read the stack with a convolutional network. None of this changes the algorithm — it changes the features the same loss is regressed on.

Reliable comparison across all this machinery is itself hard — reproducibility studies and standardized implementations Raffin et al. (2021) exist precisely because small details swing results, a theme Week 8 returns to.

The dynamic-programming bridge

DQN is the deadly triad (Chapter 5) survived by engineering rather than dissolved by theory. The regression target is still the sampled Bellman optimality backup of Chapter 1; the two additions each blunt one edge of the triad. Replay re-weights and decorrelates the update distribution — the off-policy edge — and is the model-free heir to prioritized sweeping (Chapter 2). The target network freezes the bootstrap operator for CC steps — the moving-target edge — converting divergent fixed-point chasing into a sequence of stationary regressions. The discount γ\discount that guaranteed convergence with a table now only bounds the per-interval target; stability is bought, not proved.

What’s next

  • Week 7 changes the object of optimization entirely: instead of learning values and acting greedily, policy-gradient methods parameterize and optimize the policy directly, sidestepping the max\max and its overestimation, and extending naturally to continuous actions.

Exercises

  1. (Prove) Show E[maxaQ^(s,a)]maxaq(s,a)\E[\max_a \widehat{Q}(s',a)] \ge \max_a q(s',a) for unbiased Q^\widehat{Q}, and state when the inequality is strict (Prop. 6.1).

    Solution

    maxa\max_a is convex, so Jensen gives E[maxaQ^(s,a)]maxaE[Q^(s,a)]=maxaq(s,a)\E[\max_a\widehat{Q}(s',a)] \ge \max_a\E[\widehat{Q}(s',a)] = \max_a q(s',a). It is strict whenever the noise makes the argmax\arg\max random (two actions’ estimates overlap with positive probability), because the max\max is non-affine across the kink where the maximizer switches.

  2. (Derive) Write the Double DQN target and explain why it reduces the overestimation of Proposition 6.1.

    Solution

    yDouble=r+γQθ(s,arg maxaQθ(s,a))y^{\text{Double}} = r + \discount Q_{\theta^-}(s', \argmax_{a'}Q_\theta(s',a')). Selection uses QθQ_\theta, evaluation uses QθQ_{\theta^-}; their estimation errors are (largely) independent, so the action chosen as best by one network is not automatically assigned an inflated value by the same network. The E[max]\E[\max] bias becomes the much smaller bias of evaluating a possibly-suboptimal action.

  3. (Compute) A replay buffer of capacity 3 receives transitions t1,,t5t_1,\dots,t_5 in order. Which are stored after t5t_5, and why does a hard target update at step CC leave θ\theta^- momentarily equal to θ\theta?

    Solution

    A circular buffer of capacity 3 keeps the three most recent, {t3,t4,t5}\{t_3,t_4,t_5\} (t1,t2t_1,t_2 overwritten). A hard update copies θθ\theta^- \leftarrow \theta, so immediately after step CC the two networks are identical; they diverge again as θ\theta updates over the next CC steps while θ\theta^- is held fixed.

  4. (Implement) In the companion, verify the replay buffer, target-update, and TD-target components, and that DQN learns CartPole well above the random-return baseline within a fixed step budget.

    Solution

    See experiments/python/week06/test_dqn.py: circular-buffer overwrite and sample shapes; hard/soft target updates; the done-masked TD target and the Double-DQN selection/evaluation split; the empirical overestimation of the plain max; and a seeded CartPole run whose mean return rises far above the ~22 random baseline.

  5. (Extend) Add Double and Dueling variants and measure the overestimation gap (the plain-max target minus the Double target) over training.

    Solution

    The companion exposes double and dueling flags; the plain-max target sits above the Double target early in training (when value estimates are noisiest) and the gap shrinks as the network sharpens — the empirical face of Proposition 6.1.

Companion code

The Week-6 companion lives at experiments/python/week06/ and is the chapter’s first PyTorch code. Its correctness suite follows the repo’s deep-RL convention: fast, deterministic component tests for the pieces, plus a seeded simple-environment convergence check — heavy pixel environments are a deferred @slow showcase, not a graded test, in line with the 8 GB GPU budget.

  • dqn.py — a minimal DQN on CartPole-v1: a circular ReplayBuffer, an MLP QNetwork (and a DuelingQNetwork), an exposed td_target (plain and Double), hard/soft target updates, and the training loop, with double/dueling flags.
  • test_dqn.py — component-correctness tests (replay overwrite + sample shapes; target-network hard/soft updates; the done-masked Bellman target; the Double-DQN selection/evaluation split; the max-overestimation of Prop. 6.1) plus a seeded CartPole run asserting the mean return clears the random baseline by a wide margin.
# component tests + a seeded CartPole learning check (PyTorch; ~1-2 min on CPU)
PYTHONPATH=. pytest experiments/python/week06/test_dqn.py -q

# worked CartPole training run (prints the learning curve summary)
PYTHONPATH=. python experiments/python/week06/dqn.py --double --episodes 400