Part I · Foundations Week 6 Published dqn.py test_dqn.py

The DQN Family

Making approximate Q-learning stable enough for pixels: experience replay and target networks as direct countermeasures to the deadly triad, the overestimation bias that motivates Double DQN, and the dueling, prioritized, and Rainbow refinements. The engineering that scaled value-based RL to Atari.

Where we are. Chapter 4 gave us Q-learning; Chapter 5 warned that combining bootstrapping, off-policy data, and function approximation (the deadly triad) can diverge. The deep Q-network is exactly that combination: Q-learning, trained off-policy, with a neural-network approximator. Trained naively, it is unstable and can diverge. What made it work on Atari from pixels, first as a 2013 workshop result, Mnih et al. (2013), then at human level, Mnih et al. (2015), was not new theory but two pieces of engineering, each aimed at one of the triad’s two instabilities: experience replay and a target network. This chapter is about that engineering, and about the overestimation bias that the next refinement, Double DQN, corrects.

Chapter 6 — at a glance

Goal. Write the DQN loss as a semi-gradient regression; understand experience replay and target networks as countermeasures to the deadly triad’s two failure modes; prove the max-operator overestimation bias and read Double DQN off it; and place dueling, prioritized replay, and Rainbow.

Reading time. ~35 minutes; ~55 with the companion and exercises.

Key insight — the DP bridge. DQN keeps the Bellman optimality backup of Chapter 1 as its regression target, but the deadly triad (Chapter 5) means that target both moves (it depends on the weights being trained) and is fed correlated, off-policy data. Replay re-weights and decorrelates the data; the target network freezes the backup operator for a stretch, turning a moving-target chase into a sequence of stationary supervised problems. Neither restores a true contraction — they buy enough stability for gradient descent to win in practice.

From Q-learning to a deep Q-network

Replace the Q-table with a network $Q_\theta(s,a)$ and fit it to the sampled Bellman optimality target. DQN minimizes, over transitions $(s,a,r,s')$ drawn from a replay buffer $\mathcal{D}$,

$$L(\theta) = \E_{(s,a,r,s')\sim\mathcal{D}}\Big[\big(\, r + \discount \max_{a'} Q_{\theta^-}(s',a') - Q_\theta(s,a)\,\big)^2\Big],$$

where $\theta^-$ are the target-network weights. As in Chapter 5 this is a semi-gradient method: we differentiate only $Q_\theta(s,a)$, treating the target as fixed. Two structural problems would sink naive online training, and DQN answers each:

Correlated data. Consecutive transitions in a trajectory are highly correlated, violating the i.i.d. assumption gradient descent leans on.
A moving target. The regression target uses the same network being updated, so every gradient step shifts the target it was chasing — the bootstrap instability of the triad.
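The loss above can be sketched numerically without any deep-learning library. This is a minimal illustration, not the companion's `td_target`: the function names, the tuple layout of a batch item, and the default `gamma` are assumptions made here. The `done` mask (no bootstrap after a terminal transition) does not appear in the loss as written; it is included because the companion tests describe a done-masked target. The target is a plain number, which mirrors the semi-gradient treatment: nothing would be differentiated through it.

```python
from typing import List, Sequence, Tuple


def td_target(reward: float, next_q_target: Sequence[float],
              done: bool, gamma: float = 0.99) -> float:
    """Sampled Bellman optimality target r + gamma * max_a' Q_target(s', a').

    next_q_target holds the target network's action-values at s'.
    A terminal transition (done=True) does not bootstrap.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_target)


# One batch item: (Q_theta(s, a), reward, Q_target(s', .), done) -- hypothetical layout.
BatchItem = Tuple[float, float, Sequence[float], bool]


def dqn_loss(batch: List[BatchItem], gamma: float = 0.99) -> float:
    """Mean squared TD error over a minibatch; the target is treated as a constant."""
    errors = [td_target(r, next_q, done, gamma) - q_sa
              for q_sa, r, next_q, done in batch]
    return sum(e * e for e in errors) / len(errors)
```

In a real training loop the same arithmetic would run on tensors, with the target computed under a no-gradient context so that only $Q_\theta(s,a)$ receives gradients.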

Experience replay

The first fix stores each transition $(s,a,r,s')$ in a fixed-capacity buffer $\mathcal{D}$ and trains on uniformly sampled minibatches rather than the latest transition. Three benefits follow: sampling across many past episodes decorrelates the minibatch toward the i.i.d. regime; each transition is reused many times, improving sample efficiency; and the update distribution becomes the buffer’s mixture rather than the current policy’s trajectory, re-weighting away from the pathological off-policy distributions that drive divergence.

The idea is old — it is the model-free, sampled descendant of the prioritized sweeping of Chapter 2 — and its prioritized variant returns shortly.
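A fixed-capacity buffer with uniform sampling can be sketched in a few lines. The class below is an illustrative stand-in, not the companion's `ReplayBuffer`; the attribute names and the choice to sample without replacement are assumptions.

```python
import random
from typing import Any, List


class ReplayBuffer:
    """Fixed-capacity circular buffer: the oldest transition is overwritten first."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.storage: List[Any] = []
        self.next_index = 0  # slot the next push will write to

    def push(self, transition: Any) -> None:
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.next_index] = transition  # overwrite the oldest entry
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size: int, rng: Any = random) -> List[Any]:
        """Uniform minibatch, drawn without replacement (an assumption of this sketch)."""
        return rng.sample(self.storage, batch_size)

    def __len__(self) -> int:
        return len(self.storage)
```

Pushing more transitions than the capacity silently discards the oldest ones, so the buffer always holds the most recent `capacity` transitions, whatever mixture of past policies produced them.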

Target networks

The second fix freezes the bootstrap. DQN keeps a separate copy $Q_{\theta^-}$ of the network, holds it fixed while training $Q_\theta$ , and refreshes $\theta^- \leftarrow \theta$ only every $C$ steps (a hard update) or by a slow Polyak average $\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-$ (a soft update). Between refreshes the target $r + \discount\max_{a'}Q_{\theta^-}(s',a')$ is a fixed function, so each interval is an ordinary supervised regression toward a stationary target — exactly the stability the moving-target triad destroyed.

The slower $\theta^-$ moves, the more stable and the slower the learning: the central DQN tuning trade-off.
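The two refresh rules can be written out on a flat list of weights. This is a sketch under simplifying assumptions: parameters are plain Python floats rather than network tensors, and the function names are chosen here for illustration.

```python
from typing import List


def hard_update(online: List[float]) -> List[float]:
    """Hard update: theta_minus <- theta (an independent copy of the online weights)."""
    return list(online)


def soft_update(target: List[float], online: List[float], tau: float) -> List[float]:
    """Polyak average: theta_minus <- tau * theta + (1 - tau) * theta_minus."""
    return [tau * w + (1.0 - tau) * w_minus
            for w, w_minus in zip(online, target)]
```

With `tau = 1.0` the soft rule reduces to a hard update, and with a small `tau` the target trails the online weights slowly, which is the stability-versus-speed trade-off in numerical form.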

Overestimation and Double DQN

One bias survives even with replay and a target network: the $\max$ in the target overestimates. Because the network’s action-values are noisy estimates, the expected value of their maximum is systematically higher than the maximum of the true values.

Proposition 6.1 (The max overestimates).

Let $\widehat{Q}(s',\cdot)$ be unbiased estimates of the true values $q(s',\cdot)$, i.e. $\E[\widehat{Q}(s',a)] = q(s',a)$ for each $a$. Then

$$\E\big[\max_{a}\widehat{Q}(s',a)\big] \;\ge\; \max_{a}\E\big[\widehat{Q}(s',a)\big] = \max_a q(s',a),$$

with strict inequality whenever, with positive probability, some action’s estimate exceeds the estimate of an action with the highest true value, that is, whenever noise can change which action attains the maximum. The bootstrap target therefore inherits a positive bias.

Proof.

The function $x \mapsto \max_a x_a$ is convex (a pointwise maximum of linear maps). Jensen’s inequality for a convex function gives $\E[\max_a \widehat{Q}(s',a)] \ge \max_a \E[\widehat{Q}(s',a)]$ , and the right side is $\max_a q(s',a)$ by unbiasedness. Jensen is an equality only where the convex function is affine along the distribution’s support; the $\max$ has a kink exactly where the maximizing action changes, so any noise that makes the $\arg\max$ random makes the inequality strict. $\qquad\blacksquare$
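The proposition can be checked with a small Monte Carlo experiment. The setup is an assumption made for illustration: independent Gaussian noise is added to each true value, and the function name, seed, and trial count are arbitrary.

```python
import random
from typing import Sequence


def mean_of_max(true_values: Sequence[float], noise_std: float,
                trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of E[max_a Q_hat(a)] with Q_hat(a) = q(a) + Gaussian noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(q + rng.gauss(0.0, noise_std) for q in true_values)
    return total / trials


# Four actions with identical true value 0: max_a q(a) = 0, yet the mean of the
# noisy maximum is clearly positive.
tied = mean_of_max([0.0, 0.0, 0.0, 0.0], noise_std=1.0, trials=20000)

# Two well-separated actions: the noise almost never changes the argmax,
# so the bias is close to zero.
separated = mean_of_max([0.0, 10.0], noise_std=1.0, trials=20000)

print(f"tied actions:      E[max] ~ {tied:.3f} vs max q = 0")
print(f"separated actions: E[max] ~ {separated:.3f} vs max q = 10")
```

The tied case is the worst one for the bias (the analytical expected maximum of four independent standard normals is about 1.03), while the separated case shows the inequality becoming nearly tight once the maximizer is effectively deterministic.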

Double DQN van Hasselt et al. (2016) reduces this bias by decoupling action selection from evaluation: pick the next action with the online network but evaluate it with the target network,

$$y^{\text{Double}} = r + \discount\, Q_{\theta^-}\!\big(s',\, \argmax_{a'} Q_\theta(s',a')\big).$$

The selecting and evaluating estimates now have only partially correlated errors (the target network is a lagged copy of the online network, so they are not fully independent), and an action that noise makes look best to one network is not automatically overvalued by the other. It is a one-line change that measurably improves scores.
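The difference between the plain and Double targets is easiest to see side by side. The sketch below works on plain lists of action-values; the function names and the `done` handling are assumptions of this illustration, not the companion's API.

```python
from typing import Sequence


def plain_target(reward: float, next_q_target: Sequence[float],
                 done: bool, gamma: float = 0.99) -> float:
    """Standard DQN target: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward: float, next_q_online: Sequence[float],
                      next_q_target: Sequence[float],
                      done: bool, gamma: float = 0.99) -> float:
    """Double DQN target: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]
```

For the same pair of networks the Double target can never exceed the plain target, because the target network's value at any single action is at most its own maximum; the two coincide when both networks agree on the greedy action.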

Dueling, prioritized replay, and Rainbow

Three further refinements round out the family. The dueling architecture Wang et al. (2016) splits the network into a state-value stream $V(s)$ and an advantage stream $A(s,a)$ , recombined as $Q(s,a) = V(s) + \big(A(s,a) - \tfrac{1}{\lvert\actionspace\rvert}\sum_{a'}A(s,a')\big)$ , so the agent can learn a state is good without estimating every action precisely. Prioritized experience replay Schaul et al. (2016) samples transitions in proportion to their Bellman-residual magnitude — Chapter 2’s prioritized sweeping, now over replayed transitions — with importance weights to correct the sampling bias. Rainbow Hessel et al. (2018) combines six such improvements (double, dueling, prioritized replay, multi-step returns, distributional values, and noisy exploration) and ablates each, showing they are largely complementary.
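Two of these refinements reduce to short formulas that can be sketched directly. The dueling recombination follows the expression above. For prioritized replay, the sketch assumes the proportional form from Schaul et al. (2016): priorities $|\delta_i| + \epsilon$ raised to a power $\alpha$, with importance weights $(N \cdot P(i))^{-\beta}$ normalized by their maximum. The function names and the default values of `alpha`, `beta`, and `eps` are illustrative choices, not taken from the chapter or the companion.

```python
from typing import List, Sequence


def dueling_q(value: float, advantages: Sequence[float]) -> List[float]:
    """Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a'))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (adv - mean_adv) for adv in advantages]


def priority_probs(td_errors: Sequence[float], alpha: float = 0.6,
                   eps: float = 1e-6) -> List[float]:
    """Sampling probabilities proportional to (|TD error| + eps) ** alpha."""
    priorities = [(abs(delta) + eps) ** alpha for delta in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs: Sequence[float], beta: float = 0.4) -> List[float]:
    """Weights (N * P(i)) ** -beta, scaled so the largest weight is 1."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    largest = max(raw)
    return [w / largest for w in raw]
```

Subtracting the mean advantage makes the decomposition identifiable: the action-values returned by `dueling_q` always average to `value`. In the prioritized sampler, transitions with larger TD error are drawn more often and receive smaller loss weights, which is the bias correction the paragraph refers to.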

From states to pixels

DQN’s headline result was learning from raw Atari frames on the Arcade Learning Environment Bellemare et al. (2013). The pixel pipeline adds its own engineering: grayscale and downsample each frame, stack four frames so velocity is observable (a single frame is not Markov), repeat each selected action over several skipped frames, clip rewards to $\{-1,0,+1\}$, and read the stack with a convolutional network. None of this changes the algorithm; it changes the features the same loss is regressed on.
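Two of the pipeline steps, reward clipping and frame stacking, can be sketched without any image library. Frames are treated as opaque objects here, grayscale conversion and downsampling are omitted, and repeating the first frame to fill the stack at reset is a common convention assumed for this sketch. The names `clip_reward` and `FrameStack` are hypothetical.

```python
from collections import deque
from typing import Any, Deque, List


def clip_reward(reward: float) -> int:
    """Clip a raw game reward to its sign: one of -1, 0, +1."""
    return (reward > 0) - (reward < 0)


class FrameStack:
    """Keep the most recent num_frames preprocessed frames as one observation."""

    def __init__(self, num_frames: int = 4) -> None:
        self.num_frames = num_frames
        self.frames: Deque[Any] = deque(maxlen=num_frames)

    def reset(self, first_frame: Any) -> List[Any]:
        """Start an episode by filling the stack with copies of the first frame."""
        self.frames.clear()
        for _ in range(self.num_frames):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, frame: Any) -> List[Any]:
        """Append the newest frame; the oldest one is dropped automatically."""
        self.frames.append(frame)
        return list(self.frames)
```

The stacked observation, ordered oldest to newest, is what the convolutional network would read in place of a single frame.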

Reliable comparison across all this machinery is itself hard — reproducibility studies and standardized implementations Raffin et al. (2021) exist precisely because small details swing results, a theme Week 8 returns to.

The dynamic-programming bridge

DQN is the deadly triad (Chapter 5) survived by engineering rather than dissolved by theory. The regression target is still the sampled Bellman optimality backup of Chapter 1; the two additions each blunt one edge of the triad. Replay re-weights and decorrelates the update distribution — the off-policy edge — and is the model-free heir to prioritized sweeping (Chapter 2). The target network freezes the bootstrap operator for $C$ steps — the moving-target edge — converting divergent fixed-point chasing into a sequence of stationary regressions. The discount $\discount$ that guaranteed convergence with a table now only bounds the per-interval target; stability is bought, not proved.

What’s next

Week 7 changes the object of optimization entirely: instead of learning values and acting greedily, policy-gradient methods parameterize and optimize the policy directly, sidestepping the $\max$ and its overestimation, and extending naturally to continuous actions.

Exercises

(Prove) Show $\E[\max_a \widehat{Q}(s',a)] \ge \max_a q(s',a)$ for unbiased $\widehat{Q}$ , and state when the inequality is strict (Prop. 6.1).

Solution
$\max_a$ is convex, so Jensen gives $\E[\max_a\widehat{Q}(s',a)] \ge \max_a\E[\widehat{Q}(s',a)] = \max_a q(s',a)$ . It is strict whenever the noise makes the $\arg\max$ random (two actions’ estimates overlap with positive probability), because the $\max$ is non-affine across the kink where the maximizer switches.
(Derive) Write the Double DQN target and explain why it reduces the overestimation of Proposition 6.1.

Solution
$y^{\text{Double}} = r + \discount Q_{\theta^-}(s', \argmax_{a'}Q_\theta(s',a'))$ . Selection uses $Q_\theta$ , evaluation uses $Q_{\theta^-}$ ; their estimation errors are (largely) independent, so the action chosen as best by one network is not automatically assigned an inflated value by the same network. The $\E[\max]$ bias becomes the much smaller bias of evaluating a possibly-suboptimal action.
(Compute) A replay buffer of capacity 3 receives transitions $t_1,\dots,t_5$ in order. Which are stored after $t_5$ , and why does a hard target update at step $C$ leave $\theta^-$ momentarily equal to $\theta$ ?

Solution
A circular buffer of capacity 3 keeps the three most recent, $\{t_3,t_4,t_5\}$ ( $t_1,t_2$ overwritten). A hard update copies $\theta^- \leftarrow \theta$ , so immediately after step $C$ the two networks are identical; they diverge again as $\theta$ updates over the next $C$ steps while $\theta^-$ is held fixed.
(Implement) In the companion, verify the replay buffer, target-update, and TD-target components, and that DQN learns CartPole well above the random-return baseline within a fixed step budget.

Solution
See experiments/python/week06/test_dqn.py: circular-buffer overwrite and sample shapes; hard/soft target updates; the done-masked TD target and the Double-DQN selection/evaluation split; the empirical overestimation of the plain max; and a seeded CartPole run whose mean return rises far above the ~22 random baseline.
(Extend) Add Double and Dueling variants and measure the overestimation gap (the plain-max target minus the Double target) over training.

Solution
The companion exposes double and dueling flags; the plain-max target sits above the Double target early in training (when value estimates are noisiest) and the gap shrinks as the network sharpens — the empirical face of Proposition 6.1.

Companion code

The Week-6 companion lives at experiments/python/week06/ and is the course’s first PyTorch code. Its correctness suite follows the repo’s deep-RL convention: fast, deterministic component tests for the pieces, plus a seeded simple-environment convergence check. Heavy pixel environments are a deferred @slow showcase, not a graded test, in line with the 8 GB GPU budget.

dqn.py — a minimal DQN on CartPole-v1: a circular ReplayBuffer, an MLP QNetwork (and a DuelingQNetwork), an exposed td_target (plain and Double), hard/soft target updates, and the training loop, with double/dueling flags.
test_dqn.py — component-correctness tests (replay overwrite + sample shapes; target-network hard/soft updates; the done-masked Bellman target; the Double-DQN selection/evaluation split; the max-overestimation of Prop. 6.1) plus a seeded CartPole run asserting the mean return clears the random baseline by a wide margin.

# component tests + a seeded CartPole learning check (PyTorch; ~1-2 min on CPU)
PYTHONPATH=. pytest experiments/python/week06/test_dqn.py -q

# worked CartPole training run (prints the learning curve summary)
PYTHONPATH=. python experiments/python/week06/dqn.py --double --episodes 400