Part I · Foundations Week 3 Published mc.py test_mc.py

Monte Carlo Methods

Estimating value from sampled returns when the model is unknown: first-visit Monte Carlo prediction, Monte Carlo control by generalized policy iteration, and off-policy learning by importance sampling. The chapter treats Monte Carlo estimation as quadrature: unbiased and model-free but high-variance, with an off-policy correction that is fragile.



Where we are. Chapters 1 and 2 assumed the model $\transition$ was known and computed value functions by iterating the Bellman operator. This chapter takes the first step of reinforcement learning proper: the model is unknown, and value is estimated from sampled experience. The load-bearing shift is one substitution — replace the model expectation inside the value definition with an empirical average over complete sampled returns. The estimate is unbiased and needs no model, but it pays for that in variance and requires episodic tasks that terminate.

Chapter 3 — at a glance

Goal. Define the first-visit Monte Carlo estimator of $\valuefn_\policy$ and prove it unbiased and consistent; run Monte Carlo control as generalized policy iteration with sampled evaluation; and correct for off-policy data with importance sampling, seeing why that correction is fragile.

Reading time. ~35 minutes; ~55 with the proofs and exercises.

Key insight — the DP bridge. A value is an expectation, $\valuefn_\policy(s) = \E_\policy[G_t \mid S_t = s]$ . Dynamic programming evaluated that expectation analytically through the model; Monte Carlo evaluates it statistically by averaging sampled returns — the same target, estimated by quadrature instead of by the Bellman backup. Dropping the model removes bias but injects variance and forces episodes to terminate; Week 4’s temporal-difference learning rebalances exactly this trade.

From a known model to sampled returns

The definition of the state-value function (Chapter 1) is an expectation over trajectories,

$$\valuefn_\policy(s) = \E_\policy\!\big[\, G_t \,\big|\, S_t = s \,\big], \qquad G_t = \sum_{k=0}^{T-t-1} \discount^{k} R_{t+k+1},$$

where an episode runs from $t$ to a terminal time $T$ . Dynamic programming never sampled $G_t$ ; it used the model to turn this expectation into the Bellman recursion. Monte Carlo does the opposite: it leaves the expectation alone and estimates it the way one estimates any expectation without a closed form — draw samples and average.

That reframing, value estimation as quadrature, sets the terms for the whole chapter. The sample mean of returns is unbiased regardless of dimension, and its standard error is $\sigma/\sqrt{N}$ in the number of episodes $N$, where $\sigma^2$ is the variance of the return; that variance is the price of having no model. Two structural requirements follow: episodes must terminate (the return of an episode that never ends can never be observed in full), and estimating $\valuefn_\policy$ requires acting under $\policy$, or correcting for the fact that we did not, which is the off-policy problem below.
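The $1/\sqrt{N}$ rate can be checked numerically. The sketch below is illustrative only: a normal distribution stands in for an unknown return distribution, and `sample_mean_with_stderr` is a hypothetical helper, not part of the companion code.

```python
import numpy as np

def sample_mean_with_stderr(samples):
    """Return the sample mean and its estimated standard error, sigma_hat / sqrt(N)."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    return samples.mean(), samples.std(ddof=1) / np.sqrt(n)

rng = np.random.default_rng(0)
# Assumed stand-in for a return distribution: mean 1.0, standard deviation 2.0.
returns = rng.normal(loc=1.0, scale=2.0, size=40_000)

mean_small, se_small = sample_mean_with_stderr(returns[:400])
mean_large, se_large = sample_mean_with_stderr(returns)

# 100x more samples should shrink the standard error by about sqrt(100) = 10.
ratio = se_small / se_large
print(f"N=400:   mean={mean_small:.3f}  stderr={se_small:.4f}")
print(f"N=40000: mean={mean_large:.3f}  stderr={se_large:.4f}  ratio={ratio:.1f}")
```

Halving the error therefore costs four times as many episodes, whatever the size of the state space.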

First-visit Monte Carlo prediction

To estimate $\valuefn_\policy$ , generate episodes under $\policy$ and, for each state, average the returns that followed visits to it. The first-visit variant averages only the return following the first time each state is reached in an episode, which gives one independent draw per episode (every-visit MC reuses correlated within-episode returns).

Definition 3.1 (First-visit Monte Carlo estimator).

Run $N$ episodes under $\policy$ . For state $s$ , let $\mathcal{I}(s)$ index the episodes in which $s$ is visited, and for episode $i \in \mathcal{I}(s)$ let $G^{(i)}(s)$ be the return from the first visit to $s$ . The first-visit Monte Carlo estimate is the sample mean

$$V_N(s) \defeq \frac{1}{\lvert\mathcal{I}(s)\rvert} \sum_{i \in \mathcal{I}(s)} G^{(i)}(s).$$

Proposition 3.1 (Unbiasedness and consistency).

Each first-visit return $G^{(i)}(s)$ is an independent sample of $(G_t \mid S_t = s)$ under $\policy$ . Hence $V_N(s)$ is unbiased, $\E[V_N(s)] = \valuefn_\policy(s)$ , and by the strong law of large numbers $V_N(s) \to \valuefn_\policy(s)$ almost surely as $\lvert\mathcal{I}(s)\rvert \to \infty$ .

Proof.

Fix $s$ . In each episode the first visit to $s$ occurs at some time $t$ , and the return from that point, $G^{(i)}(s)$ , is by construction a draw of the random variable $G_t$ conditioned on $S_t = s$ and on following $\policy$ thereafter — so its expectation is exactly $\valuefn_\policy(s)$ by the definition above. Different episodes are generated independently, and using only the first visit means each episode contributes one draw that does not depend on the others (every-visit sampling would reuse correlated within-episode returns). The $G^{(i)}(s)$ are thus i.i.d. with mean $\valuefn_\policy(s)$ ; a sample mean of i.i.d. draws is unbiased, and the strong law gives almost-sure convergence. $\qquad\blacksquare$

The estimator carries no bias and no model: it never references $\transition$, only realized returns. Its weakness is variance: a single return aggregates all the randomness of a whole trajectory, so $\sigma^2$ can be large and the error shrinks only like $1/\sqrt{N}$. Sutton & Barto (2018) treat both first- and every-visit MC. Every-visit MC is biased in finite samples (its within-episode returns overlap, so the averaged samples are correlated), but the bias vanishes as $N\to\infty$, so it is still consistent, and it is often simpler to implement.
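A minimal sketch of Definition 3.1 on a seven-state symmetric random walk. This is not the companion's mc.py: the environment, the random start state, and the function names are assumptions made for illustration. With reward $+1$ on reaching the right end and $\discount = 1$, the exact value of state $s$ is $s/6$, which gives a target to compare against.

```python
import numpy as np

def sample_episode(rng, start, n_states=7):
    """Symmetric random walk on states 0..n_states-1; both ends are terminal.
    Returns the visited non-terminal states and the reward that followed each."""
    states, rewards = [], []
    s = start
    while 0 < s < n_states - 1:
        s_next = s + (1 if rng.random() < 0.5 else -1)
        states.append(s)
        rewards.append(1.0 if s_next == n_states - 1 else 0.0)
        s = s_next
    return states, rewards

def first_visit_mc(n_episodes, gamma=1.0, n_states=7, seed=0):
    rng = np.random.default_rng(seed)
    return_sum = np.zeros(n_states)
    visit_count = np.zeros(n_states)
    for _ in range(n_episodes):
        start = int(rng.integers(1, n_states - 1))  # assumed: uniform non-terminal start
        states, rewards = sample_episode(rng, start, n_states)
        # Backward pass: G_t = R_{t+1} + gamma * G_{t+1}.
        g = 0.0
        returns = [0.0] * len(states)
        for t in reversed(range(len(states))):
            g = rewards[t] + gamma * g
            returns[t] = g
        # Keep only the return that follows the first visit to each state.
        seen = set()
        for t, s in enumerate(states):
            if s not in seen:
                seen.add(s)
                return_sum[s] += returns[t]
                visit_count[s] += 1
    # States never visited (the terminals) keep the value 0.
    return return_sum / np.maximum(visit_count, 1)

V = first_visit_mc(20_000)
print(np.round(V[1:6], 3))          # sampled estimate
print(np.round(np.arange(1, 6) / 6, 3))  # exact values s/6
```

The `seen` set is the whole difference between the two variants: dropping it would accumulate every visit and turn this into every-visit MC.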

Monte Carlo control

Prediction estimates $\valuefn_\policy$ ; control seeks $\optvaluefn$ . Monte Carlo control is generalized policy iteration (Chapter 2) with the evaluation step done by sampling: estimate the action-value $\qfn_\policy$ from returns, then improve the policy greedily, $\policy'(s) = \argmax_a \qfn_\policy(s,a)$ .

The catch is exploration. With no model, an action that is never tried has no return to average, so its value is unknown and greedy improvement can lock onto a wrong choice. Two standard fixes guarantee every action keeps being sampled:

Exploring starts — begin episodes from a random state–action pair, so every $(s,a)$ seeds infinitely many returns.
$\varepsilon$ -soft policies — keep the behaviour policy stochastic ( $\policy(a\mid s) \ge \varepsilon/\lvert\actionspace\rvert$ for all $a$ ), so no action is ever starved; improvement then converges to the best $\varepsilon$ -soft policy rather than the unconstrained optimum.

Either way the GPI logic of Chapter 2 carries over (evaluate, improve, repeat), with the policy-improvement theorem still guaranteeing monotone improvement at each step that uses an accurate $\qfn$ estimate. The convergence of exploring-starts MC control is taken as a working assumption here (it is not fully settled in general); Sutton & Barto (2018) discuss the subtlety.
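The evaluate-improve loop with an $\varepsilon$-soft policy can be sketched as follows. The five-state corridor, the constants, and all function names are assumptions chosen to keep the example small; this is not the companion's blackjack.py. Evaluation is a first-visit average of returns for each $(s,a)$, and improvement is implicit because the next episode acts $\varepsilon$-greedily on the updated table.

```python
import numpy as np

N_STATES, LEFT, RIGHT = 5, 0, 1   # states 0 and 4 are terminal

def step(s, a):
    """Deterministic corridor: reward +1 only on reaching the right end."""
    s_next = s + (1 if a == RIGHT else -1)
    return s_next, (1.0 if s_next == N_STATES - 1 else 0.0)

def epsilon_greedy_action(rng, q_row, epsilon):
    """Epsilon-soft choice: every action has probability at least epsilon / |A|."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def mc_control(n_episodes=5_000, gamma=0.9, epsilon=0.2, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, 2))
    counts = np.zeros((N_STATES, 2))
    for _ in range(n_episodes):
        s = int(rng.integers(1, N_STATES - 1))   # assumed: random non-terminal start
        episode = []
        while 0 < s < N_STATES - 1:
            a = epsilon_greedy_action(rng, Q[s], epsilon)
            s_next, r = step(s, a)
            episode.append((s, a, r))
            s = s_next
        # Backward pass; the last write per key is the earliest time step,
        # so the dictionary ends up holding first-visit returns.
        g = 0.0
        first_return = {}
        for s_t, a_t, r_t in reversed(episode):
            g = r_t + gamma * g
            first_return[(s_t, a_t)] = g
        # Incremental sample mean of the returns for each visited (s, a).
        for (s_t, a_t), g_first in first_return.items():
            counts[s_t, a_t] += 1
            Q[s_t, a_t] += (g_first - Q[s_t, a_t]) / counts[s_t, a_t]
    return Q

Q = mc_control()
greedy = Q.argmax(axis=1)
print(np.round(Q[1:4], 3))
print(greedy[1:4])   # greedy action in each non-terminal state
```

Because the policy stays $\varepsilon$-soft throughout, the learned $\qfn$ values are those of the best $\varepsilon$-soft policy, not of the fully greedy one.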

Off-policy prediction and importance sampling

Often we must estimate $\valuefn_\policy$ for a target policy $\policy$ while the data was generated by a different behaviour policy $b$ — for instance to evaluate a greedy policy from exploratory data. Naively averaging returns from $b$ estimates $\valuefn_b$ , not $\valuefn_\policy$ . Importance sampling corrects the distribution mismatch by reweighting each return.

Definition 3.2 (Importance-sampling ratio and estimators).

For a trajectory from $t$ to termination $T$ , the importance-sampling ratio is the likelihood ratio of the action choices (the dynamics cancel, being shared),

$$\rho_{t:T-1} \defeq \prod_{k=t}^{T-1} \frac{\policy(A_k \mid S_k)}{b(A_k \mid S_k)} .$$

Over the first-visit returns $G^{(i)}$ with ratios $\rho^{(i)}$ , the ordinary and weighted importance-sampling estimators of $\valuefn_\policy(s)$ are

$$V^{\text{ord}}_N(s) = \frac{1}{\lvert\mathcal{I}(s)\rvert}\sum_{i} \rho^{(i)} G^{(i)}, \qquad V^{\text{wt}}_N(s) = \frac{\sum_{i} \rho^{(i)} G^{(i)}}{\sum_{i} \rho^{(i)}} .$$

Proposition 3.2 (Ordinary IS is unbiased; weighted IS is consistent).

Assuming coverage ( $b(a\mid s) > 0$ whenever $\policy(a\mid s) > 0$ ), ordinary importance sampling is unbiased, $\E_b[\rho_{t:T-1} G_t \mid S_t = s] = \valuefn_\policy(s)$ . Weighted importance sampling is biased in finite samples but consistent ( $V^{\text{wt}}_N \to \valuefn_\policy$ ), and typically has far lower variance.

Proof.

The probability of a trajectory’s action sequence under $\policy$ equals $\rho_{t:T-1}$ times its probability under $b$ , because the environment factors $\transition(s',r\mid s,a)$ appear identically in both and cancel — only the policy factors survive in the ratio. This is a change of measure: for any function $f$ of the trajectory, $\E_b[\rho_{t:T-1}\, f] = \E_\policy[f]$ . Taking $f = G_t$ gives $\E_b[\rho_{t:T-1} G_t \mid S_t = s] = \E_\policy[G_t\mid S_t=s] = \valuefn_\policy(s)$ , so the ordinary estimator (a sample mean of $\rho^{(i)}G^{(i)}$ ) is unbiased. The weighted estimator divides by $\sum_i \rho^{(i)}$ rather than the count; as a ratio of two correlated sample means it is biased at finite $N$ , but both means converge ( $\frac1N\sum\rho^{(i)}G^{(i)}\to\valuefn_\policy$ and $\frac1N\sum\rho^{(i)}\to 1$ ), so the ratio converges to $\valuefn_\policy(s)$ . $\qquad\blacksquare$
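Both estimators of Definition 3.2 fit in a few lines. The sketch below assumes a toy problem that is not from the companion code: three-step episodes from a single start state, two actions, reward $1$ whenever action 1 is taken, $\discount = 1$, target $\policy = (0.1, 0.9)$ and uniform behaviour $b$. The target value is then $3 \times 0.9 = 2.7$, while the naive average of returns estimates $\valuefn_b = 1.5$.

```python
import numpy as np

def importance_sampling_estimates(rho, returns):
    """Ordinary and weighted importance-sampling estimates from per-episode
    ratios rho^(i) and returns G^(i)."""
    rho = np.asarray(rho, dtype=float)
    returns = np.asarray(returns, dtype=float)
    weighted_returns = rho * returns
    ordinary = weighted_returns.mean()              # divide by the episode count
    total_weight = rho.sum()
    # Weighted IS is undefined when every ratio is zero; return 0.0 by convention.
    weighted = weighted_returns.sum() / total_weight if total_weight > 0 else 0.0
    return ordinary, weighted

rng = np.random.default_rng(0)
n_episodes, horizon = 20_000, 3
pi = np.array([0.1, 0.9])   # target policy (assumed, state-independent)
b = np.array([0.5, 0.5])    # behaviour policy (assumed, state-independent)

# Each row is one episode's action sequence, sampled under b.
actions = rng.choice(2, size=(n_episodes, horizon), p=b)
returns = actions.sum(axis=1).astype(float)        # reward 1 per step for action 1
rho = np.prod(pi[actions] / b[actions], axis=1)    # product of per-step ratios

naive = returns.mean()
ordinary, weighted = importance_sampling_estimates(rho, returns)
print(f"naive={naive:.3f}  ordinary={ordinary:.3f}  weighted={weighted:.3f}")
```

Only the policy probabilities enter `rho`; the environment dynamics never appear, which is the cancellation used in the proof.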

The two estimators sit at opposite ends of the bias–variance axis, and the variance end is where off-policy Monte Carlo becomes fragile. The ratio $\rho_{t:T-1}$ is a product over the episode; if $\policy$ and $b$ differ enough — or the horizon is long — the product swings across orders of magnitude, and the variance of ordinary IS can be unbounded: a single rare trajectory with a huge ratio dominates the average.

Weighted IS caps this: its estimate is a weighted average of the observed returns, so it can never exceed the largest of them in magnitude. It trades a vanishing bias for a decisive variance reduction, which is why it is the practical default. Both degrade as the horizon grows and more ratio factors multiply in; this fragility is a major reason the field leans on the lower-variance, bootstrapped methods of Week 4.
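A short worked calculation shows how the product structure drives the blow-up. It rests on two simplifying assumptions that are not part of the text above: a fixed horizon $H = T - t$, and a per-step second moment $c$ that is the same in every state. Under coverage, each per-step ratio has mean one and second moment at least one:

```latex
\E_b\!\left[\frac{\policy(A_k\mid S_k)}{b(A_k\mid S_k)} \,\middle|\, S_k = s\right]
  = \sum_a b(a\mid s)\,\frac{\policy(a\mid s)}{b(a\mid s)} = 1,
\qquad
c \defeq \E_b\!\left[\left(\frac{\policy(A_k\mid S_k)}{b(A_k\mid S_k)}\right)^{2} \,\middle|\, S_k = s\right]
  = \sum_a \frac{\policy(a\mid s)^2}{b(a\mid s)} \ge 1 .

% Conditioning step by step (tower property) multiplies the moments:
\E_b[\rho_{t:T-1}] = 1, \qquad
\E_b[\rho_{t:T-1}^2] = c^{H}, \qquad
\operatorname{Var}_b(\rho_{t:T-1}) = c^{H} - 1 .
```

For example, with $\policy = (0.1, 0.9)$ and uniform $b$ over two actions, $c = (0.01 + 0.81)/0.5 = 1.64$, so the variance of the ratio is about $3.4$ at $H = 3$ and about $2 \times 10^4$ at $H = 20$: geometric growth in the horizon unless $b = \policy$, where $c = 1$.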

The dynamic-programming bridge

Monte Carlo and dynamic programming estimate the same object $\valuefn_\policy$ from opposite information. DP needs the model and bootstraps: each value is written in terms of other current estimates, which gives no sampling variance but brings model dependence and bias while those estimates are wrong. MC needs only the ability to sample episodes and never bootstraps: each value is an average of full returns, which gives no model dependence and no bias but high variance, and works only for terminating tasks. Plotted on two axes, bootstrapping and sampling, DP samples nothing and bootstraps fully; MC samples fully and bootstraps not at all. Week 4’s temporal-difference learning is the missing corner: sample like MC, bootstrap like DP, and inherit a blend of their bias and variance.
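Written side by side, the two estimators differ only in how the expectation is evaluated. The display below is a sketch: the Monte Carlo line is Definition 3.1, and the DP line assumes the iterative policy-evaluation backup from the dynamic-programming chapters, with $V_k$ denoting the $k$-th iterate (a symbol introduced here for illustration).

```latex
\text{DP (model, bootstrap):}\quad
V_{k+1}(s) = \sum_{a} \policy(a\mid s) \sum_{s',r} \transition(s',r\mid s,a)\,\big[\, r + \discount\, V_k(s') \,\big],
\\[4pt]
\text{MC (samples, no bootstrap):}\quad
V_N(s) = \frac{1}{\lvert\mathcal{I}(s)\rvert} \sum_{i \in \mathcal{I}(s)} G^{(i)}(s).
```

The DP line contains $\transition$ and the current estimate $V_k$ on its right-hand side; the MC line contains neither, only realized returns.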

What’s next

Week 4 introduces temporal-difference learning: replace the full sampled return $G_t$ with the one-step bootstrap $R_{t+1} + \discount V(S_{t+1})$ , fusing Monte Carlo sampling with the dynamic-programming backup and removing the need to wait for an episode to terminate.
Week 5 confronts what happens when $V$ is a parametric approximation rather than a table, where bootstrapping and off-policy data interact dangerously.

Exercises

(Prove) Show the first-visit Monte Carlo estimator is unbiased for $\valuefn_\policy(s)$ for every fixed sample size, citing where independence across episodes is used (Prop. 3.1).

Solution
Each first-visit return $G^{(i)}(s)$ has $\E[G^{(i)}(s)] = \valuefn_\policy(s)$ by the definition of the value as the expected return from $s$ under $\policy$ . The estimator is the mean of $\lvert\mathcal{I}(s)\rvert$ such draws, so by linearity $\E[V_N(s)] = \valuefn_\policy(s)$ — unbiased at any $N$ . Independence (one draw per episode, first visit only) is not needed for unbiasedness but is what makes the variance $\sigma^2/\lvert\mathcal{I}(s)\rvert$ and licenses the strong law for consistency.
(Derive) Starting from the trajectory likelihood under $\policy$ and $b$ , derive the importance-sampling ratio and show $\E_b[\rho_{t:T-1} G_t \mid S_t=s] = \valuefn_\policy(s)$ (Prop. 3.2).

Solution
Given $S_t=s$, the probability of the trajectory $A_t,S_{t+1},\dots,S_T$ under a policy $\mu \in \{\policy, b\}$ factors as $\prod_{k=t}^{T-1} \mu(A_k\mid S_k)\,\transition(S_{k+1},R_{k+1}\mid S_k,A_k)$. Dividing the $\policy$-likelihood by the $b$-likelihood, every $\transition$ factor cancels, leaving $\rho_{t:T-1} = \prod_{k=t}^{T-1}\policy(A_k\mid S_k)/b(A_k\mid S_k)$. Then $\E_b[\rho_{t:T-1}G_t\mid S_t=s] = \sum_{\text{traj}} b(\text{traj})\,\rho\,G = \sum_{\text{traj}}\policy(\text{traj})\,G = \E_\policy[G_t\mid S_t=s]=\valuefn_\policy(s)$.
(Compute) Target $\policy$ is greedy (prob. 1 on action $a^\star$ ); behaviour $b$ is uniform over two actions. For an episode of length 3 that happens to take $a^\star$ at every step, compute $\rho_{0:2}$ . What is $\rho$ if any step takes the non-target action?

Solution
Each on-target step contributes $\policy/b = 1/(1/2) = 2$ , so $\rho_{0:2} = 2^3 = 8$ . If any step takes the non-target action, $\policy = 0$ there, so $\rho = 0$ — that trajectory contributes nothing, the discrete face of the variance problem (a few high-ratio trajectories carry the whole estimate).
(Implement) On the companion’s random-walk MDP, verify first-visit MC converges to the analytic $\valuefn_\policy$ from dp.policy_evaluation, and that ordinary IS is unbiased while weighted IS has lower variance across seeds.

Solution
See experiments/python/week03/test_mc.py: it computes $\valuefn_\policy$ exactly with the Week-1 linear solve, then asserts first-visit MC matches it within a sampling tolerance, the ordinary-IS mean across many seeds is unbiased, and — when the behaviour policy under-samples the reward path (the heavy-tailed regime; under a mild mismatch ordinary IS is already low-variance) — the weighted-IS empirical variance is below the ordinary-IS variance.
(Extend) Make the behaviour policy progressively worse (further from $\policy$ ) and measure how ordinary- and weighted-IS variance grow. Relate the blow-up to the product structure of $\rho_{t:T-1}$ .

Solution
As $b$ diverges from $\policy$ , individual step ratios stray from $1$ and their product’s variance compounds geometrically in the horizon; ordinary-IS variance grows fastest (it can be unbounded), weighted-IS more slowly (bounded by the largest return). The companion’s --mismatch sweep exhibits the monotone growth.
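The arithmetic in the (Compute) exercise, and the growth asked about in the (Extend) exercise, can be checked with a few lines. This is a hedged sketch, independent of the companion's --mismatch sweep: `importance_ratio` and `ratio_variance` are hypothetical helpers, and the closed form assumes a greedy target and a behaviour policy that picks $a^\star$ with the same probability at every step.

```python
import numpy as np

def importance_ratio(target_probs, behaviour_probs):
    """Product of per-step ratios pi(A_k|S_k) / b(A_k|S_k) along one trajectory."""
    target_probs = np.asarray(target_probs, dtype=float)
    behaviour_probs = np.asarray(behaviour_probs, dtype=float)
    return float(np.prod(target_probs / behaviour_probs))

# (Compute): greedy target, uniform behaviour over two actions, three steps.
rho_on_target = importance_ratio([1.0, 1.0, 1.0], [0.5, 0.5, 0.5])   # every step takes a*
rho_off_target = importance_ratio([1.0, 0.0, 1.0], [0.5, 0.5, 0.5])  # one step deviates

def ratio_variance(p_target_action, horizon):
    """Exact Var_b(rho) for a greedy target when b picks a* with probability p per step.
    rho equals p**-horizon with probability p**horizon and 0 otherwise,
    so E[rho] = 1 and E[rho^2] = p**-horizon."""
    return p_target_action ** (-horizon) - 1.0

# (Extend): variance of the ratio as b moves away from the target.
sweep = {p: ratio_variance(p, horizon=3) for p in (0.9, 0.7, 0.5)}
print(rho_on_target, rho_off_target)
print(sweep)
```

At $p = 0.5$ and three steps the variance is $2^3 - 1 = 7$, matching the picture in the (Compute) solution of one trajectory in eight carrying a ratio of $8$.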

Companion code

The Week-3 companion lives at experiments/python/week03/ and reuses the Week-1 linear solve (dp.policy_evaluation) as the exact oracle against which the sampled estimates are checked — the core suite has no environment dependency.

randomwalk.py — builds a small episodic random-walk MDP (terminal states at both ends) as a generic (P, R) array pair, so $\valuefn_\policy$ is available in closed form from dp.policy_evaluation.
mc.py — episode sampling from a generic (P, R, policy), first-visit MC prediction, and off-policy prediction with both ordinary and weighted importance sampling. Pure NumPy.
test_mc.py — statistical-correctness tests: first-visit MC converges to the analytic $\valuefn_\policy$ within a sampling tolerance; ordinary IS is unbiased across seeds; weighted IS has strictly lower empirical variance.
blackjack.py — the canonical Sutton & Barto Monte-Carlo showcase: MC control on Gymnasium’s Blackjack-v1 (optional extra; skipped when Gymnasium is absent).

# core algorithms + correctness tests (pure NumPy, no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week03/test_mc.py -q

# canonical Blackjack MC-control showcase (optional: pip install "gymnasium")
PYTHONPATH=. python experiments/python/week03/blackjack.py