Part I · Foundations Week 3 Published mc.py test_mc.py

Monte Carlo Methods

Estimating value from sampled returns when the model is unknown: first-visit Monte Carlo prediction, Monte Carlo control by generalized policy iteration, and off-policy learning by importance sampling. Monte Carlo estimation as quadrature — unbiased, model-free, and high-variance, with the fragility of off-policy correction.

On this page
  1. From a known model to sampled returns
  2. First-visit Monte Carlo prediction
  3. Monte Carlo control
  4. Off-policy prediction and importance sampling
  5. The dynamic-programming bridge
  6. What’s next
  7. Exercises
  8. Companion code

Monte Carlo Methods

Where we are. Chapters 1 and 2 assumed the model p\transition was known and computed value functions by iterating the Bellman operator. This chapter takes the first step of reinforcement learning proper: the model is unknown, and value is estimated from sampled experience. The load-bearing shift is one substitution — replace the model expectation inside the value definition with an empirical average over complete sampled returns. The estimate is unbiased and needs no model, but it pays for that in variance and requires episodic tasks that terminate.

From a known model to sampled returns

The definition of the state-value function (Chapter 1) is an expectation over trajectories,

vπ(s)=Eπ ⁣[GtSt=s],Gt=k=0Tt1γkRt+k+1,\valuefn_\policy(s) = \E_\policy\!\big[\, G_t \,\big|\, S_t = s \,\big], \qquad G_t = \sum_{k=0}^{T-t-1} \discount^{k} R_{t+k+1},

where an episode runs from tt to a terminal time TT. Dynamic programming never sampled GtG_t; it used the model to turn this expectation into the Bellman recursion. Monte Carlo does the opposite: it leaves the expectation alone and estimates it the way one estimates any expectation without a closed form — draw samples and average.

That reframing — value estimation as quadrature — sets the terms for the whole chapter. The sample mean of returns is unbiased regardless of dimension, and its error falls like 1/N1/\sqrt{N} in the number of episodes NN; the variance σ2\sigma^2 of the return is the price of having no model. Two structural requirements follow: episodes must terminate (an infinite return has no sample), and estimating vπ\valuefn_\policy requires acting under π\policy — or correcting for the fact that we did not, which is the off-policy problem below.

First-visit Monte Carlo prediction

To estimate vπ\valuefn_\policy, generate episodes under π\policy and, for each state, average the returns that followed visits to it. The first-visit variant averages only the return following the first time each state is reached in an episode, which gives one independent draw per episode (every-visit MC reuses correlated within-episode returns).

Definition 3.1 (First-visit Monte Carlo estimator).

Run NN episodes under π\policy. For state ss, let I(s)\mathcal{I}(s) index the episodes in which ss is visited, and for episode iI(s)i \in \mathcal{I}(s) let G(i)(s)G^{(i)}(s) be the return from the first visit to ss. The first-visit Monte Carlo estimate is the sample mean

VN(s)1I(s)iI(s)G(i)(s).V_N(s) \defeq \frac{1}{\lvert\mathcal{I}(s)\rvert} \sum_{i \in \mathcal{I}(s)} G^{(i)}(s).
Proposition 3.1 (Unbiasedness and consistency).

Each first-visit return G(i)(s)G^{(i)}(s) is an independent sample of (GtSt=s)(G_t \mid S_t = s) under π\policy. Hence VN(s)V_N(s) is unbiased, E[VN(s)]=vπ(s)\E[V_N(s)] = \valuefn_\policy(s), and by the strong law of large numbers VN(s)vπ(s)V_N(s) \to \valuefn_\policy(s) almost surely as I(s)\lvert\mathcal{I}(s)\rvert \to \infty.

Proof.

Fix ss. In each episode the first visit to ss occurs at some time tt, and the return from that point, G(i)(s)G^{(i)}(s), is by construction a draw of the random variable GtG_t conditioned on St=sS_t = s and on following π\policy thereafter — so its expectation is exactly vπ(s)\valuefn_\policy(s) by the definition above. Different episodes are generated independently, and using only the first visit means each episode contributes one draw that does not depend on the others (every-visit sampling would reuse correlated within-episode returns). The G(i)(s)G^{(i)}(s) are thus i.i.d. with mean vπ(s)\valuefn_\policy(s); a sample mean of i.i.d. draws is unbiased, and the strong law gives almost-sure convergence. \qquad\blacksquare

The estimator carries no bias and no model — it never references p\transition, only realized returns. Its weakness is variance: a single return aggregates all the randomness of a whole trajectory, so σ2\sigma^2 can be large and convergence is only 1/N1/\sqrt{N}. Sutton & Barto Sutton & Barto (2018) treat first- and every-visit MC and the bias each incurs; every-visit MC is biased in finite samples (its within-episode returns overlap, so the averaged samples are correlated, though the bias vanishes as NN\to\infty) but also consistent, and is often simpler to implement.

Monte Carlo control

Prediction estimates vπ\valuefn_\policy; control seeks v\optvaluefn. Monte Carlo control is generalized policy iteration (Chapter 2) with the evaluation step done by sampling: estimate the action-value qπ\qfn_\policy from returns, then improve the policy greedily, π(s)=arg maxaqπ(s,a)\policy'(s) = \argmax_a \qfn_\policy(s,a).

The catch is exploration. With no model, an action that is never tried has no return to average, so its value is unknown and greedy improvement can lock onto a wrong choice. Two standard fixes guarantee every action keeps being sampled:

  • Exploring starts — begin episodes from a random state–action pair, so every (s,a)(s,a) seeds infinitely many returns.
  • ε\varepsilon-soft policies — keep the behaviour policy stochastic (π(as)ε/A\policy(a\mid s) \ge \varepsilon/\lvert\actionspace\rvert for all aa), so no action is ever starved; improvement then converges to the best ε\varepsilon-soft policy rather than the unconstrained optimum.

Either way the GPI logic of Chapter 1 carries over — evaluate, improve, repeat — with the policy-improvement theorem still guaranteeing monotone improvement at each step that uses an accurate q\qfn estimate. The convergence of exploring-starts MC control is taken as a working assumption here (it is not fully settled in general); Sutton & Barto Sutton & Barto (2018) discuss the subtlety.

Off-policy prediction and importance sampling

Often we must estimate vπ\valuefn_\policy for a target policy π\policy while the data was generated by a different behaviour policy bb — for instance to evaluate a greedy policy from exploratory data. Naively averaging returns from bb estimates vb\valuefn_b, not vπ\valuefn_\policy. Importance sampling corrects the distribution mismatch by reweighting each return.

Definition 3.2 (Importance-sampling ratio and estimators).

For a trajectory from tt to termination TT, the importance-sampling ratio is the likelihood ratio of the action choices (the dynamics cancel, being shared),

ρt:T1k=tT1π(AkSk)b(AkSk).\rho_{t:T-1} \defeq \prod_{k=t}^{T-1} \frac{\policy(A_k \mid S_k)}{b(A_k \mid S_k)} .

Over the first-visit returns G(i)G^{(i)} with ratios ρ(i)\rho^{(i)}, the ordinary and weighted importance-sampling estimators of vπ(s)\valuefn_\policy(s) are

VNord(s)=1I(s)iρ(i)G(i),VNwt(s)=iρ(i)G(i)iρ(i).V^{\text{ord}}_N(s) = \frac{1}{\lvert\mathcal{I}(s)\rvert}\sum_{i} \rho^{(i)} G^{(i)}, \qquad V^{\text{wt}}_N(s) = \frac{\sum_{i} \rho^{(i)} G^{(i)}}{\sum_{i} \rho^{(i)}} .
Proposition 3.2 (Ordinary IS is unbiased; weighted IS is consistent).

Assuming coverage (b(as)>0b(a\mid s) > 0 whenever π(as)>0\policy(a\mid s) > 0), ordinary importance sampling is unbiased, Eb[ρt:T1GtSt=s]=vπ(s)\E_b[\rho_{t:T-1} G_t \mid S_t = s] = \valuefn_\policy(s). Weighted importance sampling is biased in finite samples but consistent (VNwtvπV^{\text{wt}}_N \to \valuefn_\policy), and typically has far lower variance.

Proof.

The probability of a trajectory’s action sequence under π\policy equals ρt:T1\rho_{t:T-1} times its probability under bb, because the environment factors p(s,rs,a)\transition(s',r\mid s,a) appear identically in both and cancel — only the policy factors survive in the ratio. This is a change of measure: for any function ff of the trajectory, Eb[ρt:T1f]=Eπ[f]\E_b[\rho_{t:T-1}\, f] = \E_\policy[f]. Taking f=Gtf = G_t gives Eb[ρt:T1GtSt=s]=Eπ[GtSt=s]=vπ(s)\E_b[\rho_{t:T-1} G_t \mid S_t = s] = \E_\policy[G_t\mid S_t=s] = \valuefn_\policy(s), so the ordinary estimator (a sample mean of ρ(i)G(i)\rho^{(i)}G^{(i)}) is unbiased. The weighted estimator divides by iρ(i)\sum_i \rho^{(i)} rather than the count; as a ratio of two correlated sample means it is biased at finite NN, but both means converge (1Nρ(i)G(i)vπ\frac1N\sum\rho^{(i)}G^{(i)}\to\valuefn_\policy and 1Nρ(i)1\frac1N\sum\rho^{(i)}\to 1), so the ratio converges to vπ(s)\valuefn_\policy(s). \qquad\blacksquare

The two estimators sit at opposite ends of the bias–variance axis, and the variance end is where off-policy Monte Carlo becomes fragile. The ratio ρt:T1\rho_{t:T-1} is a product over the episode; if π\policy and bb differ enough — or the horizon is long — the product swings across orders of magnitude, and the variance of ordinary IS can be unbounded: a single rare trajectory with a huge ratio dominates the average.

Weighted IS caps this — its estimate can never exceed the largest observed return — trading a vanishing bias for a decisive variance reduction, which is why it is the practical default. Both degrade as the horizon grows and more ratio factors multiply in; this fragility is a major reason the field leans on the lower-variance, bootstrapped methods of Week 4.

The dynamic-programming bridge

Monte Carlo and dynamic programming estimate the same object vπ\valuefn_\policy from opposite information. DP needs the model and bootstraps — each value is written in terms of other current estimates — giving low variance but model dependence and bias while the estimates are wrong. MC needs only the ability to sample episodes and never bootstraps — each value is an average of full returns — giving no model dependence and no bias but high variance, and only for terminating tasks. Plotted on two axes, bootstrapping and sampling, DP samples nothing and bootstraps fully; MC samples fully and bootstraps not at all. Week 4’s temporal-difference learning is the missing corner: sample like MC, bootstrap like DP, and inherit a blend of their bias and variance.

What’s next

  • Week 4 introduces temporal-difference learning: replace the full sampled return GtG_t with the one-step bootstrap Rt+1+γV(St+1)R_{t+1} + \discount V(S_{t+1}), fusing Monte Carlo sampling with the dynamic-programming backup and removing the need to wait for an episode to terminate.
  • Week 5 confronts what happens when VV is a parametric approximation rather than a table, where bootstrapping and off-policy data interact dangerously.

Exercises

  1. (Prove) Show the first-visit Monte Carlo estimator is unbiased for vπ(s)\valuefn_\policy(s) for every fixed sample size, citing where independence across episodes is used (Prop. 3.1).

    Solution

    Each first-visit return G(i)(s)G^{(i)}(s) has E[G(i)(s)]=vπ(s)\E[G^{(i)}(s)] = \valuefn_\policy(s) by the definition of the value as the expected return from ss under π\policy. The estimator is the mean of I(s)\lvert\mathcal{I}(s)\rvert such draws, so by linearity E[VN(s)]=vπ(s)\E[V_N(s)] = \valuefn_\policy(s) — unbiased at any NN. Independence (one draw per episode, first visit only) is not needed for unbiasedness but is what makes the variance σ2/I(s)\sigma^2/\lvert\mathcal{I}(s)\rvert and licenses the strong law for consistency.

  2. (Derive) Starting from the trajectory likelihood under π\policy and bb, derive the importance-sampling ratio and show Eb[ρt:T1GtSt=s]=vπ(s)\E_b[\rho_{t:T-1} G_t \mid S_t=s] = \valuefn_\policy(s) (Prop. 3.2).

    Solution

    The probability of At,St+1,,STA_t,S_{t+1},\dots,S_T given St=sS_t=s factors as kπ-or-b(AkSk)p(Sk+1,Rk+1Sk,Ak)\prod_k \pi\text{-or-}b(A_k\mid S_k)\,\transition(S_{k+1},R_{k+1}\mid S_k,A_k). Dividing the π\policy-likelihood by the bb-likelihood, every p\transition factor cancels, leaving ρt:T1=k=tT1π(AkSk)/b(AkSk)\rho_{t:T-1} = \prod_{k=t}^{T-1}\policy(A_k\mid S_k)/b(A_k\mid S_k). Then Eb[ρt:T1GtSt=s]=trajb(traj)ρG=trajπ(traj)G=Eπ[GtSt=s]=vπ(s)\E_b[\rho_{t:T-1}G_t\mid S_t=s] = \sum_{\text{traj}} b(\text{traj})\,\rho\,G = \sum_{\text{traj}}\policy(\text{traj})\,G = \E_\policy[G_t\mid S_t=s]=\valuefn_\policy(s).

  3. (Compute) Target π\policy is greedy (prob. 1 on action aa^\star); behaviour bb is uniform over two actions. For an episode of length 3 that happens to take aa^\star at every step, compute ρ0:2\rho_{0:2}. What is ρ\rho if any step takes the non-target action?

    Solution

    Each on-target step contributes π/b=1/(1/2)=2\policy/b = 1/(1/2) = 2, so ρ0:2=23=8\rho_{0:2} = 2^3 = 8. If any step takes the non-target action, π=0\policy = 0 there, so ρ=0\rho = 0 — that trajectory contributes nothing, the discrete face of the variance problem (a few high-ratio trajectories carry the whole estimate).

  4. (Implement) On the companion’s random-walk MDP, verify first-visit MC converges to the analytic vπ\valuefn_\policy from dp.policy_evaluation, and that ordinary IS is unbiased while weighted IS has lower variance across seeds.

    Solution

    See experiments/python/week03/test_mc.py: it computes vπ\valuefn_\policy exactly with the Week-1 linear solve, then asserts first-visit MC matches it within a sampling tolerance, the ordinary-IS mean across many seeds is unbiased, and — when the behaviour policy under-samples the reward path (the heavy-tailed regime; under a mild mismatch ordinary IS is already low-variance) — the weighted-IS empirical variance is below the ordinary-IS variance.

  5. (Extend) Make the behaviour policy progressively worse (further from π\policy) and measure how ordinary- and weighted-IS variance grow. Relate the blow-up to the product structure of ρt:T1\rho_{t:T-1}.

    Solution

    As bb diverges from π\policy, individual step ratios stray from 11 and their product’s variance compounds geometrically in the horizon; ordinary-IS variance grows fastest (it can be unbounded), weighted-IS more slowly (bounded by the largest return). The companion’s --mismatch sweep exhibits the monotone growth.

Companion code

The Week-3 companion lives at experiments/python/week03/ and reuses the Week-1 linear solve (dp.policy_evaluation) as the exact oracle against which the sampled estimates are checked — the core suite has no environment dependency.

  • randomwalk.py — builds a small episodic random-walk MDP (terminal states at both ends) as a generic (P, R) array pair, so vπ\valuefn_\policy is available in closed form from dp.policy_evaluation.
  • mc.py — episode sampling from a generic (P, R, policy), first-visit MC prediction, and off-policy prediction with both ordinary and weighted importance sampling. Pure NumPy.
  • test_mc.py — statistical-correctness tests: first-visit MC converges to the analytic vπ\valuefn_\policy within a sampling tolerance; ordinary IS is unbiased across seeds; weighted IS has strictly lower empirical variance.
  • blackjack.py — the canonical Sutton & Barto Monte-Carlo showcase: MC control on Gymnasium’s Blackjack-v1 (optional extra; skipped when Gymnasium is absent).
# core algorithms + correctness tests (pure NumPy, no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week03/test_mc.py -q

# canonical Blackjack MC-control showcase (optional: pip install "gymnasium")
PYTHONPATH=. python experiments/python/week03/blackjack.py