Monte Carlo Methods
Estimating value from sampled returns when the model is unknown: first-visit Monte Carlo prediction, Monte Carlo control by generalized policy iteration, and off-policy learning by importance sampling. Monte Carlo estimation as quadrature — unbiased, model-free, and high-variance, with the fragility of off-policy correction.
On this page
Monte Carlo Methods
Where we are. Chapters 1 and 2 assumed the model was known and computed value functions by iterating the Bellman operator. This chapter takes the first step of reinforcement learning proper: the model is unknown, and value is estimated from sampled experience. The load-bearing shift is one substitution — replace the model expectation inside the value definition with an empirical average over complete sampled returns. The estimate is unbiased and needs no model, but it pays for that in variance and requires episodic tasks that terminate.
From a known model to sampled returns
The definition of the state-value function (Chapter 1) is an expectation over trajectories,
where an episode runs from to a terminal time . Dynamic programming never sampled ; it used the model to turn this expectation into the Bellman recursion. Monte Carlo does the opposite: it leaves the expectation alone and estimates it the way one estimates any expectation without a closed form — draw samples and average.
That reframing — value estimation as quadrature — sets the terms for the whole chapter. The sample mean of returns is unbiased regardless of dimension, and its error falls like in the number of episodes ; the variance of the return is the price of having no model. Two structural requirements follow: episodes must terminate (an infinite return has no sample), and estimating requires acting under — or correcting for the fact that we did not, which is the off-policy problem below.
First-visit Monte Carlo prediction
To estimate , generate episodes under and, for each state, average the returns that followed visits to it. The first-visit variant averages only the return following the first time each state is reached in an episode, which gives one independent draw per episode (every-visit MC reuses correlated within-episode returns).
Run episodes under . For state , let index the episodes in which is visited, and for episode let be the return from the first visit to . The first-visit Monte Carlo estimate is the sample mean
Each first-visit return is an independent sample of under . Hence is unbiased, , and by the strong law of large numbers almost surely as .
Fix . In each episode the first visit to occurs at some time , and the return from that point, , is by construction a draw of the random variable conditioned on and on following thereafter — so its expectation is exactly by the definition above. Different episodes are generated independently, and using only the first visit means each episode contributes one draw that does not depend on the others (every-visit sampling would reuse correlated within-episode returns). The are thus i.i.d. with mean ; a sample mean of i.i.d. draws is unbiased, and the strong law gives almost-sure convergence.
The estimator carries no bias and no model — it never references , only realized returns. Its weakness is variance: a single return aggregates all the randomness of a whole trajectory, so can be large and convergence is only . Sutton & Barto Sutton & Barto (2018) treat first- and every-visit MC and the bias each incurs; every-visit MC is biased in finite samples (its within-episode returns overlap, so the averaged samples are correlated, though the bias vanishes as ) but also consistent, and is often simpler to implement.
Monte Carlo control
Prediction estimates ; control seeks . Monte Carlo control is generalized policy iteration (Chapter 2) with the evaluation step done by sampling: estimate the action-value from returns, then improve the policy greedily, . The catch is exploration. With no model, an action that is never tried has no return to average, so its value is unknown and greedy improvement can lock onto a wrong choice. Two standard fixes guarantee every action keeps being sampled:
- Exploring starts — begin episodes from a random state–action pair, so every seeds infinitely many returns.
- -soft policies — keep the behaviour policy stochastic ( for all ), so no action is ever starved; improvement then converges to the best -soft policy rather than the unconstrained optimum.
Either way the GPI logic of Chapter 1 carries over — evaluate, improve, repeat — with the policy-improvement theorem still guaranteeing monotone improvement at each step that uses an accurate estimate. The convergence of exploring-starts MC control is taken as a working assumption here (it is not fully settled in general); Sutton & Barto Sutton & Barto (2018) discuss the subtlety.
Off-policy prediction and importance sampling
Often we must estimate for a target policy while the data was generated by a different behaviour policy — for instance to evaluate a greedy policy from exploratory data. Naively averaging returns from estimates , not . Importance sampling corrects the distribution mismatch by reweighting each return.
For a trajectory from to termination , the importance-sampling ratio is the likelihood ratio of the action choices (the dynamics cancel, being shared),
Over the first-visit returns with ratios , the ordinary and weighted importance-sampling estimators of are
Assuming coverage ( whenever ), ordinary importance sampling is unbiased, . Weighted importance sampling is biased in finite samples but consistent (), and typically has far lower variance.
The probability of a trajectory’s action sequence under equals times its probability under , because the environment factors appear identically in both and cancel — only the policy factors survive in the ratio. This is a change of measure: for any function of the trajectory, . Taking gives , so the ordinary estimator (a sample mean of ) is unbiased. The weighted estimator divides by rather than the count; as a ratio of two correlated sample means it is biased at finite , but both means converge ( and ), so the ratio converges to .
The two estimators sit at opposite ends of the bias–variance axis, and the variance end is where off-policy Monte Carlo becomes fragile. The ratio is a product over the episode; if and differ enough — or the horizon is long — the product swings across orders of magnitude, and the variance of ordinary IS can be unbounded: a single rare trajectory with a huge ratio dominates the average. Weighted IS caps this — its estimate can never exceed the largest observed return — trading a vanishing bias for a decisive variance reduction, which is why it is the practical default. Both degrade as the horizon grows and more ratio factors multiply in; this fragility is a major reason the field leans on the lower-variance, bootstrapped methods of Week 4.
The dynamic-programming bridge
Monte Carlo and dynamic programming estimate the same object from opposite information. DP needs the model and bootstraps — each value is written in terms of other current estimates — giving low variance but model dependence and bias while the estimates are wrong. MC needs only the ability to sample episodes and never bootstraps — each value is an average of full returns — giving no model dependence and no bias but high variance, and only for terminating tasks. Plotted on two axes, bootstrapping and sampling, DP samples nothing and bootstraps fully; MC samples fully and bootstraps not at all. Week 4’s temporal-difference learning is the missing corner: sample like MC, bootstrap like DP, and inherit a blend of their bias and variance.
What’s next
- Week 4 introduces temporal-difference learning: replace the full sampled return with the one-step bootstrap , fusing Monte Carlo sampling with the dynamic-programming backup and removing the need to wait for an episode to terminate.
- Week 5 confronts what happens when is a parametric approximation rather than a table, where bootstrapping and off-policy data interact dangerously.
Exercises
-
(Prove) Show the first-visit Monte Carlo estimator is unbiased for for every fixed sample size, citing where independence across episodes is used (Prop. 3.1).
Solution
Each first-visit return has by the definition of the value as the expected return from under . The estimator is the mean of such draws, so by linearity — unbiased at any . Independence (one draw per episode, first visit only) is not needed for unbiasedness but is what makes the variance and licenses the strong law for consistency.
-
(Derive) Starting from the trajectory likelihood under and , derive the importance-sampling ratio and show (Prop. 3.2).
Solution
The probability of given factors as . Dividing the -likelihood by the -likelihood, every factor cancels, leaving . Then .
-
(Compute) Target is greedy (prob. 1 on action ); behaviour is uniform over two actions. For an episode of length 3 that happens to take at every step, compute . What is if any step takes the non-target action?
Solution
Each on-target step contributes , so . If any step takes the non-target action, there, so — that trajectory contributes nothing, the discrete face of the variance problem (a few high-ratio trajectories carry the whole estimate).
-
(Implement) On the companion’s random-walk MDP, verify first-visit MC converges to the analytic from
dp.policy_evaluation, and that ordinary IS is unbiased while weighted IS has lower variance across seeds.Solution
See
experiments/python/week03/test_mc.py: it computes exactly with the Week-1 linear solve, then asserts first-visit MC matches it within a sampling tolerance, the ordinary-IS mean across many seeds is unbiased, and — when the behaviour policy under-samples the reward path (the heavy-tailed regime; under a mild mismatch ordinary IS is already low-variance) — the weighted-IS empirical variance is below the ordinary-IS variance. -
(Extend) Make the behaviour policy progressively worse (further from ) and measure how ordinary- and weighted-IS variance grow. Relate the blow-up to the product structure of .
Solution
As diverges from , individual step ratios stray from and their product’s variance compounds geometrically in the horizon; ordinary-IS variance grows fastest (it can be unbounded), weighted-IS more slowly (bounded by the largest return). The companion’s
--mismatchsweep exhibits the monotone growth.
Companion code
The Week-3 companion lives at experiments/python/week03/ and reuses the Week-1
linear solve (dp.policy_evaluation) as the exact oracle against which the sampled
estimates are checked — the core suite has no environment dependency.
randomwalk.py— builds a small episodic random-walk MDP (terminal states at both ends) as a generic(P, R)array pair, so is available in closed form fromdp.policy_evaluation.mc.py— episode sampling from a generic(P, R, policy), first-visit MC prediction, and off-policy prediction with both ordinary and weighted importance sampling. Pure NumPy.test_mc.py— statistical-correctness tests: first-visit MC converges to the analytic within a sampling tolerance; ordinary IS is unbiased across seeds; weighted IS has strictly lower empirical variance.blackjack.py— the canonical Sutton & Barto Monte-Carlo showcase: MC control on Gymnasium’sBlackjack-v1(optional extra; skipped when Gymnasium is absent).
# core algorithms + correctness tests (pure NumPy, no Gymnasium needed)
PYTHONPATH=. pytest experiments/python/week03/test_mc.py -q
# canonical Blackjack MC-control showcase (optional: pip install "gymnasium")
PYTHONPATH=. python experiments/python/week03/blackjack.py