Policy Gradient Foundations
Optimizing a parameterized stochastic policy directly by gradient ascent on expected return: the policy gradient theorem via the log-derivative trick, REINFORCE, and baselines as variance-reducing control variates. Policy gradients as Monte Carlo sensitivity analysis — and the advantage that bridges to actor-critic.
Where we are. Every method so far learned a value and acted greedily, the pattern of Chapters 4–6. Policy-gradient methods discard that indirection: they parameterize the policy and ascend the gradient of expected return directly. This sidesteps the $\max$ operator and its overestimation (Chapter 6), handles stochastic policies and continuous actions natively, and rests on one identity, the log-derivative trick, that turns “differentiate an expectation you can only sample” into “weight samples by the score $\nabla_\theta \log \pi_\theta(a \mid s)$.” The roadmap’s framing is exact: policy gradients are Monte Carlo sensitivity analysis.
The objective and the score
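Let the policy be $\pi_\theta(a \mid s)$, differentiable in $\theta$. A trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$ has return $R(\tau) = \sum_{t \ge 0} \gamma^t r_t$, and the objective is its expectation,
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\big[ R(\tau) \big].$$
We cannot differentiate $J$ by differentiating the reward: the dependence on $\theta$ is through the sampling distribution of $\tau$, not the integrand. The score function $\nabla_\theta \log \pi_\theta(a \mid s)$ is what carries that dependence, via one identity.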
The policy gradient theorem
The gradient of $J$ has a famously clean form, due to Sutton et al. (2000): an expectation of the score weighted by the action-value, with no derivative of the unknown dynamics anywhere in it.
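Theorem 7.1 (policy gradient). The gradient of the expected return is
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\Big[ R(\tau) \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big] = \mathbb{E}_{\tau \sim p_\theta}\Big[ \sum_{t \ge 0} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t) \Big].$$
No derivative of the dynamics or the reward appears, only the score of the policy.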
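Proof. Write $J(\theta) = \int p_\theta(\tau)\, R(\tau)\, d\tau$ with trajectory density $p_\theta(\tau) = \rho_0(s_0) \prod_{t \ge 0} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$. Differentiate and apply the log-derivative trick $\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta$:
$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta}\big[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \big].$$
In $\log p_\theta(\tau) = \log \rho_0(s_0) + \sum_t \log \pi_\theta(a_t \mid s_t) + \sum_t \log P(s_{t+1} \mid s_t, a_t)$ the initial-state and dynamics terms do not depend on $\theta$, so $\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$: the model need not be known or differentiated. That gives the first form. Using causality (an action cannot influence past rewards, so $\mathbb{E}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r_{t'}] = 0$ for $t' < t$) replaces $R(\tau)$ in the $t$-th term by the discounted return-to-go $\gamma^t G_t$, with $G_t = \sum_{t' \ge t} \gamma^{t'-t} r_{t'}$, whose conditional expectation $\mathbb{E}[G_t \mid s_t, a_t]$ is $Q^{\pi_\theta}(s_t, a_t)$, giving the second.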
The estimator is Monte Carlo sensitivity analysis: it estimates $\nabla_\theta J(\theta)$ from samples without differentiating the sampled function, by reweighting each sample by its score. It is unbiased and, being Monte Carlo (Chapter 3), high-variance, which the rest of the chapter attacks.
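To make the reweighting concrete outside of RL, here is a minimal sketch of the same score-function estimator on a one-dimensional problem. The setup is an assumption chosen for illustration, not something from the chapter: $x \sim \mathcal{N}(\theta, 1)$, for which the score is $x - \theta$, and $f(x) = x^2$, for which the exact answer $\frac{d}{d\theta}\mathbb{E}[x^2] = 2\theta$ is available as a check. The function name `score_function_gradient` is hypothetical.

```python
import random


def score_function_gradient(theta, f, num_samples=200_000, seed=0):
    """Estimate d/dtheta E_{x ~ N(theta, 1)}[f(x)] without differentiating f.

    Each sample f(x) is reweighted by the score
    d/dtheta log N(x; theta, 1) = x - theta.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(theta, 1.0)
        score = x - theta
        total += f(x) * score
    return total / num_samples


theta = 1.5
estimate = score_function_gradient(theta, lambda x: x * x)
exact = 2.0 * theta  # since E[x^2] = theta^2 + 1

print(f"score-function estimate: {estimate:.3f}, exact: {exact:.3f}")
```

The estimate only approaches the exact value with a large sample count, which is the high variance the text refers to.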
REINFORCE
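Sampling the theorem’s expectation gives the REINFORCE algorithm (Williams, 1992): roll out episodes under $\pi_\theta$, form the returns-to-go $G_t$, and ascend
$$\theta \leftarrow \theta + \alpha \sum_{t \ge 0} \gamma^t\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$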
It is unbiased and model-free, the policy-space counterpart of Monte Carlo value estimation — and inherits Monte Carlo’s variance. The single most effective fix is a baseline.
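The loop can be sketched in plain Python on a toy task. Everything here is an illustrative assumption rather than the companion's code: the task (reward 1 for action 1, reward 0 for action 0, fixed horizon), the state-independent two-logit softmax policy, the hyperparameters, and the names `returns_to_go` and `reinforce`. The weight `gamma ** t` follows the discounted objective exactly; many practical implementations drop it.

```python
import math
import random


def returns_to_go(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed by one backward pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(num_episodes=2000, horizon=5, gamma=0.9, lr=0.05, seed=0):
    """REINFORCE on a toy task: reward 1 for action 1, 0 for action 0."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # one logit per action; the policy ignores the state
    for _ in range(num_episodes):
        probs = softmax(theta)
        actions = [0 if rng.random() < probs[0] else 1 for _ in range(horizon)]
        rewards = [float(a == 1) for a in actions]
        grad = [0.0, 0.0]
        for t, (a, g) in enumerate(zip(actions, returns_to_go(rewards, gamma))):
            for k in range(2):
                # Softmax score: d/dtheta_k log pi(a) = 1[k == a] - pi(k).
                score_k = (1.0 if k == a else 0.0) - probs[k]
                grad[k] += (gamma ** t) * g * score_k
        theta = [th + lr * gk for th, gk in zip(theta, grad)]  # gradient ascent
    return softmax(theta)


print(returns_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
final_probs = reinforce()
print(f"P(action 1) after training: {final_probs[1]:.3f}")
```

No baseline is used here, so every sampled action with a positive return-to-go is reinforced; the better action wins only because it is reinforced more on average.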
Baselines as control variates
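Proposition 7.1 (baseline). For any function $b(s)$ that does not depend on the action,
$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, b(s) \big] = 0,$$
so subtracting $b(s_t)$ from the return weight in the policy gradient leaves it unbiased, while choosing $b$ to track the typical return reduces its variance.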
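Proof. Condition on $s$ and average the score over actions:
$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ \nabla_\theta \log \pi_\theta(a \mid s) \big] = \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = \sum_a \nabla_\theta \pi_\theta(a \mid s) = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0.$$
Hence $\mathbb{E}_{a}\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, b(s) \big] = b(s) \cdot 0 = 0$: subtracting $b$ changes the gradient’s variance but not its mean. The variance-minimizing choice makes the weight a centered quantity; taking $b(s) = V^{\pi_\theta}(s)$ turns the weight into the advantage $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$.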
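A baseline is precisely a control variate: a zero-mean term subtracted to shrink variance without moving the expectation. With $b = V^{\pi_\theta}$, the policy gradient becomes $\mathbb{E}_{\tau \sim p_\theta}\big[ \sum_{t \ge 0} \gamma^t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t) \big]$, the form every actor-critic method (Week 8) estimates. Learning the baseline is what makes it a critic.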
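The control-variate claim can be checked exactly on a small bandit, with no sampling noise. This is a sketch under illustrative assumptions: three arms with deterministic rewards, a uniform softmax policy, and the single-sample estimator $g = (r_a - b)\, \partial_{\theta_0} \log \pi_\theta(a)$ for the first logit only. Its mean and variance are computed by enumerating the actions. The name `estimator_moments` is hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimator_moments(theta, rewards, baseline):
    """Exact mean and variance of g = (r_a - b) * d/dtheta_0 log pi(a), a ~ pi."""
    probs = softmax(theta)
    weighted = []
    for a, (p, r) in enumerate(zip(probs, rewards)):
        score_0 = (1.0 if a == 0 else 0.0) - probs[0]  # softmax score, logit 0
        weighted.append((p, (r - baseline) * score_0))
    mean = sum(p * g for p, g in weighted)
    var = sum(p * (g - mean) ** 2 for p, g in weighted)
    return mean, var


theta = [0.0, 0.0, 0.0]       # uniform policy over three arms
rewards = [10.0, 11.0, 12.0]  # large common offset, small differences
value = sum(p * r for p, r in zip(softmax(theta), rewards))  # V = E_pi[r]

mean_raw, var_raw = estimator_moments(theta, rewards, baseline=0.0)
mean_base, var_base = estimator_moments(theta, rewards, baseline=value)

print(f"no baseline:  mean={mean_raw:.4f}  var={var_raw:.4f}")
print(f"b = V:        mean={mean_base:.4f}  var={var_base:.4f}")
```

Both means agree, while the variance drops by more than two orders of magnitude here, because the baseline removes the common reward offset that carries no information about which arm is better.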
The dynamic-programming bridge
Policy gradients invert the value-based pattern. Chapters 4–6 solved a Bellman fixed point and read off a greedy policy; here the policy is the primal object optimized by gradient ascent, and value functions re-enter only as the baseline/critic that tames variance. Three threads carry forward:
- To actor-critic (Week 8). Learn $V^{\pi_\theta}$ (or the advantage directly) as the baseline; the policy is the actor, the value the critic. Generalized advantage estimation tunes the bias–variance of the advantage estimate $\hat{A}_t$, and trust regions (PPO/TRPO) control the ascent step size in policy space.
- To continuous control (Week 9). No $\max$ over actions is ever taken, so continuous action spaces are immediate: the setting where DDPG, TD3, and SAC live.
- To optimal control (Part II). Differentiating an expected cost through the dynamics is the discrete, stochastic cousin of the adjoint/Hamiltonian sensitivity of Pontryagin’s principle — direct policy search is trajectory optimization with the model replaced by samples.
What’s next
- Week 8 learns the baseline as a critic (actor-critic), tunes the advantage with generalized advantage estimation, and adds trust regions (TRPO/PPO) to bound the policy update — turning REINFORCE’s noisy ascent into a stable, sample-reusing optimizer.
Exercises
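- (Derive) Derive the policy gradient theorem from $J(\theta) = \int p_\theta(\tau)\, R(\tau)\, d\tau$ using the log-derivative trick, and show the dynamics terms drop (Theorem 7.1).
  Solution. $\nabla_\theta J = \int \nabla_\theta p_\theta\, R\, d\tau = \int p_\theta\, \nabla_\theta \log p_\theta\, R\, d\tau = \mathbb{E}\big[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \big]$. Since $\log p_\theta(\tau)$ splits into initial-state, policy, and dynamics terms and only the policy terms carry $\theta$, $\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$: the model cancels.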
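- (Prove) Show a state-dependent baseline leaves the policy gradient unbiased, i.e. $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, b(s) \big] = 0$ (Prop. 7.1).
  Solution. $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ \nabla_\theta \log \pi_\theta(a \mid s) \big] = \sum_a \nabla_\theta \pi_\theta(a \mid s) = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0$. Multiplying by $b(s)$ (constant in $a$) and taking the outer expectation over $s$ preserves the zero.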
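- (Compute) For a softmax policy $\pi_\theta(a \mid s) \propto \exp\big(\theta^\top \phi(s, a)\big)$, compute the score $\nabla_\theta \log \pi_\theta(a \mid s)$.
  Solution. $\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s, a) - \sum_{a'} \pi_\theta(a' \mid s)\, \phi(s, a') = \sum_{a'} \big( \mathbb{1}[a' = a] - \pi_\theta(a' \mid s) \big)\, \phi(s, a')$: the features, weighted by “taken minus probability.” Summed over $a$ with weights $\pi_\theta(a \mid s)$ this is zero (Prop. 7.1), the discrete face of the expected-score identity.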
- (Implement) In the companion, verify the returns-to-go computation, that the expected score is zero (the baseline mechanism), that a baseline reduces gradient variance without bias, and that REINFORCE learns CartPole above the random baseline.
  Solution. See `experiments/python/week07/test_reinforce.py`: a hand-checked discounted return-to-go; $\mathbb{E}_{a \sim \pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a \mid s) \big] = 0$ for a softmax network; a bandit where the baselined estimator has strictly lower variance; and a seeded CartPole run whose mean return clears the ~22 random baseline.
- (Extend) Add an entropy bonus $\beta\, \mathcal{H}\big(\pi_\theta(\cdot \mid s)\big)$ to the objective and explain its effect on exploration. (The roadmap’s JAX `jax.grad` variant is deferred to the dedicated JAX track.)
  Solution. The entropy bonus rewards less-peaked policies, slowing premature convergence to a deterministic policy and sustaining exploration; its gradient adds $\beta\, \nabla_\theta \mathcal{H}\big(\pi_\theta(\cdot \mid s)\big)$, pushing toward higher-entropy distributions. It reappears, promoted from a bonus to the objective, in maximum-entropy RL (Week 9, SAC).
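The softmax score from the (Compute) exercise can be sanity-checked numerically. The sketch below assumes a linear-softmax policy at a single fixed state with made-up feature vectors. It compares the closed form $\phi(s,a) - \sum_{a'} \pi_\theta(a' \mid s)\,\phi(s,a')$ against a central finite difference of $\log \pi_\theta(a \mid s)$, then confirms that the $\pi$-weighted sum of scores vanishes. All function names and numbers are hypothetical.

```python
import math


def softmax_probs(theta, features):
    """pi(a) proportional to exp(theta . phi_a), for one fixed state."""
    logits = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def analytic_score(theta, features, action):
    """Closed form: phi_a minus the policy-weighted mean feature."""
    probs = softmax_probs(theta, features)
    dim = len(theta)
    mean_phi = [sum(p * phi[i] for p, phi in zip(probs, features)) for i in range(dim)]
    return [features[action][i] - mean_phi[i] for i in range(dim)]


def numeric_score(theta, features, action, eps=1e-6):
    """Central finite difference of log pi(action) in each coordinate."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        lp_up = math.log(softmax_probs(up, features)[action])
        lp_down = math.log(softmax_probs(down, features)[action])
        grad.append((lp_up - lp_down) / (2 * eps))
    return grad


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up phi(s, a), 3 actions
theta = [0.3, -0.2]
probs = softmax_probs(theta, features)
scores = [analytic_score(theta, features, a) for a in range(3)]

max_fd_error = max(
    abs(x - y)
    for a in range(3)
    for x, y in zip(scores[a], numeric_score(theta, features, a))
)
expected_score = [sum(p * s[i] for p, s in zip(probs, scores)) for i in range(2)]

print(f"max finite-difference error: {max_fd_error:.2e}")
print(f"expected score: {expected_score}")
```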
Companion code
The Week-7 companion lives at `experiments/python/week07/` and is PyTorch REINFORCE on CartPole-v1, with and without a value baseline.
- `reinforce.py`: a softmax `PolicyNetwork`, the discounted `compute_returns` (returns-to-go), and a REINFORCE training loop with an optional learned-value baseline and return normalization.
- `test_reinforce.py`: component-correctness tests (hand-checked returns-to-go; the expected score $\mathbb{E}[\nabla_\theta \log \pi_\theta] = 0$ that underlies Prop. 7.1; a bandit where the baselined gradient estimator has strictly lower variance) plus a seeded CartPole run learning well above the random-return baseline.
# component tests + a seeded CartPole learning check (PyTorch; ~1-2 min on CPU)
PYTHONPATH=. pytest experiments/python/week07/test_reinforce.py -q
# worked REINFORCE training run (with the value baseline)
PYTHONPATH=. python experiments/python/week07/reinforce.py --baseline --episodes 800