Part I · Foundations Week 7 Published reinforce.py test_reinforce.py

Policy Gradient Foundations

Optimizing a parameterized stochastic policy directly by gradient ascent on expected return: the policy gradient theorem via the log-derivative trick, REINFORCE, and baselines as variance-reducing control variates. Policy gradients as Monte Carlo sensitivity analysis — and the advantage that bridges to actor-critic.

On this page

The objective and the score
The policy gradient theorem
REINFORCE
Baselines as control variates
The dynamic-programming bridge
What’s next
Exercises
Companion code

Where we are. Every method so far learned a value and acted greedily, via the $\argmax$ of Chapters 4–6. Policy-gradient methods discard that indirection: they parameterize the policy $\policy_\theta(a\mid s)$ and ascend the gradient of expected return directly. This sidesteps the $\max$ and its overestimation (Chapter 6), handles stochastic policies and continuous actions natively, and rests on one identity, the log-derivative trick, that turns “differentiate an expectation you can only sample” into “weight samples by the score $\nabla_\theta\log\policy_\theta$.” The roadmap’s framing is exact: policy gradients are Monte Carlo sensitivity analysis.

Chapter 7 — at a glance

Goal. State the objective $J(\theta)$; prove the policy gradient theorem with the log-derivative trick; read REINFORCE off it; prove a state-dependent baseline leaves the gradient unbiased while cutting variance; and identify the advantage as the centered, variance-reducing weight.

Reading time. ~35 minutes; ~55 with the proofs and exercises.

Key insight: the DP bridge. Value-based RL solved the Bellman fixed point and derived a policy by $\argmax$. Policy gradients optimize the policy as the primal object and let value functions return only as a baseline/critic (Week 8). With the standard variance-reducing baseline $b = \valuefn$, the weight on the score becomes the advantage $\advantage = \qfn - \valuefn$, the same advantage that drives actor-critic and that, in continuous time, is the adjoint/Hamiltonian sensitivity of optimal control (Part II).

The objective and the score

Let the policy be $\policy_\theta(a\mid s)$, differentiable in $\theta$. A trajectory $\tau = (S_0, A_0, R_1, \dots)$ has return $\return(\tau) = \sum_{t\ge 0}\discount^t R_{t+1}$, and the objective is its expectation,

$$J(\theta) \defeq \E_{\tau\sim\policy_\theta}\!\big[\,\return(\tau)\,\big].$$

We cannot differentiate $J$ by differentiating the reward: the dependence on $\theta$ is through the sampling distribution of $\tau$, not the integrand. The score function $\nabla_\theta\log\policy_\theta(a\mid s)$ is what carries that dependence, via one identity.
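
As a concrete instance of a score function, the sketch below assumes a state-free softmax policy whose parameters are the logits themselves (a hypothetical parameterization chosen for illustration; the chapter leaves $\policy_\theta$ generic). It computes the analytic score and checks it against a central finite difference of $\log\policy_\theta$. The helper names `softmax`, `log_prob`, `score`, and `finite_difference_score` are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_prob(theta, a):
    """log pi_theta(a) for a state-free softmax policy with logits theta."""
    return math.log(softmax(theta)[a])

def score(theta, a):
    """Analytic score: d/d theta_j log pi_theta(a) = 1[j == a] - pi_theta(j)."""
    probs = softmax(theta)
    return [(1.0 if j == a else 0.0) - probs[j] for j in range(len(theta))]

def finite_difference_score(theta, a, eps=1e-6):
    """Central-difference approximation of the same gradient."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        up[j] += eps
        down = list(theta)
        down[j] -= eps
        grad.append((log_prob(up, a) - log_prob(down, a)) / (2 * eps))
    return grad

theta = [0.5, -1.0, 2.0]          # arbitrary illustrative logits
analytic = score(theta, a=1)
numeric = finite_difference_score(theta, a=1)
max_gap = max(abs(x - y) for x, y in zip(analytic, numeric))
print("analytic score:", analytic)
print("max |analytic - numeric|:", max_gap)
```

The two gradients should agree to within finite-difference error, which is a cheap way to catch sign or indexing mistakes before a score is used inside a gradient estimator.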

The policy gradient theorem

The gradient of $J$ has a famously clean form, due to Sutton et al. (2000): an expectation of the score weighted by the action-value, with no derivative of the unknown dynamics anywhere in it.

Theorem 7.1 (Policy gradient theorem).

The gradient of the expected return is

$$\nabla_\theta J(\theta) = \E_{\tau\sim\policy_\theta}\!\Big[\, \return(\tau)\sum_{t\ge 0}\nabla_\theta\log\policy_\theta(A_t\mid S_t)\,\Big] = \E_{\policy_\theta}\!\Big[\sum_{t\ge 0}\nabla_\theta\log\policy_\theta(A_t\mid S_t)\,\qfn_{\policy_\theta}(S_t,A_t)\Big].$$

No derivative of the dynamics or the reward appears — only the score of the policy.

Proof.

Write $J(\theta) = \int p_\theta(\tau)\,\return(\tau)\,d\tau$ with trajectory density $p_\theta(\tau) = p(s_0)\prod_t \policy_\theta(a_t\mid s_t)\,\transition(s_{t+1}\mid s_t,a_t)$. Differentiate and apply the log-derivative trick $\nabla_\theta p_\theta = p_\theta\nabla_\theta\log p_\theta$:

$$\begin{aligned} \nabla_\theta J &= \int \nabla_\theta p_\theta(\tau)\,\return(\tau)\,d\tau && \text{(differentiate under the integral)} \\ &= \int p_\theta(\tau)\,\nabla_\theta\log p_\theta(\tau)\,\return(\tau)\,d\tau && \text{(log-derivative trick)} \\ &= \E_{\tau}\!\big[\,\return(\tau)\,\nabla_\theta\log p_\theta(\tau)\,\big] && \text{(definition of expectation).} \end{aligned}$$

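In $\log p_\theta(\tau) = \log p(s_0) + \sum_t\big[\log\policy_\theta(a_t\mid s_t) + \log\transition(s_{t+1}\mid s_t,a_t)\big]$ the initial-state and dynamics terms do not depend on $\theta$, so $\nabla_\theta\log p_\theta(\tau) = \sum_t\nabla_\theta\log \policy_\theta(a_t\mid s_t)$: the model need not be known or differentiated. That gives the first form. Causality (an action cannot influence past rewards, so $\E[\nabla_\theta\log\policy_\theta(A_t\mid S_t)R_{t'}] = 0$ for $t' \le t$) lets us replace $\return(\tau)$ in the $t$-th term by the return-to-go from time $t$, whose conditional expectation given $(S_t, A_t)$ is $\qfn_{\policy_\theta}(S_t,A_t)$, giving the second. Strictly, with $\discount < 1$ the surviving rewards carry a factor $\discount^t$, so the exact weight on the $t$-th term is $\discount^t\,\qfn_{\policy_\theta}(S_t,A_t)$; the second form as written is exact for $\discount = 1$ and otherwise follows the common convention of dropping that factor. $\qquad\blacksquare$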
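
A one-step sanity check makes the identity concrete. The setup is an illustrative assumption, not part of the chapter: a single decision with two actions, a scalar parameter with $\policy_\theta(1) = \sigma(\theta)$ and $\policy_\theta(0) = 1 - \sigma(\theta)$ for the logistic function $\sigma$, and deterministic rewards $r_0, r_1$. Using $\nabla_\theta\log\sigma(\theta) = 1 - \sigma(\theta)$ and $\nabla_\theta\log(1 - \sigma(\theta)) = -\sigma(\theta)$, the direct derivative and the score-weighted expectation agree:

$$\begin{aligned} J(\theta) &= \sigma(\theta)\,r_1 + \big(1 - \sigma(\theta)\big)\,r_0, \qquad \nabla_\theta J = \sigma(\theta)\big(1 - \sigma(\theta)\big)(r_1 - r_0), \\ \E_{A\sim\policy_\theta}\!\big[\,r_A\,\nabla_\theta\log\policy_\theta(A)\,\big] &= \sigma(\theta)\,r_1\,\big(1 - \sigma(\theta)\big) + \big(1 - \sigma(\theta)\big)\,r_0\,\big(-\sigma(\theta)\big) = \sigma(\theta)\big(1 - \sigma(\theta)\big)(r_1 - r_0). \end{aligned}$$

The rewards are never differentiated; only the policy probabilities are.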

The estimator is Monte Carlo sensitivity analysis: it estimates $\nabla_\theta\E[ \cdot]$ from samples without differentiating the sampled function, by reweighting each sample by its score.

It is unbiased and, being Monte Carlo (Chapter 3), high-variance — which the rest of the chapter attacks.
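
The reweighting can be checked numerically on the smallest possible case. The sketch below assumes a one-step problem with a state-free softmax policy over three actions and fixed per-action rewards (all hypothetical values chosen for illustration), and compares the score-weighted Monte Carlo estimate with the gradient computed in closed form. The function names are illustrative.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def exact_gradient(theta, rewards):
    """Closed form: d/d theta_j of sum_a pi(a) r(a) equals pi(j) * (r(j) - J)."""
    probs = softmax(theta)
    j_value = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - j_value) for p, r in zip(probs, rewards)]

def score_function_gradient(theta, rewards, num_samples, rng):
    """Monte Carlo estimate: average of r(A) * grad log pi(A) with A ~ pi_theta."""
    probs = softmax(theta)
    actions = list(range(len(theta)))
    grad = [0.0] * len(theta)
    for _ in range(num_samples):
        a = rng.choices(actions, weights=probs)[0]
        for j in actions:
            score_j = (1.0 if j == a else 0.0) - probs[j]
            grad[j] += rewards[a] * score_j / num_samples
    return grad

rng = random.Random(0)
theta = [0.2, -0.3, 0.1]       # illustrative logits
rewards = [1.0, 0.0, 2.0]      # toy deterministic reward per action
exact = exact_gradient(theta, rewards)
estimate = score_function_gradient(theta, rewards, 50_000, rng)
max_err = max(abs(e - g) for e, g in zip(exact, estimate))
print("exact:   ", exact)
print("estimate:", estimate)
print("max abs error:", max_err)
```

The estimator only evaluates the reward at sampled actions; it never differentiates it. Shrinking `num_samples` makes the sampling noise visible, which is the variance the following sections address.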

REINFORCE

Sampling the theorem’s expectation gives the REINFORCE algorithm (Williams, 1992): roll out episodes under $\policy_\theta$, form the returns-to-go $\return_t$, and ascend

$$\widehat{\nabla_\theta J} = \sum_t \nabla_\theta\log\policy_\theta(A_t\mid S_t)\,\return_t, \qquad \theta \leftarrow \theta + \alpha\,\widehat{\nabla_\theta J}.$$

It is unbiased and model-free, the policy-space counterpart of Monte Carlo value estimation — and inherits Monte Carlo’s variance.
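
A minimal sketch of the update, in plain Python rather than the companion's PyTorch, and not taken from `reinforce.py`. It assumes a toy episodic task invented for illustration: a state-free softmax policy over two actions, a fixed horizon, and reward 1 whenever action 1 is taken. The names `returns_to_go` and `reinforce_update` and all hyperparameters are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def returns_to_go(rewards, gamma):
    """G_t = sum_{k >= t} gamma^(k - t) * R_{k+1}, computed by a backward scan."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_update(theta, gamma, alpha, horizon, rng):
    """Sample one toy episode, then take one REINFORCE ascent step."""
    probs = softmax(theta)
    actions = [rng.choices([0, 1], weights=probs)[0] for _ in range(horizon)]
    rewards = [float(a) for a in actions]  # toy reward: 1 for action 1, else 0
    returns = returns_to_go(rewards, gamma)
    grad = [0.0, 0.0]
    for a, g in zip(actions, returns):
        for j in range(2):
            # score of the softmax policy times the return-to-go
            grad[j] += ((1.0 if j == a else 0.0) - probs[j]) * g
    return [t + alpha * gj for t, gj in zip(theta, grad)]

rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(1000):
    theta = reinforce_update(theta, gamma=0.9, alpha=0.1, horizon=3, rng=rng)
final_probs = softmax(theta)
print("returns-to-go for [1, 1, 1] at gamma=0.5:", returns_to_go([1.0, 1.0, 1.0], 0.5))
print("P(action 1) after training:", final_probs[1])
```

The backward scan computes all returns-to-go in one pass over the episode. On this toy task the policy should shift its probability mass toward the rewarded action, though individual updates remain noisy.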

The single most effective fix is a baseline.

Baselines as control variates

Proposition 7.1 (A state baseline is unbiased).

For any function $b(s)$ that does not depend on the action,

$$\E_{\policy_\theta}\!\big[\,\nabla_\theta\log\policy_\theta(A_t\mid S_t)\,b(S_t)\,\big] = 0,$$

so subtracting $b(S_t)$ from the return weight in the policy gradient leaves it unbiased, while choosing $b$ to track the typical return reduces its variance.

Proof.

Condition on $S_t = s$ and average the score over actions:

$$\E_{a\sim\policy_\theta(\cdot\mid s)}\!\big[\nabla_\theta\log\policy_\theta(a\mid s)\big] = \sum_a \policy_\theta(a\mid s)\,\frac{\nabla_\theta\policy_\theta(a\mid s)}{\policy_\theta(a\mid s)} = \sum_a \nabla_\theta\policy_\theta(a\mid s) = \nabla_\theta\sum_a\policy_\theta(a\mid s) = \nabla_\theta 1 = 0.$$

Hence $\E[\nabla_\theta\log\policy_\theta(A_t\mid S_t)\,b(S_t)] = \E[b(S_t)\cdot 0] = 0$: subtracting $b(S_t)$ changes the gradient’s variance but not its mean. Variance falls when the weight is centered around zero. The exact variance-minimizing baseline is an average of the return weighted by the squared norm of the score; the standard, simpler choice $b(s) = \valuefn_{\policy_\theta}(s)$ turns the weight into the advantage $\advantage(s,a) = \qfn(s,a) - \valuefn(s)$. $\qquad\blacksquare$

A baseline is precisely a control variate: a zero-mean term subtracted to shrink variance without moving the estimate.

With $b = \valuefn$, the policy gradient becomes $\E[\sum_t \nabla_\theta\log\policy_\theta(A_t\mid S_t)\,\advantage(S_t,A_t)]$, the form every actor-critic method (Week 8) estimates. Learning the baseline $\valuefn$ is what makes it a critic.
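
Both halves of the control-variate argument can be verified exactly on a bandit by enumerating actions instead of sampling. The sketch below is an illustration under stated assumptions, not the companion's test: a three-armed bandit with a state-free softmax policy, rewards with a large common offset, and the expected reward used as the baseline. `estimator_moments` is a hypothetical helper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimator_moments(theta, rewards, baseline):
    """Exact mean and total variance of (r(A) - b) * grad log pi(A), A ~ pi_theta."""
    probs = softmax(theta)
    n = len(theta)
    mean = [0.0] * n
    second_moment = 0.0
    for a in range(n):
        weight = rewards[a] - baseline
        g = [((1.0 if j == a else 0.0) - probs[j]) * weight for j in range(n)]
        for j in range(n):
            mean[j] += probs[a] * g[j]
        second_moment += probs[a] * sum(x * x for x in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance

theta = [0.0, 0.5, -0.5]        # illustrative logits
rewards = [10.0, 11.0, 12.0]    # toy rewards with a large common offset
probs = softmax(theta)

# Expected score under the policy: should vanish in every coordinate.
expected_score = [
    sum(probs[a] * ((1.0 if j == a else 0.0) - probs[j]) for a in range(3))
    for j in range(3)
]

value = sum(p * r for p, r in zip(probs, rewards))  # expected reward as baseline
mean_plain, var_plain = estimator_moments(theta, rewards, baseline=0.0)
mean_base, var_base = estimator_moments(theta, rewards, baseline=value)
print("expected score:", expected_score)
print("variance without baseline:", var_plain)
print("variance with baseline:   ", var_base)
```

The means coincide while the variance drops sharply, because the common offset in the rewards contributes noise but no gradient signal. The expected reward is used here as a convenient baseline; the sketch does not claim it is the variance-optimal one.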

The dynamic-programming bridge

Policy gradients invert the value-based pattern. Chapters 4–6 solved a Bellman fixed point and read off a greedy policy; here the policy is the primal object optimized by gradient ascent, and value functions re-enter only as the baseline/critic that tames variance. Three threads carry forward:

To actor-critic (Week 8). Learn $\valuefn$ (or the advantage directly) as the baseline; the policy is the actor, the value the critic. Generalized advantage estimation tunes the bias–variance trade-off of the estimate of $\advantage$, and trust regions (PPO/TRPO) control the ascent step size in policy space.
To continuous control (Week 9). No $\argmax$ over actions is ever taken, so continuous action spaces are immediate — the setting where DDPG, TD3, and SAC live.
To optimal control (Part II). Differentiating an expected cost through the dynamics is the discrete, stochastic cousin of the adjoint/Hamiltonian sensitivity of Pontryagin’s principle — direct policy search is trajectory optimization with the model replaced by samples.

What’s next

Week 8 learns the baseline as a critic (actor-critic), tunes the advantage with generalized advantage estimation, and adds trust regions (TRPO/PPO) to bound the policy update — turning REINFORCE’s noisy ascent into a stable, sample-reusing optimizer.

Exercises

(Derive) Derive the policy gradient theorem from $J(\theta) = \E_{\tau}[\return(\tau)]$ using the log-derivative trick, and show the dynamics terms drop (Theorem 7.1).

Solution
$\nabla_\theta J = \int\nabla_\theta p_\theta(\tau)\return(\tau)d\tau = \E_\tau[ \return(\tau)\nabla_\theta\log p_\theta(\tau)]$. Since $\log p_\theta(\tau)$ splits into initial-state, policy, and dynamics terms and only the policy terms carry $\theta$, $\nabla_\theta\log p_\theta(\tau) = \sum_t\nabla_\theta\log\policy_\theta(a_t \mid s_t)$, so the dynamics terms drop out and the model is never differentiated.
(Prove) Show a state-dependent baseline $b(s)$ leaves the policy gradient unbiased, i.e. $\E[\nabla_\theta\log\policy_\theta(A\mid S)\,b(S)] = 0$ (Prop. 7.1).

Solution
$\E_{a\sim\policy_\theta(\cdot\mid s)}[\nabla_\theta\log\policy_\theta(a\mid s)] = \sum_a\nabla_\theta\policy_\theta(a\mid s) = \nabla_\theta\sum_a\policy_\theta(a\mid s) = \nabla_\theta 1 = 0$. Multiplying by $b(s)$ (constant in $a$) and taking the outer expectation over $S$ preserves the zero.
(Compute) For a softmax policy $\policy_\theta(a\mid s) \propto \exp(\theta_a^\top\phi(s))$, compute the score $\nabla_{\theta_j}\log\policy_\theta(a \mid s)$.

Solution
$\nabla_{\theta_j}\log\policy_\theta(a\mid s) = \big(\mathbf{1}[j=a] - \policy_\theta(j\mid s)\big)\phi(s)$: the feature, weighted by “taken minus probability.” Averaged over $a\sim\policy_\theta$ this is zero (Prop. 7.1), the discrete face of the expected-score identity.
(Implement) In the companion, verify the returns-to-go computation, that the expected score is zero (the baseline mechanism), that a baseline reduces gradient variance without bias, and that REINFORCE learns CartPole above the random baseline.

Solution
See experiments/python/week07/test_reinforce.py: a hand-checked discounted return-to-go; $\sum_a\policy(a\mid s)\nabla_\theta\log\policy(a\mid s)\approx 0$ for a softmax network; a bandit where the baselined estimator has strictly lower variance; and a seeded CartPole run whose mean return clears the ~22 random baseline.
(Extend) Add an entropy bonus $\beta\,\mathcal{H}(\policy_\theta(\cdot\mid s))$ to the objective and explain its effect on exploration. (The roadmap’s JAX jax.grad variant is deferred to the dedicated JAX track.)

Solution
The entropy bonus rewards less-peaked policies, slowing premature convergence to a deterministic policy and sustaining exploration; its gradient adds $\beta\nabla_\theta \mathcal{H}$ , pushing toward higher-entropy distributions. It reappears, promoted from a bonus to the objective, in maximum-entropy RL (Week 9, SAC).
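
The effect of the entropy bonus in the last exercise can be seen in isolation. The sketch below assumes a state-free softmax policy with the logits as parameters and ascends the entropy alone, with the return term omitted; `beta` is a hypothetical bonus weight that doubles as the step size here. It uses the closed-form gradient $\partial\mathcal{H}/\partial\theta_j = -\policy_\theta(j)\,(\log\policy_\theta(j) + \mathcal{H})$, which holds for this parameterization.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(theta):
    """Shannon entropy H(pi_theta) = -sum_a pi(a) log pi(a), in nats."""
    return -sum(p * math.log(p) for p in softmax(theta))

def entropy_gradient(theta):
    """Closed form for softmax logits: dH/d theta_j = -pi(j) * (log pi(j) + H)."""
    probs = softmax(theta)
    h = entropy(theta)
    return [-p * (math.log(p) + h) for p in probs]

theta = [3.0, 0.0, 0.0]   # a peaked starting policy (illustrative)
beta = 0.5                # hypothetical bonus weight, used as the step size
h_before = entropy(theta)
for _ in range(50):
    grad = entropy_gradient(theta)
    theta = [t + beta * g for t, g in zip(theta, grad)]
h_after = entropy(theta)
print("entropy before:", h_before)
print("entropy after: ", h_after)
print("policy after:  ", softmax(theta))
```

Ascending the entropy alone flattens the policy toward uniform. In the full objective this pull competes with the return gradient, which is what delays collapse onto a single action.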

Companion code

The Week-7 companion lives at experiments/python/week07/ and is PyTorch REINFORCE on CartPole-v1, with and without a value baseline.

reinforce.py — a softmax PolicyNetwork, the discounted compute_returns (returns-to-go), and a REINFORCE training loop with an optional learned-value baseline and return normalization.
test_reinforce.py — component-correctness tests (hand-checked returns-to-go; the expected score $\sum_a\policy\,\nabla\log\policy = 0$ that underlies Prop. 7.1; a bandit where the baselined gradient estimator has strictly lower variance) plus a seeded CartPole run learning well above the random-return baseline.

# component tests + a seeded CartPole learning check (PyTorch; ~1-2 min on CPU)
PYTHONPATH=. pytest experiments/python/week07/test_reinforce.py -q

# worked REINFORCE training run (with the value baseline)
PYTHONPATH=. python experiments/python/week07/reinforce.py --baseline --episodes 800