Part I · Foundations Week 7 Published reinforce.py test_reinforce.py

Policy Gradient Foundations

Optimizing a parameterized stochastic policy directly by gradient ascent on expected return: the policy gradient theorem via the log-derivative trick, REINFORCE, and baselines as variance-reducing control variates. Policy gradients as Monte Carlo sensitivity analysis — and the advantage that bridges to actor-critic.

On this page
  1. The objective and the score
  2. The policy gradient theorem
  3. REINFORCE
  4. Baselines as control variates
  5. The dynamic-programming bridge
  6. What’s next
  7. Exercises
  8. Companion code

Where we are. Every method so far learned a value and acted greedily: the $\arg\max$ of Chapters 4–6. Policy-gradient methods discard that indirection: they parameterize the policy $\pi_\theta(a\mid s)$ and ascend the gradient of expected return directly. This sidesteps the $\max$ and its overestimation (Chapter 6), handles stochastic policies and continuous actions natively, and rests on one identity, the log-derivative trick, that turns “differentiate an expectation you can only sample” into “weight samples by the score $\nabla_\theta\log\pi_\theta$.” The roadmap’s framing is exact: policy gradients are Monte Carlo sensitivity analysis.

The objective and the score

Let the policy be $\pi_\theta(a\mid s)$, differentiable in $\theta$. A trajectory $\tau = (S_0, A_0, R_1, \dots)$ has return $G(\tau) = \sum_{t\ge 0}\gamma^t R_{t+1}$, and the objective is its expectation,

$$J(\theta) := \mathbb{E}_{\tau\sim\pi_\theta}\!\big[\,G(\tau)\,\big].$$

We cannot differentiate $J$ by differentiating the reward: the dependence on $\theta$ is through the sampling distribution of $\tau$, not the integrand. The score function $\nabla_\theta\log\pi_\theta(a\mid s)$ is what carries that dependence, via one identity.

The policy gradient theorem

The gradient of $J$ has a famously clean form, due to Sutton et al. (2000): an expectation of the score weighted by the action-value, with no derivative of the unknown dynamics anywhere in it.

Theorem 7.1 (Policy gradient theorem).

The gradient of the expected return is

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\!\Big[\, G(\tau)\sum_{t\ge 0}\nabla_\theta\log\pi_\theta(A_t\mid S_t)\,\Big] = \mathbb{E}_{\pi_\theta}\!\Big[\sum_{t\ge 0}\nabla_\theta\log\pi_\theta(A_t\mid S_t)\,q_{\pi_\theta}(S_t,A_t)\Big].$$

No derivative of the dynamics or the reward appears — only the score of the policy.

Proof.

Write $J(\theta) = \int p_\theta(\tau)\,G(\tau)\,d\tau$ with trajectory density $p_\theta(\tau) = p(s_0)\prod_t \pi_\theta(a_t\mid s_t)\,p(s_{t+1}\mid s_t,a_t)$. Differentiate and apply the log-derivative trick $\nabla_\theta p_\theta = p_\theta\nabla_\theta\log p_\theta$:

$$\begin{aligned} \nabla_\theta J &= \int \nabla_\theta p_\theta(\tau)\,G(\tau)\,d\tau && \text{(differentiate under the integral)} \\ &= \int p_\theta(\tau)\,\nabla_\theta\log p_\theta(\tau)\,G(\tau)\,d\tau && \text{(log-derivative trick)} \\ &= \mathbb{E}_{\tau}\!\big[\,G(\tau)\,\nabla_\theta\log p_\theta(\tau)\,\big] && \text{(definition of expectation).} \end{aligned}$$

In $\log p_\theta(\tau) = \log p(s_0) + \sum_t\big[\log\pi_\theta(a_t\mid s_t) + \log p(s_{t+1}\mid s_t,a_t)\big]$ the initial-state and dynamics terms do not depend on $\theta$, so $\nabla_\theta\log p_\theta(\tau) = \sum_t\nabla_\theta\log \pi_\theta(a_t\mid s_t)$: the model need not be known or differentiated. That gives the first form. Using causality (an action cannot influence past rewards, $\mathbb{E}[\nabla_\theta\log\pi_\theta(A_t\mid S_t)\,R_{t'}] = 0$ for $t' \le t$) replaces $G(\tau)$ by the return-to-go, whose conditional expectation given $(S_t, A_t)$ is $q_{\pi_\theta}(S_t,A_t)$, giving the second. $\qquad\blacksquare$
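The causality step compresses two conditioning arguments. A sketch of it written out, assuming the Markov policy already built into the trajectory density and using $H_t$ as a shorthand introduced here (not in the original) for the history up to and including $S_t$; the inner expectation vanishes by the sum-to-one argument of Proposition 7.1:

```latex
\begin{aligned}
\mathbb{E}\big[\nabla_\theta\log\pi_\theta(A_t\mid S_t)\,R_{t'}\big]
  &= \mathbb{E}\Big[R_{t'}\,\mathbb{E}\big[\nabla_\theta\log\pi_\theta(A_t\mid S_t)\,\big|\,H_t\big]\Big] = 0
  && (t' \le t,\ R_{t'} \text{ is fixed by } H_t) \\
\mathbb{E}\Big[G(\tau)\sum_{t\ge 0}\nabla_\theta\log\pi_\theta(A_t\mid S_t)\Big]
  &= \mathbb{E}\Big[\sum_{t\ge 0}\nabla_\theta\log\pi_\theta(A_t\mid S_t)\sum_{k\ge t}\gamma^{k}R_{k+1}\Big]
  && \text{(drop rewards with } k < t\text{)} \\
  &= \mathbb{E}\Big[\sum_{t\ge 0}\gamma^{t}\,\nabla_\theta\log\pi_\theta(A_t\mid S_t)\,q_{\pi_\theta}(S_t,A_t)\Big]
  && \text{(tower rule on } (S_t, A_t)\text{).}
\end{aligned}
```

Read this way, the second form of the theorem is exact when $\gamma = 1$; for $\gamma < 1$ the step-$t$ term carries a weight $\gamma^t$ that the theorem's statement, like most implementations, leaves implicit.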

The estimator is Monte Carlo sensitivity analysis: it estimates $\nabla_\theta\mathbb{E}[\,\cdot\,]$ from samples without differentiating the sampled function, by reweighting each sample by its score.

It is unbiased and, being Monte Carlo (Chapter 3), high-variance — which the rest of the chapter attacks.
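The reweighting idea can be seen in isolation on a one-dimensional toy problem. The sketch below is not from the companion code; it assumes a Gaussian sampling distribution $x \sim \mathcal{N}(\theta, 1)$ and the test function $f(x) = x^2$, both chosen here only because the exact answer $\nabla_\theta\mathbb{E}[x^2] = 2\theta$ is known. The function name `score_function_gradient` is hypothetical.

```python
import random


def score_function_gradient(theta, f, n_samples, rng):
    """Estimate d/dtheta E_{x ~ N(theta, 1)}[f(x)] without differentiating f."""
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)
        score = x - theta  # d/dtheta log N(x; theta, 1)
        total += f(x) * score  # weight the sample by its score
    return total / n_samples


rng = random.Random(0)  # seeded for reproducibility
theta = 1.5
estimate = score_function_gradient(theta, lambda x: x * x, 200_000, rng)
exact = 2.0 * theta  # because E[x^2] = theta^2 + 1
print(f"estimate={estimate:.3f}  exact={exact:.3f}")
```

Note that `f` is only ever evaluated, never differentiated, and that a large sample count is needed even in one dimension: that is the variance the rest of the chapter works to reduce.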

REINFORCE

Sampling the theorem’s expectation gives the REINFORCE algorithm (Williams, 1992): roll out episodes under $\pi_\theta$, form the returns-to-go $G_t$, and ascend

$$\widehat{\nabla_\theta J} = \sum_t \nabla_\theta\log\pi_\theta(A_t\mid S_t)\,G_t, \qquad \theta \leftarrow \theta + \alpha\,\widehat{\nabla_\theta J}.$$

It is unbiased and model-free, the policy-space counterpart of Monte Carlo value estimation — and inherits Monte Carlo’s variance.
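A minimal sketch of one REINFORCE update, assuming a tabular softmax policy with logits `theta[s][a]` and an episode stored as `(state, action, reward)` triples. These names and the data layout are illustrative assumptions, not the companion's `reinforce.py`, which uses a PyTorch network.

```python
import math


def returns_to_go(rewards, gamma):
    """G_t = sum_{k >= t} gamma^(k - t) * R_{k+1}, accumulated backwards."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(theta, episode, gamma):
    """Single-episode estimate of sum_t grad log pi(A_t | S_t) * G_t."""
    states, actions, rewards = zip(*episode)
    returns = returns_to_go(list(rewards), gamma)
    grad = [[0.0] * len(row) for row in theta]
    for s, a, g in zip(states, actions, returns):
        probs = softmax(theta[s])
        for j, p in enumerate(probs):
            # Tabular softmax score: 1[j == a] - pi(j | s)
            grad[s][j] += ((1.0 if j == a else 0.0) - p) * g
    return grad


theta = [[0.0, 0.0], [0.0, 0.0]]  # two states, two actions, uniform policy
episode = [(0, 1, 1.0), (1, 0, 2.0)]  # reward is the one received after the action
grad = reinforce_gradient(theta, episode, gamma=0.5)

alpha = 0.1  # step size
theta = [[w + alpha * g for w, g in zip(w_row, g_row)]
         for w_row, g_row in zip(theta, grad)]
```

Each taken action has its logit pushed up in proportion to the return that followed it, and the untaken action is pushed down by the same amount, so the logits in each state stay zero-sum.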

The single most effective fix is a baseline.

Baselines as control variates

Proposition 7.1 (A state baseline is unbiased).

For any function $b(s)$ that does not depend on the action,

$$\mathbb{E}_{\pi_\theta}\!\big[\,\nabla_\theta\log\pi_\theta(A_t\mid S_t)\,b(S_t)\,\big] = 0,$$

so subtracting $b(S_t)$ from the return weight in the policy gradient leaves it unbiased, while choosing $b$ to track the typical return reduces its variance.

Proof.

Condition on $S_t = s$ and average the score over actions:

$$\mathbb{E}_{a\sim\pi_\theta(\cdot\mid s)}\!\big[\nabla_\theta\log\pi_\theta(a\mid s)\big] = \sum_a \pi_\theta(a\mid s)\,\frac{\nabla_\theta\pi_\theta(a\mid s)}{\pi_\theta(a\mid s)} = \sum_a \nabla_\theta\pi_\theta(a\mid s) = \nabla_\theta\sum_a\pi_\theta(a\mid s) = \nabla_\theta 1 = 0.$$

Hence $\mathbb{E}[\nabla_\theta\log\pi_\theta(A_t\mid S_t)\,b(S_t)] = \mathbb{E}[b(S_t)\cdot 0] = 0$: subtracting $b(S_t)$ changes the gradient’s variance but not its mean. The variance-minimizing choice makes the weight a centered quantity; taking $b(s) = v_{\pi_\theta}(s)$ turns the weight into the advantage $A(s,a) = q(s,a) - v(s)$. $\qquad\blacksquare$

A baseline is precisely a control variate: a zero-mean term subtracted to shrink variance without moving the estimate.
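The control-variate claim can be checked exactly, with no sampling noise, on a two-armed bandit. The sketch below assumes a softmax policy over two logits and deterministic rewards of 10 and 11 (numbers picked here for illustration). It enumerates both actions to get the exact mean and variance of the first gradient component, with and without the baseline $b = v$; `gradient_moments` is a hypothetical helper.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_moments(theta, rewards, baseline):
    """Exact mean and variance of (r(a) - b) * d/dtheta_0 log pi(a) under a ~ pi."""
    probs = softmax(theta)
    mean = 0.0
    second_moment = 0.0
    for a, p in enumerate(probs):
        score0 = (1.0 if a == 0 else 0.0) - probs[0]  # softmax score, component 0
        g = (rewards[a] - baseline) * score0
        mean += p * g
        second_moment += p * g * g
    return mean, second_moment - mean * mean


theta = [0.3, -0.2]
rewards = [10.0, 11.0]
probs = softmax(theta)
value = sum(p * r for p, r in zip(probs, rewards))  # b = v, the expected reward

mean_plain, var_plain = gradient_moments(theta, rewards, baseline=0.0)
mean_base, var_base = gradient_moments(theta, rewards, baseline=value)
print(f"no baseline: mean={mean_plain:.4f} var={var_plain:.4f}")
print(f"b = v:       mean={mean_base:.4f} var={var_base:.4f}")
```

The two means agree to floating-point precision while the variance drops by orders of magnitude: with rewards that are both large and close together, almost all of the unbaselined variance comes from the shared offset rather than from the difference between actions.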

With $b = v$, the policy gradient becomes $\mathbb{E}[\sum_t \nabla_\theta\log\pi_\theta(A_t\mid S_t)\,A(S_t,A_t)]$, the form every actor-critic method (Week 8) estimates. Learning the baseline $v$ is what makes it a critic.

The dynamic-programming bridge

Policy gradients invert the value-based pattern. Chapters 4–6 solved a Bellman fixed point and read off a greedy policy; here the policy is the primal object optimized by gradient ascent, and value functions re-enter only as the baseline/critic that tames variance. Three threads carry forward:

  • To actor-critic (Week 8). Learn $v$ (or the advantage directly) as the baseline; the policy is the actor, the value the critic. Generalized advantage estimation tunes the bias–variance of $A$, and trust regions (PPO/TRPO) control the ascent step size in policy space.
  • To continuous control (Week 9). No $\arg\max$ over actions is ever taken, so continuous action spaces are immediate: this is the setting where DDPG, TD3, and SAC live.
  • To optimal control (Part II). Differentiating an expected cost through the dynamics is the discrete, stochastic cousin of the adjoint/Hamiltonian sensitivity of Pontryagin’s principle — direct policy search is trajectory optimization with the model replaced by samples.

What’s next

  • Week 8 learns the baseline as a critic (actor-critic), tunes the advantage with generalized advantage estimation, and adds trust regions (TRPO/PPO) to bound the policy update — turning REINFORCE’s noisy ascent into a stable, sample-reusing optimizer.

Exercises

  1. (Derive) Derive the policy gradient theorem from $J(\theta) = \mathbb{E}_{\tau}[G(\tau)]$ using the log-derivative trick, and show the dynamics terms drop (Theorem 7.1).

    Solution

    $\nabla_\theta J = \int\nabla_\theta p_\theta(\tau)\,G(\tau)\,d\tau = \mathbb{E}_\tau[G(\tau)\,\nabla_\theta\log p_\theta(\tau)]$. Since $\log p_\theta(\tau)$ splits into initial-state, policy, and dynamics terms and only the policy terms carry $\theta$, $\nabla_\theta\log p_\theta(\tau) = \sum_t\nabla_\theta\log\pi_\theta(a_t \mid s_t)$, so the model cancels.

  2. (Prove) Show a state-dependent baseline $b(s)$ leaves the policy gradient unbiased, i.e. $\mathbb{E}[\nabla_\theta\log\pi_\theta(A\mid S)\,b(S)] = 0$ (Prop. 7.1).

    Solution

    $\mathbb{E}_{a\sim\pi_\theta(\cdot\mid s)}[\nabla_\theta\log\pi_\theta(a\mid s)] = \sum_a\nabla_\theta\pi_\theta(a\mid s) = \nabla_\theta\sum_a\pi_\theta(a\mid s) = \nabla_\theta 1 = 0$. Multiplying by $b(s)$ (constant in $a$) and taking the outer expectation over $S$ preserves the zero.

  3. (Compute) For a softmax policy $\pi_\theta(a\mid s) \propto \exp(\theta_a^\top\phi(s))$, compute the score $\nabla_{\theta_j}\log\pi_\theta(a \mid s)$.

    Solution

    $\nabla_{\theta_j}\log\pi_\theta(a\mid s) = \big(\mathbf{1}[j=a] - \pi_\theta(j\mid s)\big)\phi(s)$: the feature, weighted by “taken minus probability.” Averaged over $a\sim\pi_\theta$ this is zero (Prop. 7.1), the discrete face of the expected-score identity.

  4. (Implement) In the companion, verify the returns-to-go computation, that the expected score is zero (the baseline mechanism), that a baseline reduces gradient variance without bias, and that REINFORCE learns CartPole above the random baseline.

    Solution

    See experiments/python/week07/test_reinforce.py: a hand-checked discounted return-to-go; $\sum_a\pi(a\mid s)\,\nabla_\theta\log\pi(a\mid s)\approx 0$ for a softmax network; a bandit where the baselined estimator has strictly lower variance; and a seeded CartPole run whose mean return clears the ~22 random baseline.

  5. (Extend) Add an entropy bonus $\beta\,\mathcal{H}(\pi_\theta(\cdot\mid s))$ to the objective and explain its effect on exploration. (The roadmap’s JAX jax.grad variant is deferred to the dedicated JAX track.)

    Solution

    The entropy bonus rewards less-peaked policies, slowing premature convergence to a deterministic policy and sustaining exploration; its gradient adds $\beta\nabla_\theta \mathcal{H}$, pushing toward higher-entropy distributions. It reappears, promoted from a bonus to the objective, in maximum-entropy RL (Week 9, SAC).
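The score formula from Exercise 3 and the expected-score identity from Exercise 2 can both be checked numerically. This is a standalone sketch, not the companion's test file: it assumes a linear softmax policy with one weight vector per action, compares the analytic score against central finite differences of $\log\pi_\theta(a\mid s)$, and then averages the score over actions. The weights, features, and helper names are arbitrary illustrative choices.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(theta, phi):
    """pi(a | s) proportional to exp(theta[a] . phi), one weight row per action."""
    return softmax([sum(w * x for w, x in zip(row, phi)) for row in theta])


def score(theta, phi, a):
    """Analytic score: row j is (1[j == a] - pi(j | s)) * phi."""
    probs = policy(theta, phi)
    return [[((1.0 if j == a else 0.0) - probs[j]) * x for x in phi]
            for j in range(len(theta))]


def numeric_score(theta, phi, a, eps=1e-6):
    """Central finite differences of log pi(a | s), entry by entry."""
    grad = [[0.0] * len(phi) for _ in theta]
    for j in range(len(theta)):
        for k in range(len(phi)):
            up = [row[:] for row in theta]
            down = [row[:] for row in theta]
            up[j][k] += eps
            down[j][k] -= eps
            grad[j][k] = (math.log(policy(up, phi)[a])
                          - math.log(policy(down, phi)[a])) / (2 * eps)
    return grad


theta = [[0.2, -0.1], [0.0, 0.4], [-0.3, 0.1]]  # 3 actions, 2 features
phi = [1.0, 2.0]

max_error = 0.0
for a in range(len(theta)):
    analytic, numeric = score(theta, phi, a), numeric_score(theta, phi, a)
    for row_a, row_n in zip(analytic, numeric):
        for x, y in zip(row_a, row_n):
            max_error = max(max_error, abs(x - y))

probs = policy(theta, phi)
expected_score = [[sum(probs[a] * score(theta, phi, a)[j][k] for a in range(3))
                   for k in range(2)] for j in range(3)]
```

Here `max_error` should sit at finite-difference noise level and every entry of `expected_score` at floating-point zero, which is the mechanism that lets any state-only baseline pass through the gradient without bias.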

Companion code

The Week-7 companion lives at experiments/python/week07/ and is PyTorch REINFORCE on CartPole-v1, with and without a value baseline.

  • reinforce.py: a softmax PolicyNetwork, the discounted compute_returns (returns-to-go), and a REINFORCE training loop with an optional learned-value baseline and return normalization.
  • test_reinforce.py: component-correctness tests (hand-checked returns-to-go; the expected score $\sum_a\pi\,\nabla\log\pi = 0$ that underlies Prop. 7.1; a bandit where the baselined gradient estimator has strictly lower variance) plus a seeded CartPole run learning well above the random-return baseline.

# component tests + a seeded CartPole learning check (PyTorch; ~1-2 min on CPU)
PYTHONPATH=. pytest experiments/python/week07/test_reinforce.py -q

# worked REINFORCE training run (with the value baseline)
PYTHONPATH=. python experiments/python/week07/reinforce.py --baseline --episodes 800