Self-Distilled Policy Gradient

Yifeng Liu; Shiyuan Zhang; Yifan Zhang; Quanquan Gu

Abstract

Figure: SDPG framework overview. SDPG trains one deployable policy under two views of the same model: an ordinary student and a privileged teacher.

On-policy self-distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. SDPG instantiates this signal as an auxiliary full-vocabulary student-to-teacher reverse KL loss and combines it with group-relative verifier advantages normalized by the group standard deviation and with reference-policy KL regularization. Empirically, SDPG improves stability and performance over RLVR and self-distillation baselines.

\mathcal L_{\mathrm{SDPG}} = \mathcal L_{\mathrm{out}} + \beta(k)\,\mathcal L_{\mathrm{OPD}}^{+} + \alpha\,\mathcal L_{\mathcal K}


Verifier Rewards + Privileged Self-Distillation

SDPG trains one deployable policy under two views of the same model: an ordinary student view that sees only the problem, and a privileged teacher view that additionally sees answer-side context. The verifier remains the final arbiter, while the privileged distribution shapes token-level credit assignment on useful rollouts.

01

Outcome policy gradient

Keeps the binary verifier objective of RLVR and computes group-relative advantages over sampled responses, preserving the selection pressure that helps discover correct solutions.

02

Full-vocabulary OPD

The same model serves as student and privileged teacher. On sampled prefixes, SDPG minimizes $D_{\mathrm{KL}}(p_t\|\mathrm{SG}[q_t])$, giving dense token-level guidance without a separate larger teacher.

03

Policy anchor & gates

Reference-policy KL, positive-advantage gating, and a warmup-decay schedule for $\beta(k)$ keep the privileged signal useful without over-constraining the reasoning policy.

Local policy-gradient view

With the privileged branch detached, reverse-KL OPD has the same fixed-prefix student-side gradient as a detached-sampling update with a centered log teacher/student ratio advantage.

\nabla_\theta D_{\mathrm{KL}}(p_t\|\mathrm{SG}[q_t]) \;\Longleftrightarrow\; \nabla_\theta\,\mathbb E_{a\sim p_t}\!\left[-\log p_t(a)\,\mathrm{SG}\!\left(\log\dfrac{\bar q_t(a)}{\bar p_t(a)}+\bar D_t\right)\right]
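The equivalence can be checked numerically on a toy vocabulary. The sketch below is illustrative only: it assumes the student distribution is a softmax over logits, treats the teacher as a fixed probability vector, and compares a finite-difference gradient of the reverse KL with the analytic gradient of the detached-advantage surrogate. All function names and the four-token example are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(logits, q):
    """D_KL(p || q) with p = softmax(logits) and q a fixed teacher."""
    p = softmax(logits)
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q))

def kl_grad_finite_diff(logits, q, h=1e-6):
    """Central-difference gradient of the reverse KL w.r.t. student logits."""
    grad = []
    for j in range(len(logits)):
        up = list(logits)
        dn = list(logits)
        up[j] += h
        dn[j] -= h
        grad.append((reverse_kl(up, q) - reverse_kl(dn, q)) / (2 * h))
    return grad

def surrogate_grad(logits, q):
    """Gradient of E_{a~p}[-log p(a) * A(a)] with A and the sampling detached.

    A(a) = log(q(a) / p(a)) + D, where D = D_KL(p || q) centers the advantage.
    Uses d(-log p(a)) / dz_j = p_j - 1[a == j] for softmax logits z.
    """
    p = softmax(logits)
    d = sum(pa * math.log(pa / qa) for pa, qa in zip(p, q))
    adv = [math.log(qa / pa) + d for pa, qa in zip(p, q)]
    grad = [0.0] * len(logits)
    for a, (pa, adv_a) in enumerate(zip(p, adv)):
        for j in range(len(logits)):
            indicator = 1.0 if a == j else 0.0
            grad[j] += pa * adv_a * (p[j] - indicator)
    return grad

# Hypothetical four-token example.
student_logits = [0.2, -1.0, 0.7, 0.1]
teacher_probs = softmax([1.0, -0.5, 0.3, -0.2])
g_kl = kl_grad_finite_diff(student_logits, teacher_probs)
g_pg = surrogate_grad(student_logits, teacher_probs)
max_gap = max(abs(x - y) for x, y in zip(g_kl, g_pg))
```

Here `max_gap` should be at the level of finite-difference error, which is consistent with the two gradients agreeing at a fixed prefix.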

What Each Term Contributes

The SDPG loss is deliberately decomposed into sparse selection, dense privileged guidance, and policy anchoring — making the optimization behavior easier to reason about than a black-box auxiliary loss.

1. Group-relative outcome advantage

A_{\mathrm{out}}^{(i)} = \dfrac{R(x,y^{(i)})-\mu_G}{\sigma_G+\epsilon_{\mathrm{std}}}

The verifier supplies sequence-level rewards; normalizing within a group keeps the update comparative — correct rollouts promoted, poor ones suppressed.
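As a minimal sketch of this normalization, the function below computes advantages for one group of verifier rewards. It assumes the population standard deviation and a hypothetical default for `eps_std`; the source does not specify either choice.

```python
import math

def group_relative_advantages(rewards, eps_std=1e-6):
    """Center and scale rewards within one group of sampled responses.

    Assumption: sigma_G is the population standard deviation of the group.
    """
    n = len(rewards)
    mu = sum(rewards) / n
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / n)
    return [(r - mu) / (sigma + eps_std) for r in rewards]

# Two correct and two incorrect rollouts under a binary verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group where every rollout gets the same reward carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

In the mixed group the correct rollouts receive positive advantages and the incorrect ones negative; in the uniform group every advantage is zero, so that prompt contributes nothing to the outcome gradient.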

2. Exact full-vocabulary OPD

\ell^{\mathrm{OPD}}_{i,t} = \sum_{a\in\mathcal V}p_{i,t}(a)\log\dfrac{p_{i,t}(a)}{\mathrm{SG}[q_{i,t}(a)]}

Rather than distilling only the sampled token, SDPG compares the full next-token distributions, giving dense supervision over every vocabulary item.
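A per-token version of this loss can be sketched in plain Python. The function names are illustrative, and the teacher logits are simply treated as constants to stand in for the stop-gradient; a real implementation would use a tensor library and detach the teacher branch.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def opd_token_loss(student_logits, teacher_logits):
    """Reverse KL sum_a p(a) * log(p(a) / q(a)) over the full vocabulary.

    p comes from the student view, q from the privileged teacher view.
    The teacher is treated as a constant here (stand-in for SG[.]).
    """
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Hypothetical two-token vocabulary: p = [0.5, 0.5], q = [0.25, 0.75].
loss = opd_token_loss([0.0, 0.0], [0.0, math.log(3.0)])
same = opd_token_loss([0.3, -0.2], [0.3, -0.2])
```

The loss is zero when the two views agree and positive otherwise, and every vocabulary entry contributes, not only the sampled token.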

3. Positive-advantage gate

m_i=\mathbf 1[A_{\mathrm{out}}^{(i)}>0],\quad \mathcal L_{\mathrm{OPD}}^{+}=\mathbb E\!\left[\sum_{i,t}m_i\,\ell^{\mathrm{OPD}}_{i,t}\right]

Privileged context can produce plausible continuations on a globally wrong trajectory; the gate applies OPD only when the verifier prefers the rollout.
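The gate can be illustrated for a single group as follows. This is a sketch under the assumption that per-token OPD losses have already been computed for each rollout; the function name and the toy numbers are hypothetical, and the outer expectation over prompts is omitted.

```python
def gated_opd_loss(advantages, token_losses):
    """Sum OPD token losses over rollouts with positive outcome advantage.

    advantages:   one A_out value per rollout in the group
    token_losses: one list of per-token OPD losses per rollout
    """
    total = 0.0
    for adv, losses in zip(advantages, token_losses):
        gate = 1.0 if adv > 0 else 0.0  # m_i = 1[A_out > 0]
        total += gate * sum(losses)
    return total

# Only the first rollout has a positive advantage, so only it is distilled.
gated = gated_opd_loss([1.0, -1.0, 0.0], [[0.1, 0.2], [5.0], [7.0]])
```

Rollouts with zero or negative advantage are skipped entirely, however large their token-level divergence from the teacher.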

4. Warmup-decay distillation weight

\beta(k)=\beta_{\mathrm{base}}\min(1,k/T_{\mathrm{warm}})\min(1,(T-k)/T_{\mathrm{decay}})

Early warmup avoids trusting a noisy privileged target too soon; late decay releases the model from inference-unavailable information once the signal is internalized.
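The schedule translates directly into code. In the sketch below the argument names are illustrative, and the values used in the example (`beta_base=1.0`, 40 warmup steps, 100 decay steps) are hypothetical; only the 400-step horizon matches the experiments reported on this page.

```python
def beta_schedule(k, total_steps, beta_base, warm_steps, decay_steps):
    """Warmup-decay weight: beta_base * min(1, k/T_warm) * min(1, (T-k)/T_decay)."""
    warm = min(1.0, k / warm_steps)
    decay = min(1.0, (total_steps - k) / decay_steps)
    return beta_base * warm * decay

# Hypothetical hyperparameters, sampled at a few training steps.
schedule = [beta_schedule(k, 400, 1.0, 40, 100) for k in (0, 20, 200, 350, 400)]
```

With these settings the weight is zero at step 0, rises linearly to its base value by step 40, holds until step 300, and then falls linearly back to zero at step 400.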

Two KL anchors used in SDPG

SDPG evaluates unnormalized KL regularization against a fixed reference policy. The reverse form penalizes squared log drift; the forward form uses an inverse-ratio plus log-ratio term. Both keep the student close enough to the reference that dense distillation does not dominate the reward objective.

URKL surrogate: $\mathcal L_{\mathrm{URKL}} = \mathcal L_{\mathrm{R\&D}}+\alpha\,\mathbb E[\tfrac{1}{2}\log^2\tfrac{\pi_\theta}{\pi_{\mathrm{ref}}}]$

UFKL surrogate: $\mathcal L_{\mathrm{UFKL}} = \mathcal L_{\mathrm{R\&D}}+\alpha\,\mathbb E[\tfrac{\pi_{\mathrm{ref}}}{\pi_\theta}+\log\tfrac{\pi_\theta}{\pi_{\mathrm{ref}}}]$
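The two regularizers can be compared on a single token. The sketch below evaluates only the bracketed terms from the surrogates above, given log-probabilities under the current and reference policies; the function names are illustrative, and how the expectation is estimated (for example, over sampled tokens) is left as an assumption.

```python
import math

def urkl_penalty(logp_theta, logp_ref):
    """Reverse-form term: 0.5 * log^2(pi_theta / pi_ref)."""
    d = logp_theta - logp_ref
    return 0.5 * d * d

def ufkl_penalty(logp_theta, logp_ref):
    """Forward-form term: pi_ref / pi_theta + log(pi_theta / pi_ref)."""
    d = logp_theta - logp_ref
    return math.exp(-d) + d

# No drift from the reference policy.
u0 = urkl_penalty(-1.2, -1.2)
f0 = ufkl_penalty(-1.2, -1.2)
# Small drift in either direction.
f_up = ufkl_penalty(-1.1, -1.2)
f_dn = ufkl_penalty(-1.3, -1.2)
```

Both terms are minimized when the policy matches the reference. As written, the forward-form term equals 1 at that point rather than 0, which is a constant offset that does not affect gradients.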

Figure: Beta schedule (warmup then decay). The distillation coefficient warms up, then decays: it is strongest after initial exploration finds useful trajectories and weaker near the end of training.

Training loop

1. Sample prompts with privileged contexts $(x,c)$ from the training set.
2. Generate a group of responses from the unprivileged policy $\pi_\theta(\cdot\mid x)$.
3. Score each response with the binary verifier and compute group-relative advantages.
4. For positively advantaged rollouts, compute full-vocabulary OPD against the detached privileged distribution.
5. Update the policy with outcome loss, gated OPD, and reference-policy KL regularization.
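The steps above can be sketched as a scalar loss for one prompt's group of rollouts. This is not the authors' implementation: it assumes a plain REINFORCE-style outcome term (no clipping or importance ratios), the reverse-form KL anchor, a population standard deviation, and averaging over the group. Sampling, verification, and the per-token OPD values are taken as given inputs, and all names are hypothetical.

```python
import math

def sdpg_group_loss(rewards, token_logps, ref_token_logps, opd_token_losses,
                    beta, alpha, eps_std=1e-6):
    """Combine outcome, gated OPD, and KL-anchor terms for one group.

    rewards:          binary verifier reward per rollout
    token_logps:      per rollout, log pi_theta of each sampled token
    ref_token_logps:  per rollout, log pi_ref of each sampled token
    opd_token_losses: per rollout, full-vocabulary reverse KL per token
    beta, alpha:      current distillation weight and KL coefficient
    """
    n = len(rewards)
    mu = sum(rewards) / n
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / n)
    advantages = [(r - mu) / (sigma + eps_std) for r in rewards]

    loss_out = 0.0
    loss_opd = 0.0
    loss_kl = 0.0
    rollouts = zip(advantages, token_logps, ref_token_logps, opd_token_losses)
    for adv, logps, ref_logps, opd in rollouts:
        # Assumed outcome term: advantage-weighted negative log-likelihood.
        loss_out += -adv * sum(logps)
        # Positive-advantage gate on the dense distillation signal.
        if adv > 0:
            loss_opd += sum(opd)
        # Assumed anchor: reverse-form squared log-ratio penalty.
        loss_kl += sum(0.5 * (lp - lr) ** 2 for lp, lr in zip(logps, ref_logps))

    total = (loss_out + beta * loss_opd + alpha * loss_kl) / n
    return total, advantages

# Toy group: one correct and one incorrect rollout, no drift from the reference.
loss, advs = sdpg_group_loss(
    rewards=[1.0, 0.0],
    token_logps=[[-0.5, -0.5], [-1.0]],
    ref_token_logps=[[-0.5, -0.5], [-1.0]],
    opd_token_losses=[[0.2, 0.1], [9.0]],
    beta=0.5,
    alpha=0.1,
)
```

In the toy group only the correct rollout passes the gate, so the large OPD value on the incorrect rollout does not enter the loss.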

Stable Gains on Mathematical Reasoning

Experiments use Qwen3 models trained for 400 steps on DAPO-Math-17k and evaluated on AIME2024, AIME2025, and AMC23 with pass@1 mean@32.

Early lift

The accuracy gap between SDPG and GRPO opens within roughly the first 50 steps on the 4B run.

Entropy stability

SDPG-UFKL keeps actor entropy substantially higher throughout training, whereas RLSD's entropy collapses toward zero around step 250.

Shorter reasoning

SDPG response lengths settle between terse collapse and GRPO's more verbose outputs.

Method (Qwen3-4B, 400 steps)	AIME24 Last	AIME24 Best	AIME25 Last	AIME25 Best	AMC23 Last	AMC23 Best
GRPO	0.280	0.316	0.242	0.279	0.714	0.739
RLSD	0.378	0.395	0.300	0.304	0.813	0.813
SDPG-URKL	0.380	0.401	0.307	0.308	0.863	0.863
SDPG-UFKL	0.380	0.408	0.327	0.335	0.858	0.870

Figure: Qwen3-4B training curves. SDPG variants improve AIME and AMC accuracy, reach reward plateaus earlier, and avoid the entropy collapse seen in RLSD.

Figure: Qwen3-1.7B training curves. The results show the same pattern: SDPG outperforms GRPO, RLSD, and a pure self-distillation baseline across benchmarks.

Why the Pieces Matter

Removing the OPD term loses the early-training accuracy advantage on AIME24 and AIME25, confirming that privileged full-vocabulary distillation is the main source of fast convergence on hard tasks. Removing the reference-policy KL can keep accuracy competitive but leads to shortened responses and rising entropy, which indicates weaker control over coherent reasoning.

Figure: Qwen3-4B ablation curves over OPD and KL regularization. Removing $\beta$ removes the dense OPD signal; removing $\alpha$ weakens the reference-policy anchor.

Citation

If you find this work useful, please cite:

@article{liu2026self,
  title  = {Self-Distilled Policy Gradient},
  author = {Liu, Yifeng and Zhang, Shiyuan and Zhang, Yifan and Gu, Quanquan},
  journal= {arXiv preprint arXiv:2606.04036},
  year   = {2026}
}

(* equal contribution, † corresponding author)