LLM Reasoning Post-Training

Self-Distilled Policy Gradient

SDPG combines verifier-based RLVR with exact full-vocabulary on-policy self-distillation, turning privileged reasoning context into dense token-level supervision while preserving reward-driven exploration.

Yifeng Liu*, Shiyuan Zhang*, Yifan Zhang*, Quanquan Gu†
UCLA  ·  Princeton AI Lab  ·  arXiv:2606.04036
RLVR  ·  Self-Distillation  ·  Full-Vocabulary OPD  ·  LLM Reasoning

Abstract

Figure: SDPG framework overview. SDPG trains one deployable policy under two views of the same model: an ordinary student and a privileged teacher.

On-policy self-distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. SDPG instantiates this signal as an auxiliary full-vocabulary student-to-teacher reverse KL loss and combines it with group-relative verifier advantages, standard-deviation normalization, and reference-policy KL regularization. Empirically, SDPG improves stability and performance over RLVR and self-distillation baselines.

$\mathcal L_{\mathrm{SDPG}} = \mathcal L_{\mathrm{out}} + \beta(k)\,\mathcal L_{\mathrm{OPD}}^{+} + \alpha\,\mathcal L_{\mathcal K}$


Verifier Rewards + Privileged Self-Distillation

SDPG trains one deployable policy under two views of the same model: an ordinary student view that sees only the problem, and a privileged teacher view that additionally sees answer-side context. The verifier remains the final arbiter, while the privileged distribution shapes token-level credit assignment on useful rollouts.

01

Outcome policy gradient

Keeps the binary verifier objective of RLVR and computes group-relative advantages over sampled responses, preserving the selection pressure that helps discover correct solutions.

02

Full-vocabulary OPD

The same model serves as student and privileged teacher. On sampled prefixes, SDPG minimizes $D_{\mathrm{KL}}(p_t\|\mathrm{SG}[q_t])$, giving dense token-level guidance without a separate larger teacher.

03

Policy anchor & gates

Reference-policy KL, positive-advantage gating, and a warmup-decay schedule for $\beta(k)$ keep the privileged signal useful without over-constraining the reasoning policy.

Local policy-gradient view

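With the privileged branch detached, reverse-KL OPD has the same fixed-prefix student-side gradient as a detached-sampling policy-gradient update whose advantage is the centered log teacher/student ratio. Reading bars as detached values and $\bar D_t$ as the detached KL value, the advantage $\bar D_t-\log\tfrac{\bar p_t(a)}{\bar q_t(a)}$ equals $\log\tfrac{\bar q_t(a)}{\bar p_t(a)}$ minus its mean under $p_t$.

$\nabla_\theta D_{\mathrm{KL}}(p_t\|\mathrm{SG}[q_t]) \;\Longleftrightarrow\; \nabla_\theta\,\mathbb E_{a\sim p_t}\!\left[-\log p_t(a)\,\mathrm{SG}\!\left(\bar D_t-\log\tfrac{\bar p_t(a)}{\bar q_t(a)}\right)\right]$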
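The equivalence can be checked numerically on a toy vocabulary. The sketch below is illustrative only: it assumes a softmax student over raw logits, a fixed teacher distribution, and hypothetical helper names (`reverse_kl`, `kl_grad_finite_diff`, `surrogate_grad`). It compares a finite-difference gradient of the reverse KL with the exact-expectation gradient of the policy-gradient surrogate.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_logits, teacher_probs):
    """D_KL(p || q) with p = softmax(student_logits); q is held fixed (detached)."""
    p = softmax(student_logits)
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, teacher_probs))

def kl_grad_finite_diff(student_logits, teacher_probs, h=1e-6):
    """Central finite-difference gradient of the reverse KL w.r.t. student logits."""
    grad = []
    for j in range(len(student_logits)):
        up = list(student_logits)
        down = list(student_logits)
        up[j] += h
        down[j] -= h
        grad.append((reverse_kl(up, teacher_probs) - reverse_kl(down, teacher_probs)) / (2 * h))
    return grad

def surrogate_grad(student_logits, teacher_probs):
    """Gradient of E_{a~p}[-log p(a) * A(a)] with sampling and A detached,
    where A(a) = D - log(p(a) / q(a)) is the centered advantage."""
    p = softmax(student_logits)
    log_ratio = [math.log(pa / qa) for pa, qa in zip(p, teacher_probs)]
    d = sum(pa * r for pa, r in zip(p, log_ratio))  # KL value, used as the baseline
    adv = [d - r for r in log_ratio]
    n = len(p)
    grad = [0.0] * n
    for a in range(n):
        for j in range(n):
            dlogp = (1.0 if a == j else 0.0) - p[j]  # d log p(a) / d logit_j
            grad[j] += p[a] * (-dlogp) * adv[a]
    return grad

logits = [0.3, -1.2, 2.0, 0.5]   # toy student logits (assumed values)
teacher = [0.1, 0.2, 0.6, 0.1]   # toy privileged distribution (assumed values)
g_kl = kl_grad_finite_diff(logits, teacher)
g_pg = surrogate_grad(logits, teacher)
print(max(abs(a - b) for a, b in zip(g_kl, g_pg)))
```

Both gradients reduce to $p_t(a)\,(\log\tfrac{p_t(a)}{q_t(a)}-D_t)$ per logit, which is why the two computations agree up to finite-difference error.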

What Each Term Contributes

The SDPG loss is deliberately decomposed into sparse selection, dense privileged guidance, and policy anchoring — making the optimization behavior easier to reason about than a black-box auxiliary loss.

1. Group-relative outcome advantage

$A_{\mathrm{out}}^{(i)} = \dfrac{R(x,y^{(i)})-\mu_G}{\sigma_G+\epsilon_{\mathrm{std}}}$

The verifier supplies sequence-level rewards; normalizing within a group keeps the update comparative — correct rollouts promoted, poor ones suppressed.
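As a minimal sketch of this normalization, the function below computes advantages for one group of binary verifier rewards. The function name, the use of the population standard deviation, and the value of `eps_std` are assumptions made for illustration; the source does not specify them.

```python
import statistics

def group_relative_advantages(rewards, eps_std=1e-6):
    """A_out^(i) = (R_i - mu_G) / (sigma_G + eps_std) for one group of rollouts.

    Assumption: sigma_G is the population standard deviation of the group.
    """
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps_std) for r in rewards]

mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])    # two correct, two wrong
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])  # no rollout is correct
print(mixed, uniform)
```

When every rollout in a group receives the same reward, all advantages are zero, so that group contributes no outcome gradient and no rollout in it has a strictly positive advantage.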

2. Exact full-vocabulary OPD

$\ell^{\mathrm{OPD}}_{i,t} = \sum_{a\in\mathcal V}p_{i,t}(a)\log\dfrac{p_{i,t}(a)}{\mathrm{SG}[q_{i,t}(a)]}$

Rather than distilling only the sampled token, SDPG compares the full next-token distributions, giving dense supervision over every vocabulary item.
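To make the contrast with sampled-token distillation concrete, here is a small sketch on a toy four-token vocabulary. The helper names and logit values are hypothetical, and the teacher is simply treated as a constant, since plain Python has no autograd and therefore nothing to detach.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def full_vocab_opd(student_logits, teacher_logits):
    """Exact reverse KL D_KL(p || q), summed over the whole vocabulary.
    The teacher distribution q plays the role of SG[q]: a fixed target."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def sampled_token_estimate(student_logits, teacher_logits, token):
    """Single-sample estimate log p(a) - log q(a) for one sampled token a."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return log_p[token] - log_q[token]

student = [2.0, 0.5, -1.0, 0.0]   # student view: problem only (toy logits)
teacher = [1.0, 1.5, -2.0, 0.0]   # privileged view of the same model (toy logits)
exact = full_vocab_opd(student, teacher)
per_token = [sampled_token_estimate(student, teacher, a) for a in range(len(student))]
print(exact, per_token)
```

The exact sum is nonnegative, while individual sampled-token estimates can be negative; their average under the student distribution recovers the exact value, so the full-vocabulary form removes the sampling variance of that estimate.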

3. Positive-advantage gate

$m_i=\mathbf 1[A_{\mathrm{out}}^{(i)}>0],\quad \mathcal L_{\mathrm{OPD}}^{+}=\mathbb E\!\left[\sum_{i,t}m_i\,\ell^{\mathrm{OPD}}_{i,t}\right]$

Privileged context can produce plausible continuations on a globally wrong trajectory; the gate applies OPD only when the verifier prefers the rollout.
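The gate can be sketched as a mask over rollouts. In this illustrative snippet the function name and inputs are hypothetical, and the loss for one group is taken as the plain sum over rollouts and tokens, following the formula above; any length or batch normalization would be an additional choice not stated in the source.

```python
def gated_opd_loss(advantages, token_opd):
    """Sum per-token OPD values over rollouts with strictly positive advantage.

    advantages: list of A_out^(i), one per rollout.
    token_opd:  list of lists; token_opd[i][t] is the per-token OPD value.
    """
    total = 0.0
    for adv, per_token in zip(advantages, token_opd):
        gate = 1.0 if adv > 0 else 0.0   # m_i = 1[A_out^(i) > 0]
        total += gate * sum(per_token)
    return total

advantages = [1.0, -1.0, 0.0]
token_opd = [[0.2, 0.3], [5.0], [7.0]]
print(gated_opd_loss(advantages, token_opd))
```

Because the inequality is strict, a rollout with zero advantage is excluded along with the negatively advantaged ones, so only the first rollout contributes here.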

4. Warmup-decay distillation weight

$\beta(k)=\beta_{\mathrm{base}}\min(1,k/T_{\mathrm{warm}})\min(1,(T-k)/T_{\mathrm{decay}})$

Early warmup avoids trusting a noisy privileged target too soon; late decay releases the model from inference-unavailable information once the signal is internalized.
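A direct transcription of the schedule is shown below as a sketch. The total of 400 steps matches the reported experiments, but the values of `beta_base`, `t_warm`, and `t_decay` are placeholders chosen for illustration, and the function assumes $0 \le k \le T$.

```python
def beta_schedule(k, total_steps, beta_base, t_warm, t_decay):
    """beta(k) = beta_base * min(1, k / T_warm) * min(1, (T - k) / T_decay).

    Assumes 0 <= k <= total_steps, so both factors lie in [0, 1].
    """
    warm = min(1.0, k / t_warm)
    decay = min(1.0, (total_steps - k) / t_decay)
    return beta_base * warm * decay

# Placeholder hyperparameters: 50 warmup steps and 100 decay steps.
for step in (0, 25, 200, 350, 400):
    print(step, beta_schedule(step, total_steps=400, beta_base=1.0, t_warm=50, t_decay=100))
```

With these placeholder values the weight rises linearly over the first 50 steps, stays at `beta_base` through step 300, and falls linearly to zero at step 400.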

Two KL anchors used in SDPG

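SDPG evaluates two unnormalized KL regularizers against a fixed reference policy $\pi_{\mathrm{ref}}$. The reverse form (URKL) penalizes squared log drift; the forward form (UFKL) uses an inverse-ratio plus log-ratio term. Both keep the student close enough to the reference that dense distillation does not dominate the reward objective. Matching terms against the full objective, $\mathcal L_{\mathrm{R\&D}}$ corresponds to the reward and distillation part $\mathcal L_{\mathrm{out}}+\beta(k)\,\mathcal L_{\mathrm{OPD}}^{+}$.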

URKL surrogate: $\mathcal L_{\mathrm{URKL}} = \mathcal L_{\mathrm{R\&D}}+\alpha\,\mathbb E[\tfrac{1}{2}\log^2\tfrac{\pi_\theta}{\pi_{\mathrm{ref}}}]$

UFKL surrogate: $\mathcal L_{\mathrm{UFKL}} = \mathcal L_{\mathrm{R\&D}}+\alpha\,\mathbb E[\tfrac{\pi_{\mathrm{ref}}}{\pi_\theta}+\log\tfrac{\pi_\theta}{\pi_{\mathrm{ref}}}]$
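The per-token penalties inside the two expectations can be sketched as follows. The function names are hypothetical, and each function takes the log-probabilities of one token under the current and reference policies; how tokens are sampled and averaged for the expectation is not specified here.

```python
import math

def urkl_penalty(logp_theta, logp_ref):
    """Reverse-form term: 0.5 * log^2(pi_theta / pi_ref)."""
    u = logp_theta - logp_ref
    return 0.5 * u * u

def ufkl_penalty(logp_theta, logp_ref):
    """Forward-form term: pi_ref / pi_theta + log(pi_theta / pi_ref)."""
    u = logp_theta - logp_ref
    return math.exp(-u) + u

# Drift of +0.5 and -0.5 nats in log pi_theta - log pi_ref.
for u in (0.5, 0.0, -0.5):
    print(u, urkl_penalty(u, 0.0), ufkl_penalty(u, 0.0))
```

Both terms are minimized at zero drift. The squared-log term is symmetric in the sign of the drift, whereas the forward-form term has minimum value 1 rather than 0 and grows faster when the policy assigns less probability than the reference.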

Figure: Beta schedule (warmup then decay). The distillation coefficient warms up, then decays: it is strongest after initial exploration finds useful trajectories and weaker near the end of training.

Training loop

  1. Sample prompts with privileged contexts $(x,c)$ from the training set.
  2. Generate a group of responses from the unprivileged policy $\pi_\theta(\cdot\mid x)$.
  3. Score each response with the binary verifier and compute group-relative advantages.
  4. For positively advantaged rollouts, compute full-vocabulary OPD against the detached privileged distribution.
  5. Update the policy with outcome loss, gated OPD, and reference-policy KL regularization.
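Steps 3 to 5 can be sketched as a single scalar computation for one group, under loudly stated assumptions. The `Rollout` container and `sdpg_group_loss` are hypothetical names; the outcome term is a plain REINFORCE-style surrogate (the source does not say whether clipping or importance ratios are used); and the per-token OPD and reference-KL values are assumed to be precomputed. Advantages are treated as constants, as they would be under a stop-gradient.

```python
import statistics
from dataclasses import dataclass
from typing import List

@dataclass
class Rollout:
    reward: float               # binary verifier score for the whole response
    sampled_logps: List[float]  # log pi_theta of each sampled token (student view)
    token_opd: List[float]      # per-token full-vocabulary reverse KL to the privileged view
    token_ref_kl: List[float]   # per-token reference-policy penalty

def sdpg_group_loss(group, beta_k, alpha, eps_std=1e-6):
    """L_out + beta(k) * L_OPD^+ + alpha * L_K for one group (illustrative only)."""
    rewards = [r.reward for r in group]
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # assumption: population std within the group
    l_out = l_opd = l_ref = 0.0
    for r in group:
        adv = (r.reward - mu) / (sigma + eps_std)
        l_out += -adv * sum(r.sampled_logps)  # assumed REINFORCE-style outcome term
        if adv > 0:                           # positive-advantage gate on OPD
            l_opd += sum(r.token_opd)
        l_ref += sum(r.token_ref_kl)
    return l_out + beta_k * l_opd + alpha * l_ref

group = [
    Rollout(1.0, [-0.5, -0.5], [0.1, 0.2], [0.01, 0.01]),  # verified correct
    Rollout(0.0, [-2.0], [9.0], [0.02]),                   # verified wrong
]
print(sdpg_group_loss(group, beta_k=0.5, alpha=0.1))
```

In this toy group the large OPD value on the wrong rollout never enters the loss, which is the behavior the gate is meant to guarantee.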

Stable Gains on Mathematical Reasoning

Experiments use Qwen3 models trained for 400 steps on DAPO-Math-17k and evaluated on AIME2024, AIME2025, and AMC23, reporting pass@1 averaged over 32 sampled responses per problem (mean@32).

Early lift

The accuracy gap between SDPG and GRPO opens within roughly the first 50 steps on the 4B run.

Entropy stability

SDPG-UFKL keeps actor entropy substantially higher, while RLSD collapses toward zero around step 250.

Shorter reasoning

SDPG response lengths settle between terse collapse and GRPO's more verbose outputs.

| Method (Qwen3-4B, 400 steps) | AIME24 Last | AIME24 Best | AIME25 Last | AIME25 Best | AMC23 Last | AMC23 Best |
|---|---|---|---|---|---|---|
| GRPO | 0.280 | 0.316 | 0.242 | 0.279 | 0.714 | 0.739 |
| RLSD | 0.378 | 0.395 | 0.300 | 0.304 | 0.813 | 0.813 |
| SDPG-URKL | 0.380 | 0.401 | 0.307 | 0.308 | 0.863 | 0.863 |
| SDPG-UFKL | 0.380 | 0.408 | 0.327 | 0.335 | 0.858 | 0.870 |

Figure: Qwen3-4B training curves. SDPG variants improve AIME and AMC accuracy, reach reward plateaus earlier, and avoid the entropy collapse seen in RLSD.

Figure: Qwen3-1.7B training curves. The results show the same pattern: SDPG outperforms GRPO, RLSD, and a pure self-distillation baseline across benchmarks.

Why the Pieces Matter

Removing the OPD term loses the early-training accuracy advantage on AIME24 and AIME25, confirming that privileged full-vocabulary distillation is the main source of fast convergence on hard tasks. Removing the reference-policy KL can keep accuracy competitive but leads to shortened responses and rising entropy — weaker control over coherent reasoning.

Figure: Qwen3-4B ablation curves over OPD and KL regularization. Dropping the $\beta$ term removes the dense OPD signal; dropping the $\alpha$ term removes the reference-policy anchor.

Citation

If you find this work useful, please cite:

@article{liu2026self,
  title  = {Self-Distilled Policy Gradient},
  author = {Liu, Yifeng and Zhang, Shiyuan and Zhang, Yifan and Gu, Quanquan},
  journal= {arXiv preprint arXiv:2606.04036},
  year   = {2026}
}

(* equal contribution, † corresponding author)