
On-policy self-distillation, in which a language model conditioned on privileged context supervises its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. SDPG instantiates this signal as an auxiliary full-vocabulary student-to-teacher reverse KL loss and combines it with group-relative verifier advantages, standard-deviation normalization, and reference-policy KL regularization. Empirically, SDPG improves stability and performance over RLVR and self-distillation baselines.
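To make the combination of terms concrete, the following is a minimal sketch of how such a per-token objective could be assembled. It is not the exact SDPG loss. It assumes that "reverse KL" means KL(student || teacher) summed over the full vocabulary, that advantages are centered on the group mean and divided by the group standard deviation, and that the three terms are simply added. The function names and the weights `distill_weight` and `ref_kl_weight` are hypothetical, chosen only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, other_logits):
    """KL(student || other), summed over the full vocabulary at one position."""
    ls = log_softmax(student_logits)
    lo = log_softmax(other_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(ls, lo))


def group_relative_advantages(rewards, eps=1e-6):
    """Center verifier rewards on the group mean and divide by the group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def token_loss(advantage, sampled_token, student_logits, teacher_logits,
               ref_logits, distill_weight=0.1, ref_kl_weight=0.01):
    """Illustrative per-token loss value (assumed additive form, hypothetical weights).

    In an autodiff framework the teacher and reference logits would normally
    be treated as constants (no gradient); this sketch only computes values.
    """
    logp = log_softmax(student_logits)[sampled_token]
    policy_gradient_term = -advantage * logp
    distill_term = reverse_kl(student_logits, teacher_logits)
    ref_term = reverse_kl(student_logits, ref_logits)
    return (policy_gradient_term
            + distill_weight * distill_term
            + ref_kl_weight * ref_term)
```

In this sketch, a group in which every rollout receives the same verifier reward yields zero advantages, so only the distillation and reference-policy terms contribute to the loss value. That is one way to read the claim that self-distillation supplies dense supervision when rewards are sparse.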



