To apply stochastic variance reduction effectively, we need a rigorous definition of the gradient of the KL-regularized objective under off-policy sampling: $$J(\theta) = \mathbb{E}_{x\sim\pi_\theta}[R(x)] - \beta\,\mathrm{KL}(\pi_\theta\,\|\,\pi_{\mathrm{old}})$$
Unnormalized Divergences
In reasoning tasks, we often deal with unnormalized objectives. We derive exact gradients for both Unnormalized Forward KL (UFKL) and Unnormalized Reverse KL (URKL). The Unnormalized Forward KL is defined as:
$$ \text{UKL}(\pi_{\mathrm{old}}\|\pi_\theta) = \int_x \pi_{\mathrm{old}}(x)\log\frac{\pi_{\mathrm{old}}(x)}{\pi_\theta(x)}\,dx + \int_x \bigl(\pi_\theta(x)-\pi_{\mathrm{old}}(x)\bigr)\,dx $$

Differentiable Surrogate Losses
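The UFKL definition above can be sanity-checked numerically on a small discrete support, where the integrals become exact sums. A minimal sketch in plain Python (the measures `p` and `q` are illustrative, not from the text) verifying that the divergence vanishes when the two measures coincide and is strictly positive otherwise, even though neither measure sums to one:

```python
import math

def ufkl(pi_old, pi_theta):
    # UKL(pi_old || pi_theta) = sum pi_old*log(pi_old/pi_theta) + sum(pi_theta - pi_old)
    # The mass-correction term makes each summand non-negative
    # (since p*log(p/q) >= p - q for p, q > 0), so UKL >= 0 without normalization.
    return sum(
        po * math.log(po / pt) + (pt - po)
        for po, pt in zip(pi_old, pi_theta)
    )

# Two unnormalized positive measures on a 3-point support.
p = [0.5, 1.2, 0.3]
q = [0.4, 1.0, 0.6]

print(abs(ufkl(p, p)) < 1e-12)  # True: zero when the measures coincide
print(ufkl(p, q) > 0)           # True: strictly positive when they differ
```

Without the mass-correction integral, the first term alone could be negative for unnormalized measures; the correction restores non-negativity.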
Crucially, for implementation in frameworks like PyTorch, we derive differentiable surrogate losses $\mathcal{L}(\theta)$ such that $\nabla_\theta \mathcal{L}(\theta)$ is an unbiased estimator of $-\nabla_\theta J(\theta)$. For the Unnormalized Reverse KL (URKL), whose surrogate is theoretically equivalent to the $k_3$ estimator used in methods like GRPO, the loss is:
$$ \mathcal{L}_{\mathrm{URKL}}(\theta) = Z_{\mathrm{old}} \mathbb{E}_{x\sim\tilde{\pi}_{\mathrm{old}}}\left[ -w(x)R(x) + \beta\bigl(w(x)\log w(x) - w(x)\bigr) \right] $$

where $w(x) = \frac{\pi_\theta(x)}{\pi_{\mathrm{old}}(x)}$ is the importance weight.
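The claim that $\nabla_\theta \mathcal{L}_{\mathrm{URKL}} = -\nabla_\theta J$ can be verified numerically on a small discrete support, where expectations over $\tilde{\pi}_{\mathrm{old}}$ scaled by $Z_{\mathrm{old}}$ become exact sums against the unnormalized $\pi_{\mathrm{old}}$. A sketch assuming the parametrization $\pi_\theta(x) = e^{\theta_x}$ and illustrative values for $\pi_{\mathrm{old}}$, $R$, and $\beta$ (none taken from the text), comparing central finite-difference gradients of the surrogate and the objective:

```python
import math

PI_OLD = [0.5, 1.2, 0.3]   # unnormalized behavior measure on a 3-point support
R      = [1.0, -0.5, 2.0]  # rewards
BETA   = 0.1               # KL-regularization strength

def pi_theta(theta):
    # Unnormalized parametrization pi_theta(x) = exp(theta_x).
    return [math.exp(t) for t in theta]

def J(theta):
    # J = sum_x pi_theta * R - beta * URKL(pi_theta || pi_old)
    pt = pi_theta(theta)
    urkl = sum(p * math.log(p / po) + (po - p) for p, po in zip(pt, PI_OLD))
    return sum(p * r for p, r in zip(pt, R)) - BETA * urkl

def L_urkl(theta):
    # Surrogate: sum_x pi_old(x) * [ -w R + beta (w log w - w) ], w = pi_theta/pi_old.
    pt = pi_theta(theta)
    total = 0.0
    for p, po, r in zip(pt, PI_OLD, R):
        w = p / po
        total += po * (-w * r + BETA * (w * math.log(w) - w))
    return total

def grad(f, theta, eps=1e-6):
    # Central finite-difference gradient.
    g = []
    for i in range(len(theta)):
        up, dn = theta[:], theta[:]
        up[i] += eps
        dn[i] -= eps
        g.append((f(up) - f(dn)) / (2 * eps))
    return g

theta = [0.2, -0.1, 0.4]
gL = grad(L_urkl, theta)
gJ = grad(J, theta)
print(all(abs(a + b) < 1e-6 for a, b in zip(gL, gJ)))  # True: grad L = -grad J
```

The two differ only by the constant $-\beta Z_{\mathrm{old}}$ (the $\int \pi_{\mathrm{old}}$ term is independent of $\theta$), which is why their gradients agree exactly.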