Target Policy Optimization
Abstract
In RL, given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy-gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce \emph{Target Policy Optimization} (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution and fits the policy to it by cross-entropy. The loss gradient on sampled-completion logits is $\pi_\theta - \pi^\star$ (policy minus target), which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at https://github.com/JeanKaddour/tpo.
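The fit-to-target step can be sketched in a few lines. The sketch below is a minimal tabular illustration, not the paper's implementation: the target construction (a softmax over rewards with a hypothetical temperature `tau`) is an assumption, since the abstract does not specify how TPO builds its target. What the sketch does show is the stated gradient property: the cross-entropy gradient on logits is policy minus target, so it vanishes once the policy matches the target.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def target_from_scores(rewards, tau=1.0):
    # Hypothetical target construction: softmax over rewards at temperature tau.
    # The abstract does not specify TPO's target; this is one plausible choice.
    return softmax(np.asarray(rewards, dtype=float) / tau)

def tpo_logit_gradient(logits, target):
    # Gradient of CE(target, softmax(logits)) w.r.t. the logits.
    # It equals pi - target, and vanishes when the policy matches the target.
    return softmax(logits) - target

# Toy check: fit 3-arm logits to a reward-derived target by gradient descent.
rewards = [1.0, 0.0, 0.5]
target = target_from_scores(rewards)
logits = np.zeros(3)
for _ in range(2000):
    logits -= 0.5 * tpo_logit_gradient(logits, target)

assert np.allclose(softmax(logits), target, atol=1e-4)
assert np.allclose(tpo_logit_gradient(logits, target), 0.0, atol=1e-4)
```

Note that unlike a policy-gradient update, the step size here only controls how fast the policy reaches the target, not where it ends up.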
Source: arXiv:2604.06159v1 (https://arxiv.org/abs/2604.06159v1)