Research Paper · Researchia: 202604.08019 · [Data Science > Machine Learning]

Target Policy Optimization

Jean Kaddour

Abstract

In RL, given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy-gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce \emph{Target Policy Optimization} (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution $q_i \propto p_i^{\,\mathrm{old}} \exp(u_i)$ and fits the policy to it by cross-entropy. The loss gradient on sampled-completion logits is $p_\theta - q$, which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at https://github.com/JeanKaddour/tpo.


Source: arXiv:2604.06159v1 (http://arxiv.org/abs/2604.06159v1) · PDF: https://arxiv.org/pdf/2604.06159v1

Submission: 4/8/2026
Subjects: Machine Learning; Data Science
