Research Paper Researchia:202604.08019

Target Policy Optimization

Jean Kaddour


Submitted: April 8, 2026
Subjects: Machine Learning; Data Science

Description / Details

In RL, given a prompt, we sample a group of completions from a model and score them. Two questions follow: which completions should gain probability mass, and how should the parameters move to realize that change? Standard policy-gradient methods answer both at once, so the update can overshoot or undershoot depending on the learning rate, clipping, and other optimizer choices. We introduce Target Policy Optimization (TPO), which separates the two questions. Given scored completions, TPO constructs a target distribution $q_i \propto p_i^{\mathrm{old}} \exp(u_i)$ and fits the policy to it by cross-entropy. The loss gradient on sampled-completion logits is $p^{\theta} - q$, which vanishes once the policy matches the target. On tabular bandits, transformer sequence tasks, and billion-parameter LLM RLVR, TPO matches PG, PPO, GRPO, and DG on easy tasks and substantially outperforms them under sparse reward. Code is available at https://github.com/JeanKaddour/tpo.
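The target construction and the $p^{\theta} - q$ gradient described above can be illustrated with a small NumPy sketch. It is not taken from the linked repository. It assumes the policy over one sampled group is a softmax over per-completion logits, and that the utilities `u` are already given, since the abstract does not say how they are derived from the scores. The function names and all numeric values are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def tpo_target(logp_old, u):
    """Target q_i proportional to p_i^old * exp(u_i), normalized over the group."""
    return softmax(logp_old + u)

def tpo_loss_and_grad(logits, q):
    """Cross-entropy between target q and softmax(logits), with its logit gradient."""
    p = softmax(logits)
    loss = -np.sum(q * np.log(p))
    grad = p - q  # gradient of the cross-entropy with respect to the logits
    return loss, grad

# Hypothetical group of four sampled completions.
logits_old = np.array([0.5, 0.0, -0.5, 0.2])  # assumed old-policy logits
u = np.array([1.0, -1.0, 0.0, 0.5])           # assumed utilities from scores

p_old = softmax(logits_old)
q = tpo_target(np.log(p_old), u)

# Plain gradient descent on the logits; the update shrinks as p approaches q.
logits = logits_old.copy()
for _ in range(2000):
    loss, grad = tpo_loss_and_grad(logits, q)
    logits -= 1.0 * grad

p_final = softmax(logits)
print("target q:   ", np.round(q, 4))
print("fitted p:   ", np.round(p_final, 4))
print("max |p - q|:", np.abs(p_final - q).max())
```

Because the gradient is exactly $p^{\theta} - q$, the step size only affects how quickly the policy reaches the target, not where it ends up, at least in this idealized setting where the logits are free parameters.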


Source: arXiv:2604.06159v1 (http://arxiv.org/abs/2604.06159v1)
PDF: https://arxiv.org/pdf/2604.06159v1
