Research Paper (Researchia:202605.20053)

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Utkarsh Tyagi

Submitted: May 20, 2026
Subjects: AI; Artificial Intelligence

Description / Details

Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations conflate a criterion's human-assigned importance with its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins 24 of 30 base-policy/metric comparisons, improving both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion) over vanilla GRPO with rubric rewards, and reaches the same plateau in 2.5 to 4× fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.
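To make the idea of rollout-level contrast concrete, the following is a minimal sketch and not the POW3R algorithm itself: the abstract does not give the weighting formula, so the rule used here (human weight multiplied by the variance of a criterion's score across the rollout group) is an assumption chosen for illustration. The names `variance`, `contrast_weights`, `aggregate`, and `group_advantages`, as well as the example scores and weights, are hypothetical.

```python
# Illustrative sketch only. The contrast rule below (human weight times
# across-rollout variance) is an assumption, not the formula from the paper.

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def contrast_weights(scores, human_weights):
    """Scale each human weight by how much that criterion varies across rollouts.

    scores[i][j] is the grade of rollout i on criterion j (0.0 or 1.0 here).
    Criteria that every rollout passes (saturated) or fails (unreachable)
    have zero variance and therefore receive zero training weight.
    """
    n_criteria = len(human_weights)
    contrast = [variance([row[j] for row in scores]) for j in range(n_criteria)]
    raw = [w * c for w, c in zip(human_weights, contrast)]
    total = sum(raw)
    if total == 0:
        # No criterion separates the rollouts: fall back to normalized human weights.
        base = sum(human_weights)
        return [w / base for w in human_weights]
    return [r / total for r in raw]


def aggregate(scores, weights):
    """Weighted sum of criterion scores, giving one scalar reward per rollout."""
    return [sum(w * s for w, s in zip(weights, row)) for row in scores]


def group_advantages(rewards):
    """GRPO-style group normalization: subtract the group mean, divide by the group std."""
    mean = sum(rewards) / len(rewards)
    std = variance(rewards) ** 0.5
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Hypothetical rubric with three criteria, weighted 5, 3, and 1 by a human author.
human_weights = [5.0, 3.0, 1.0]

# Four rollouts for one prompt. Criterion 0 is saturated (all pass);
# criteria 1 and 2 vary across the group.
scores = [
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]

weights = contrast_weights(scores, human_weights)
rewards = aggregate(scores, weights)
advantages = group_advantages(rewards)
```

In this toy group the saturated criterion, despite carrying the largest human weight, ends up with zero training weight, while the two criteria that separate the rollouts share the remainder. In keeping with the abstract, such adapted weights would shape only the training reward; evaluation would still use the original human weights.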


Source: arXiv:2605.20164v1 (http://arxiv.org/abs/2605.20164v1)
PDF: https://arxiv.org/pdf/2605.20164v1
