Research Paper (Researchia:202606.30005)

Pessimism's Paradox: Conservative Offline Training Amplifies Reward Hacking During Online Adaptation in Reasoning Models

Subramanyam Sahoo


Submitted: June 30, 2026
Subjects: Machine Learning; Data Science

Description / Details

Conservative offline training is widely advocated as a safe foundation for subsequent online adaptation: if a policy stays close to well-supported behaviour, the argument goes, it is less likely to exploit imperfections in a learned reward model. We challenge this intuition empirically and mechanistically. We train a Qwen3-14B policy under Direct Preference Optimisation (DPO) with three levels of conservatism ($β \in \{β_{\mathrm{lo}}, β_{\mathrm{mid}}, β_{\mathrm{hi}}\}$, derived from empirical log-ratio percentiles), then adapt each checkpoint online against a learned reward ensemble ($3\times$ Qwen3-1.7B) while measuring true performance by GSM8K exact-answer accuracy. We find that higher offline conservatism monotonically increases reward-hacking damage, measured by the Goodhart gap and its area under the curve (AUGC), with Spearman $ρ = 1.0$ across all three conditions. Mechanistic analysis reveals a three-link causal chain: (i) high-$β$ DPO compresses policy entropy; (ii) low-entropy policies generate responses with reduced diversity, concentrating in a narrow region of the reward model's training distribution (lower pairwise cosine distance); and (iii) despite this proximity, ensemble disagreement (epistemic uncertainty) increases with $β$ and is exploited faster during online optimisation. We further fit a power-law curve to the $(β, \mathrm{AUGC})$ data and identify a practical optimal conservatism level $β^{\star}$ that balances alignment fidelity against hacking vulnerability. Our results suggest that the field needs calibrated, not maximal, conservatism.
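
The abstract names its metrics without defining them, so the following is only a rough sketch of how they could be computed, not the paper's implementation. It assumes the Goodhart gap is the proxy (ensemble) reward minus true accuracy at each online step, that AUGC is the trapezoidal area under that gap, and that ensemble disagreement is the population standard deviation of the member scores. All function names are hypothetical.

```python
from statistics import pstdev


def goodhart_gap(proxy_reward, true_score):
    """Per-step gap, ASSUMED here to be proxy reward minus true score."""
    return [p - t for p, t in zip(proxy_reward, true_score)]


def augc(steps, gaps):
    """Area under the gap curve via the trapezoidal rule (an assumption)."""
    area = 0.0
    for i in range(1, len(steps)):
        area += 0.5 * (gaps[i] + gaps[i - 1]) * (steps[i] - steps[i - 1])
    return area


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def ensemble_disagreement(member_scores):
    """Spread of reward-model scores for one response (ASSUMED: population std)."""
    return pstdev(member_scores)


if __name__ == "__main__":
    # Toy values for illustration only; they are NOT results from the paper.
    steps = [0, 100, 200]
    gaps = goodhart_gap([0.5, 0.7, 0.9], [0.5, 0.5, 0.4])
    print(augc(steps, gaps))
    print(ensemble_disagreement([0.2, 0.6, 0.7]))
```

With only three conditions, a Spearman coefficient of 1.0 means the ordering of the AUGC values matches the ordering of $β$ exactly; `spearman_rho` returns that value for any three strictly increasing pairs.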


Source: arXiv:2606.30627v1 (http://arxiv.org/abs/2606.30627v1)
PDF: https://arxiv.org/pdf/2606.30627v1
