
How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Chu-Cheng Lin

Abstract

Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis $q$-logarithm, we define a loss family $J_Q$ that interpolates between RLVR (at $q{=}0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification $P_\theta^{-q}$ that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires $\Omega\big(\frac{1}{p_0}\big)$ time to escape cold start, while the density-estimation pole escapes in $\Theta\big(\log\frac{1}{p_0}\big)$; intermediate $q$ trades escape speed against noise memorization. Because $P_\theta$ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias $O\big(\frac{q}{M P_\theta^{q+1}}\big)$; GARL has lower variance, while PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at $q{=}0.75$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates on FinQA, where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at $q{=}0.75$ provides stable gradients (best overall on HotPotQA at 47.9 maj@16, $+14.4$ over GRPO).
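
The listing does not reproduce the loss family itself; the block below is a reconstruction from the abstract's definitions. The Tsallis $q$-logarithm is standard, while the exact form of $J_Q$ and the notation $P_\theta(y^\star \mid x)$ for the marginal probability of the verified output $y^\star$ are assumptions, chosen to be consistent with the two poles, the $P_\theta^{-q}$ amplification, and the quoted escape rates:

```latex
% Tsallis q-logarithm (standard definition):
\ln_q(x) =
  \begin{cases}
    \dfrac{x^{1-q}-1}{1-q}, & q \neq 1, \\[6pt]
    \ln x,                  & q = 1.
  \end{cases}
% Assumed form of the family (consistent with both poles):
J_Q(\theta) = -\,\mathbb{E}_x\!\left[\ln_q P_\theta(y^\star \mid x)\right],
\qquad
\nabla_\theta \ln_q P_\theta = P_\theta^{-q}\,\nabla_\theta P_\theta,
% so q = 0 recovers \nabla P_\theta (RLVR / expected success, the
% exploitation pole) and q = 1 recovers \nabla \log P_\theta
% (log-marginal-likelihood, the density-estimation pole).
%
% Toy escape-time check under gradient flow, assuming
% \|\nabla_\theta \log P_\theta\| = \Theta(1) near cold start:
\dot{\theta} = -\nabla_\theta J_Q
\;\Rightarrow\;
\dot{P}_\theta = P_\theta^{-q}\,\|\nabla_\theta P_\theta\|^2
\;\propto\; P_\theta^{\,2-q},
% which gives escape time on the order of 1/p_0 at q = 0 and
% log(1/p_0) at q = 1, matching the rates quoted in the abstract.
```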

Submitted: April 29, 2026
Subjects: Artificial Intelligence (AI)

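The two Monte Carlo estimators named in the abstract follow from two factorizations of the amplified gradient: $P_\theta^{-q}\,\nabla P_\theta$ estimated with prior samples (GARL), or equivalently $P_\theta^{1-q}\,\nabla \log P_\theta$ estimated with posterior samples (PAFT). The sketch below is a minimal toy reconstruction under those definitions, with a categorical policy over $K$ candidate trajectories and a binary verifier standing in for a reasoning model; all names (garl_grad, paft_grad, M, q) are illustrative, not the paper's code.

```python
# Hedged toy sketch of the GARL and PAFT gradient estimators, reconstructed
# from the abstract's definitions; not the paper's implementation.
import torch

torch.manual_seed(0)
K, M, q = 10, 256, 0.75
logits = torch.randn(K, requires_grad=True)   # policy parameters theta
reward = (torch.arange(K) < 2).float()        # binary verifier: 2 of K trajectories succeed

def sample_prior(m):
    """Draw m latent trajectories z ~ pi_theta (the prior)."""
    return torch.distributions.Categorical(logits=logits).sample((m,))

def garl_grad():
    """GARL: sample from the prior and amplify the RL (score-function)
    gradient of P_theta by a Monte Carlo estimate of P_theta^{-q}."""
    z = sample_prior(M)
    logp = torch.log_softmax(logits, -1)[z]
    r = reward[z]
    p_hat = r.mean().clamp_min(1e-6)          # MC estimate of P_theta
    surrogate = (r * logp).mean()             # REINFORCE surrogate for grad P_theta
    return torch.autograd.grad(p_hat ** (-q) * surrogate, logits)[0]

def paft_grad():
    """PAFT: importance-resample successful trajectories (the posterior under
    a binary verifier) and take a standard SFT gradient, attenuated by
    P_theta^{1-q}, since P^{-q} grad P = P^{1-q} grad log P."""
    z = sample_prior(M)
    keep = z[reward[z] > 0]                   # posterior = prior restricted to r = 1
    if keep.numel() == 0:
        return torch.zeros_like(logits)       # cold start: no successes in the batch
    p_hat = reward[z].mean().clamp_min(1e-6)
    sft = torch.log_softmax(logits, -1)[keep].mean()   # SFT on verified successes
    return torch.autograd.grad(p_hat ** (1 - q) * sft, logits)[0]

# Both estimate the same direction, P_theta^{-q} * grad P_theta:
print(garl_grad())
print(paft_grad())
```

In this reconstruction, the abstract's bias bound $O\big(\frac{q}{M P_\theta^{q+1}}\big)$ would arise from plugging the Monte Carlo estimate $\hat{P}$ into the nonlinear amplification $\hat{P}^{-q}$, which is consistent with the bound vanishing at the RLVR pole $q{=}0$.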


Source: arXiv:2604.25907v1 - http://arxiv.org/abs/2604.25907v1
PDF: https://arxiv.org/pdf/2604.25907v1

