Research Paper · Researchia: 202604.17051

TIP: Token Importance in On-Policy Distillation

Yuanda Xu


Submitted: April 17, 2026
Subjects: Artificial Intelligence (AI)

Description / Details

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher-student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining 50% of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to 47%. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than 10% of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher-student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher-student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on <20% of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.
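The two-region selection rule described above can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the paper's implementation: the quantile thresholds, the direction of the KL divergence, and the function name `tip_token_mask` are all assumptions chosen to make the idea concrete. It keeps tokens whose student entropy is high, plus low-entropy tokens where the teacher-student divergence is high (the "overconfident and wrong" region).

```python
import numpy as np

def tip_token_mask(student_logits, teacher_logits,
                   entropy_quantile=0.5, divergence_quantile=0.9):
    """Sketch of a TIP-style token mask (hypothetical thresholds).

    Keeps (a) tokens above the student-entropy quantile and
    (b) low-entropy tokens above the teacher-student KL quantile.
    Inputs are per-position logits of shape [T, V].
    """
    def log_softmax(x):
        # numerically stable log-softmax over the vocab axis
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    s_p = np.exp(s_logp)
    t_p = np.exp(t_logp)

    # per-token student entropy and KL(teacher || student)
    entropy = -(s_p * s_logp).sum(axis=-1)
    kl = (t_p * (t_logp - s_logp)).sum(axis=-1)

    high_entropy = entropy >= np.quantile(entropy, entropy_quantile)
    overconfident = (~high_entropy) & (kl >= np.quantile(kl, divergence_quantile))
    return high_entropy | overconfident
```

In an OPD training loop, a mask like this would gate the per-token distillation loss, so backpropagation (and the associated activation memory) is spent only on the selected positions.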


Source: arXiv:2604.14084v1 - http://arxiv.org/abs/2604.14084v1
PDF: https://arxiv.org/pdf/2604.14084v1
