Research Paper
Researchia:202605.18006

Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation

Jin Shi

Submitted: May 18, 2026
Subjects: AI; Artificial Intelligence

Description / Details

Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control. We introduce VLA-AD, a distillation framework that uses a Vision-Language Model (VLM) as an offline semantic supervisor to transfer large VLA teachers into lightweight student policies. Instead of relying only on low-level action imitation, VLA-AD augments teacher-provided 7-DoF action targets with high-level semantic guidance, including task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time, the student policy runs independently, with neither the VLA teacher nor the VLM required. We evaluate VLA-AD on three LIBERO benchmark suites. Using OpenVLA-7B as the teacher, our method produces a 158M-parameter student, yielding a 44× reduction in model size while matching the teacher with only a 0.27% average relative gap. The resulting policy runs at 12.5 Hz on an RTX 4090, achieving a 3.28× inference speedup over OpenVLA-7B. We further show that the same semantic distillation pipeline generalizes to a different π0.5-4B teacher, where the student outperforms the teacher on two suites and remains within 0.53% on libero_goal. Additional analysis indicates that phase-level supervision and multi-frame directional cues make the student less sensitive to noisy teacher actions, such as erroneous high-frequency gripper changes. Overall, VLA-AD demonstrates that offline semantic guidance from VLMs can substantially improve the efficiency, robustness, and deployability of VLA policy distillation.
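
The efficiency figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative rather than taken from the paper. It assumes the teacher has a nominal 7 billion parameters (the abstract gives only the "7B" name), and it assumes "relative gap" means (teacher - student) / teacher, a definition the abstract does not spell out. The function and constant names are hypothetical.

```python
# Sanity check of the efficiency numbers reported in the abstract.
# Assumption: "7B" is taken as exactly 7e9 parameters; the true count may differ.
TEACHER_PARAMS = 7e9      # OpenVLA-7B (nominal size)
STUDENT_PARAMS = 158e6    # distilled student, from the abstract
STUDENT_HZ = 12.5         # student control rate on an RTX 4090, from the abstract
SPEEDUP = 3.28            # reported inference speedup over the teacher


def compression_ratio(teacher_params: float, student_params: float) -> float:
    """How many times smaller the student is than the teacher."""
    return teacher_params / student_params


def implied_teacher_hz(student_hz: float, speedup: float) -> float:
    """Teacher control rate implied by the student rate and the speedup."""
    return student_hz / speedup


def relative_gap_percent(teacher_score: float, student_score: float) -> float:
    """Relative gap in percent, assuming gap = (teacher - student) / teacher.

    The abstract reports no absolute success rates, so this helper is
    defined here but not evaluated on paper data.
    """
    return 100.0 * (teacher_score - student_score) / teacher_score


ratio = compression_ratio(TEACHER_PARAMS, STUDENT_PARAMS)
teacher_hz = implied_teacher_hz(STUDENT_HZ, SPEEDUP)
student_latency_ms = 1000.0 / STUDENT_HZ
teacher_latency_ms = 1000.0 / teacher_hz

print(f"compression ratio: {ratio:.2f}x")               # about 44.30x
print(f"implied teacher rate: {teacher_hz:.2f} Hz")     # about 3.81 Hz
print(f"student latency: {student_latency_ms:.1f} ms")  # 80.0 ms
print(f"teacher latency: {teacher_latency_ms:.1f} ms")  # about 262.4 ms
```

Under these assumptions the 44× figure is consistent with 7e9 / 158e6 ≈ 44.3, and the 3.28× speedup implies a teacher rate of roughly 3.8 Hz, or about 262 ms per action compared with 80 ms for the student.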


Source: arXiv:2605.16241v1 (http://arxiv.org/abs/2605.16241v1)
PDF: https://arxiv.org/pdf/2605.16241v1
