ExplorerRoboticsRobotics
Research PaperResearchia:202605.07055

From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

Yihan Lin

Abstract

Latent actions serve as an intermediate representation that enables consistent modeling of vision-language-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs with latent actions are fragmented and lack a systematic comparison. This work structures the study of latent action supervision from two perspectives: (i) regularizing the trajectory via image-based latent actions, and (ii) unifying the target space with action-based latent actions. Under a unified V...

Submitted: May 7, 2026Subjects: Robotics; Robotics

Description / Details

Latent actions serve as an intermediate representation that enables consistent modeling of vision-language-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs with latent actions are fragmented and lack a systematic comparison. This work structures the study of latent action supervision from two perspectives: (i) regularizing the trajectory via image-based latent actions, and (ii) unifying the target space with action-based latent actions. Under a unified VLA baseline, we instantiate and compare four representative integration strategies. Our results reveal a formulation-task correspondence: image-based latent actions benefit long-horizon reasoning and scene-level generalization, whereas action-based latent actions excel at complex motor coordination. Furthermore, we find that directly supervising the VLM with discrete latent action tokens yields the most effective performance. Finally, our experiments offer initial insights into the benefits of latent action supervision in mixed-data, suggesting a promising direction for VLA training. Code is available at https://github.com/RUCKBReasoning/From_Pixels_to_Tokens.


Source: arXiv:2605.04678v1 - http://arxiv.org/abs/2605.04678v1 PDF: https://arxiv.org/pdf/2605.04678v1 Original Link: http://arxiv.org/abs/2605.04678v1

Please sign in to join the discussion.

No comments yet. Be the first to share your thoughts!

Access Paper
View Source PDF
Submission Info
Date:
May 7, 2026
Topic:
Robotics
Area:
Robotics
Comments:
0
Bookmark
From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models | Researchia