Research Paper | Researchia:202603.13090 [Robotics > Robotics]

SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics

Mengzhen Liu

Abstract

Active perception and manipulation are crucial for robots to interact with complex scenes. Existing methods struggle to unify semantic-driven active perception with robust, viewpoint-invariant execution. We propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Our approach decouples camera and manipulation actions rather than placing them in a shared action space, and follows a bottom-up training strategy: we first train semantic camera control on a large-scale dataset, then jointly optimize both action types using hybrid data. To support this framework, we introduce ActiveViewPose-200K, a dataset of 200k image-language-camera movement pairs for semantic camera movement learning, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We also present ActiveManip-Bench, the first benchmark for evaluating active manipulation beyond fixed-view settings. Extensive experiments in both simulation and real-world environments show that SaPaVe outperforms recent vision-language-action models such as GR00T N1 and π0, achieving up to 31.25% higher success rates in real-world tasks. These results show that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation. Project page: https://lmzpai.github.io/SaPaVe
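
The decoupled action space and the bottom-up training order described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class and function names (`DecoupledAction`, `Stage`, `bottom_up_schedule`, `stage_loss`), the stage labels, the pairing of stage 1 with ActiveViewPose-200K, and the unweighted loss sum are all assumptions made for illustration, since the abstract states only the ordering (camera control first, then joint optimization on hybrid data).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DecoupledAction:
    # Two separate outputs instead of one concatenated action vector.
    # The dimensionality of each field is an illustrative assumption.
    camera: Tuple[float, ...]
    manipulation: Tuple[float, ...]


@dataclass(frozen=True)
class Stage:
    name: str                       # hypothetical stage label
    data: str                       # data source named in the abstract
    trained_heads: Tuple[str, ...]  # which action heads receive a loss


def bottom_up_schedule() -> List[Stage]:
    # Stage 1: semantic camera control only.
    # Stage 2: both action types, jointly, on hybrid data
    # (the composition of the hybrid data is not given in the abstract).
    return [
        Stage("camera_pretraining", "ActiveViewPose-200K", ("camera",)),
        Stage("joint_optimization", "hybrid data", ("camera", "manipulation")),
    ]


def stage_loss(stage: Stage, camera_loss: float, manipulation_loss: float) -> float:
    # Sum the losses of the heads trained in this stage.
    # An unweighted sum is an assumption; the paper may weight the terms.
    losses = {"camera": camera_loss, "manipulation": manipulation_loss}
    return sum(losses[head] for head in stage.trained_heads)
```

In this sketch, the manipulation loss simply does not contribute during the first stage, which is one way to read "first train semantic camera control, then jointly optimize both action types".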


Source: arXiv:2603.12193v1 - http://arxiv.org/abs/2603.12193v1
PDF: https://arxiv.org/pdf/2603.12193v1

Submission: 3/13/2026
Subjects: Robotics