Research Paper
Researchia:202606.11080

VLGA: Vision-Language-Geometry-Action Models for Autonomous Driving

Jin Yao

Submitted: June 11, 2026
Subjects: Robotics

Description / Details

Vision-language-action (VLA) models can describe scenes and reason about them in language, yet still struggle to ground their actions in the dense 3D world around them. Existing approaches either inject features from a frozen 3D foundation model without an objective that ensures the policy uses them, or constrain geometry with sparse box and map losses that provide no dense spatial signal. We introduce VLGA, the first vision-language-action model supervised to reconstruct the dense 3D world it drives through. VLGA adds geometry as a fourth modality alongside vision, language, and action through a dedicated expert supervised by a per-pixel pointmap regression loss against LiDAR. Extensive experiments on the challenging nuScenes and Bench2Drive datasets, for open-loop and closed-loop evaluation respectively, show that VLGA outperforms comparable VLA methods. In particular, on open-loop nuScenes, VLGA sets a new state of the art among VLA methods that do not use ego status, with the lowest L2 error (0.50 m average) and the lowest 3-second collision rate (0.18%). On closed-loop Bench2Drive, VLGA attains a state-of-the-art driving score of 79.08, +0.71 over the strongest prior VLA, at comparable efficiency and comfort.
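
To make the idea of a per-pixel pointmap regression loss against LiDAR concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not state the loss form, so the L1 distance, the masking of pixels without a LiDAR return, and the averaging over valid pixels are all assumptions made here for illustration. The function name `masked_pointmap_l1` and the toy data are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]
# A pointmap is an H x W grid; each cell holds an (x, y, z) point, or None
# where no LiDAR return projects onto that pixel (assumed masking scheme).
PointMap = Sequence[Sequence[Optional[Point]]]


def masked_pointmap_l1(pred: Sequence[Sequence[Point]], lidar: PointMap) -> float:
    """Mean L1 distance between predicted and LiDAR points over valid pixels.

    Hypothetical helper: the L1 norm and the masking are assumptions,
    not details taken from the VLGA abstract.
    """
    total = 0.0
    valid = 0
    for pred_row, lidar_row in zip(pred, lidar):
        for p, q in zip(pred_row, lidar_row):
            if q is None:  # pixel has no LiDAR supervision; skip it
                continue
            total += sum(abs(a - b) for a, b in zip(p, q))
            valid += 1
    # Return 0.0 when no pixel is supervised, to avoid dividing by zero.
    return total / valid if valid else 0.0


# Toy 2 x 2 example: two of the four pixels carry a LiDAR target.
pred = [[(1.0, 0.0, 5.0), (2.0, 0.0, 6.0)],
        [(1.0, 1.0, 5.0), (2.0, 1.0, 6.0)]]
lidar = [[(1.0, 0.0, 5.5), None],
         [None, (2.0, 1.0, 7.0)]]
loss = masked_pointmap_l1(pred, lidar)
```

In this sketch every supervised pixel contributes its own 3D error term, which is what distinguishes a dense pointmap objective from the sparse box and map losses the abstract contrasts it with. A real training setup would compute the same quantity on batched tensors in a deep learning framework.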


Source: arXiv:2606.12396v1 (http://arxiv.org/abs/2606.12396v1)
PDF: https://arxiv.org/pdf/2606.12396v1
