Research Paper, Researchia:202603.26007 [Computer Vision]

Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving

Linbo Wang

Abstract

We introduce Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially-aware and dynamics-informed latent world representations. Existing world-model-based planners suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, resulting in sub-optimal planning under constrained data and compute budgets. Latent-WAM addresses these limitations with two core modules: a Spatial-Aware Compressive World Encoder (SCWE) that distills geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens via learnable queries, and a Dynamic Latent World Model (DLWM) that employs a causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations. Extensive experiments on NAVSIM v2 and HUGSIM demonstrate new state-of-the-art results: 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, surpassing the best prior perception-free method by 3.2 EPDMS with significantly less training data and a compact 104M-parameter model.
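The two mechanisms named in the abstract, compression of multi-view image tokens via learnable queries and causal attention over past representations, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes plain single-head attention without learned projections, omits the foundation-model distillation and motion conditioning entirely, and uses random arrays in place of real image features. All names and dimensions (`compress_with_queries`, `causal_self_attention`, `n_queries`, the mean-pooling step) are hypothetical choices made only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_with_queries(patch_tokens, queries):
    """Cross-attention: each query attends over all patch tokens.

    patch_tokens: (n_patches, d), queries: (n_queries, d) -> (n_queries, d)
    """
    d = queries.shape[-1]
    scores = queries @ patch_tokens.T / np.sqrt(d)
    return softmax(scores) @ patch_tokens

def causal_self_attention(tokens):
    """Self-attention where position i attends only to positions <= i.

    tokens: (t, d) -> (t, d)
    """
    t, d = tokens.shape
    scores = tokens @ tokens.T / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ tokens

# Illustrative sizes only; none of these come from the paper.
rng = np.random.default_rng(0)
n_views, n_patches, d, n_queries, n_frames = 3, 16, 8, 4, 5

# In a trained model the queries would be learned parameters.
queries = rng.normal(size=(n_queries, d))

# Stand-in features: patches from all camera views flattened per frame.
frames = rng.normal(size=(n_frames, n_views * n_patches, d))

# Compress each frame's 48 patch tokens into 4 scene tokens.
scene_tokens = np.stack([compress_with_queries(f, queries) for f in frames])

# Simplification: mean-pool scene tokens to one vector per frame,
# then mix information across time under a causal mask.
frame_summary = scene_tokens.mean(axis=1)
context = causal_self_attention(frame_summary)
```

Because of the causal mask, the output at each time step depends only on the current and earlier frames, which is the property that makes autoregressive prediction of future states possible. A real model would add learned query, key, and value projections, multiple heads and layers, and a prediction head.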


Source: arXiv:2603.24581v1 (http://arxiv.org/abs/2603.24581v1)
PDF: https://arxiv.org/pdf/2603.24581v1

Submission: 3/26/2026
Subjects: Computer Vision
arXiv: This paper is hosted on arXiv, an open-access repository