Research Paper · Researchia:202605.25011

Point Tracking Improves World Action Models

Jiarui Guan

Submitted: May 25, 2026
Subjects: Robotics

Description / Details

Robot policy learning benefits from world-action models that capture environment dynamics, but pixel-level prediction entangles dynamics with nuisance factors such as lighting and texture, making learned representations vulnerable to task-irrelevant visual variation. We propose JOPAT, a JOint Pixel-And-Track World-Action Model that predicts latent visual observations, 2D point tracks with visibility, and actions in a single denoising diffusion transformer. The key insight is that tracks provide an explicit representation of motion that captures long-horizon dynamics and remains robust under occlusion or partial out-of-frame motion, offering greater utility than modeling pixel appearance alone. On LIBERO and real-world LeRobot tasks, JOPAT improves over pixel-based baselines, with the largest gains on long-horizon tasks involving occlusion, object interaction, and off-screen motion.
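To make the idea of denoising pixels, tracks, and actions in one sequence more concrete, here is a minimal sketch. It is not the authors' implementation: the summary above does not specify the tokenisation, noise schedule, or training objective. The `Token` class, the `pack_tokens`, `add_noise`, and `joint_loss` helpers, the normalised track coordinates (values outside [0, 1] standing for off-screen points), and the noise-prediction loss are all assumptions made for illustration. The only standard piece is the DDPM forward noising step.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Token:
    modality: str        # "pixel", "track", or "action" (hypothetical tags)
    values: List[float]  # feature vector carried by this token


def pack_tokens(
    latents: Sequence[Sequence[float]],
    tracks: Sequence[Tuple[float, float, bool]],
    actions: Sequence[Sequence[float]],
) -> List[Token]:
    """Concatenate the three modalities into one token sequence.

    Assumption: each track is (x, y, visible) in normalised image
    coordinates, so a point that leaves the frame keeps a defined position.
    """
    seq = [Token("pixel", list(z)) for z in latents]
    seq += [Token("track", [x, y, 1.0 if vis else 0.0]) for x, y, vis in tracks]
    seq += [Token("action", list(a)) for a in actions]
    return seq


def add_noise(seq: List[Token], alpha_bar: float, rng: random.Random):
    """DDPM forward step: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    scale_x = math.sqrt(alpha_bar)
    scale_eps = math.sqrt(1.0 - alpha_bar)
    noisy, noise = [], []
    for tok in seq:
        eps = [rng.gauss(0.0, 1.0) for _ in tok.values]
        noise.append(eps)
        mixed = [scale_x * v + scale_eps * e for v, e in zip(tok.values, eps)]
        noisy.append(Token(tok.modality, mixed))
    return noisy, noise


def joint_loss(pred_noise, true_noise) -> float:
    """Mean squared error over every value of every token, all modalities together."""
    total, count = 0.0, 0
    for pred, true in zip(pred_noise, true_noise):
        for p, t in zip(pred, true):
            total += (p - t) ** 2
            count += 1
    return total / count


# Toy sample: one latent token, two tracked points (the second is
# off-screen and marked not visible), and one action token.
rng = random.Random(0)
seq = pack_tokens(
    latents=[[0.1, -0.3]],
    tracks=[(0.5, 0.25, True), (1.2, 0.4, False)],
    actions=[[0.0, 0.1, -0.1]],
)
noisy, eps = add_noise(seq, alpha_bar=0.5, rng=rng)
```

In a real model, a transformer would take `noisy` plus the timestep and predict `eps`; because all three modalities share one sequence and one loss, errors in the track tokens can shape the same representation that produces the action tokens.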


Source: arXiv:2605.23856v1 (http://arxiv.org/abs/2605.23856v1)
PDF: https://arxiv.org/pdf/2605.23856v1
