
ReMem-VLA: Empowering Vision-Language-Action Model with Memory via Dual-Level Recurrent Queries

Hang Li

Abstract

Vision-language-action (VLA) models for closed-loop robot control are typically cast under the Markov assumption, making them prone to errors on tasks requiring historical context. To incorporate memory, existing VLAs either retrieve from a memory bank, which can be misled by distractors, or extend the frame window, whose fixed horizon still limits long-term retention. In this paper, we introduce ReMem-VLA, a Recurrent Memory VLA model equipped with two sets of learnable queries: frame-level recurrent memory queries for propagating information across consecutive frames to support short-term memory, and chunk-level recurrent memory queries for carrying context across temporal chunks for long-term memory. These queries are trained end-to-end to aggregate and maintain relevant context over time, implicitly guiding the model's decisions without additional training or inference cost. Furthermore, to enhance visual memory, we introduce Past Observation Prediction as an auxiliary training objective. Through extensive memory-centric simulation and real-world robot experiments, we demonstrate that ReMem-VLA exhibits strong memory capabilities across multiple dimensions, including spatial, sequential, episodic, temporal, and visual memory. ReMem-VLA significantly outperforms memory-free VLA baselines π0.5 and OpenVLA-OFT and surpasses MemoryVLA on memory-dependent tasks by a large margin.

Submitted: March 16, 2026 · Subjects: Robotics
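
A minimal sketch of how the dual-level recurrent queries described in the abstract might be wired up in PyTorch. Everything here is an assumption made for illustration: the module name, query counts, hidden size, and the cross-attention update rule are not taken from the paper, which only specifies that two sets of learnable queries are recurrently propagated across frames and across chunks.

```python
import torch
import torch.nn as nn

class DualLevelRecurrentMemory(nn.Module):
    """Hypothetical dual-level recurrent memory.

    Frame-level queries carry short-term context frame to frame; chunk-level
    queries carry long-term context across temporal chunks. Sizes and the
    cross-attention update are illustrative assumptions, not paper details.
    """

    def __init__(self, dim=512, n_frame=16, n_chunk=16, heads=8):
        super().__init__()
        # Learnable initial states for both query sets.
        self.frame_queries = nn.Parameter(torch.randn(n_frame, dim) * 0.02)
        self.chunk_queries = nn.Parameter(torch.randn(n_chunk, dim) * 0.02)
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.chunk_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def init_state(self, batch):
        # Broadcast the learnable initial queries across the batch.
        return (self.frame_queries.expand(batch, -1, -1),
                self.chunk_queries.expand(batch, -1, -1))

    def frame_step(self, frame_mem, obs_tokens):
        # Short-term memory: queries read the current frame's visual tokens
        # together with their own previous state, then carry over to the
        # next frame.
        kv = torch.cat([frame_mem, obs_tokens], dim=1)
        out, _ = self.frame_attn(frame_mem, kv, kv)
        return out

    def chunk_step(self, chunk_mem, frame_mem):
        # Long-term memory: at a chunk boundary, chunk queries summarize the
        # final frame-level state and propagate it to the next chunk.
        kv = torch.cat([chunk_mem, frame_mem], dim=1)
        out, _ = self.chunk_attn(chunk_mem, kv, kv)
        return out

if __name__ == "__main__":
    mem = DualLevelRecurrentMemory()
    frame_mem, chunk_mem = mem.init_state(batch=1)
    # Fake episode: 2 chunks of 4 frames, 64 visual tokens per frame.
    episode = [[torch.randn(1, 64, 512) for _ in range(4)] for _ in range(2)]
    for chunk in episode:
        for obs_tokens in chunk:
            frame_mem = mem.frame_step(frame_mem, obs_tokens)
        chunk_mem = mem.chunk_step(chunk_mem, frame_mem)
```

The relevant property, consistent with the abstract's claim of no extra training or inference cost, is that `frame_mem` and `chunk_mem` are ordinary activations threaded through time: beyond the two fixed query sets, no new parameters or extra forward passes are introduced.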

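The abstract also names Past Observation Prediction as an auxiliary training objective for strengthening visual memory. Below is a hedged sketch of one plausible form, regressing pooled features of an earlier frame from the memory state; the loss form, pooling, target choice, and weighting are all assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PastObservationPrediction(nn.Module):
    """Hypothetical auxiliary head: reconstruct a past frame's features
    from the current memory tokens, encouraging the memory to retain
    visual detail."""

    def __init__(self, dim=512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                  nn.Linear(dim, dim))

    def forward(self, memory_tokens, past_obs_tokens):
        # Predict a pooled summary of a past observation from memory.
        pred = self.head(memory_tokens.mean(dim=1))
        target = past_obs_tokens.mean(dim=1).detach()  # stop-gradient target
        return F.mse_loss(pred, target)

# Assumed usage: added to the main action loss with a small weight, e.g.
#   loss = action_loss + 0.1 * pop_head(memory_tokens, past_obs_tokens)
```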


Source: arXiv:2603.12942v1 (http://arxiv.org/abs/2603.12942v1) · PDF: https://arxiv.org/pdf/2603.12942v1

