Research Paper · Researchia: 202604.08026 [Data Science > Machine Learning]

Value Mirror Descent for Reinforcement Learning

Zhichao Jia

Abstract

Value iteration-type methods have been extensively studied for computing a nearly optimal value function in reinforcement learning (RL). Under a generative sampling model, these methods can achieve sharper sample complexity than policy optimization approaches, particularly in their dependence on the discount factor. In practice, they are often employed for offline training or in simulated environments. In this paper, we consider discounted Markov decision processes with state space $S$, action space $A$, discount factor $\gamma \in (0,1)$, and costs in $[0,1]$. We introduce a novel value optimization method, termed value mirror descent (VMD), which integrates mirror descent from convex optimization into the classical value iteration framework. In the deterministic setting with known transition kernels, we show that VMD converges linearly. For the stochastic setting with a generative model, we develop a stochastic variant, SVMD, which incorporates the variance reduction commonly used in stochastic value iteration-type methods. For RL problems with general convex regularizers, SVMD attains a near-optimal sample complexity of $\tilde{O}(|S||A|(1-\gamma)^{-3}\varepsilon^{-2})$. Moreover, we establish that the Bregman divergence between the generated and optimal policies remains bounded throughout the iterations. This property is absent in existing stochastic value iteration-type methods but is important for enabling effective online (continual) learning after offline training. Under a strongly convex regularizer, SVMD achieves a sample complexity of $\tilde{O}(|S||A|(1-\gamma)^{-5}\varepsilon^{-1})$, improving performance in the high-accuracy regime. Furthermore, we prove convergence of the generated policy to the optimal policy. Overall, the proposed method, its analysis, and the resulting guarantees constitute new contributions to the RL and optimization literature.
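The abstract does not spell out the VMD update rule. As an illustration only, the sketch below shows a mirror-descent-flavored value iteration on a small tabular MDP, assuming a KL (negative-entropy) mirror map, under which the policy step reduces to a multiplicative-weights update. The function name `mirror_descent_vi`, the step size `eta`, and the update scheme are hypothetical choices for this sketch, not the paper's actual algorithm.

```python
import numpy as np

def mirror_descent_vi(P, c, gamma, eta, iters=500):
    """Hypothetical mirror-descent-style value iteration (cost minimization).

    P : (S, A, S) transition kernel, c : (S, A) costs in [0, 1],
    gamma : discount factor in (0, 1), eta : mirror-descent step size.
    """
    S, A = c.shape
    pi = np.full((S, A), 1.0 / A)      # start from the uniform policy
    V = np.zeros(S)                    # initial value estimate
    for _ in range(iters):
        # Q[s, a] = c[s, a] + gamma * sum_{s'} P[s, a, s'] * V[s']
        Q = c + gamma * (P @ V)
        # KL mirror-descent step: exponentiate toward low-cost actions,
        # then renormalize each state's action distribution
        pi = pi * np.exp(-eta * Q)
        pi /= pi.sum(axis=1, keepdims=True)
        # evaluate the current (soft) policy to update the value estimate
        V = (pi * Q).sum(axis=1)
    return V, pi
```

With a KL mirror map, the policy stays strictly positive at every iteration and moves smoothly toward the greedy policy, which loosely mirrors the abstract's point that the Bregman divergence to the optimal policy stays controlled along the iterations, unlike a hard argmax step in plain value iteration.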


Source: arXiv:2604.06039v1 - http://arxiv.org/abs/2604.06039v1
PDF: https://arxiv.org/pdf/2604.06039v1

Submission: 4/8/2026
Subjects: Machine Learning; Data Science

