
Proximal policy optimization



Proximal policy optimization (PPO) is a reinforcement learning (RL) algorithm for training an intelligent agent. Specifically, it is a policy gradient method, often used for deep RL when the policy network is very large.

== History ==
The predecessor to PPO, Trust Region Policy Optimization (TRPO), was published in 2015. It addressed the instability of an earlier algorithm, the Deep Q-Network (DQN), by using the trust region method to limit the KL divergence between the old and new policies. However, TRPO enforces the trust region with the Hessian matrix (a matrix of second derivatives), which is computationally inefficient for large-scale problems. PPO, published in 2017, is essentially an approximation of TRPO that does not require computing the Hessian: the KL-divergence constraint is approximated by simply clipping the probability ratio between the new and old policies in the objective. Since 2018, PPO has been the default RL algorithm at OpenAI. PPO has been applied to many areas, such as controlling a robotic arm, beating professional players at Dota 2 (OpenAI Five), and playing Atari games.

== TRPO ==
TRPO, the predecessor of PPO, is an on-policy algorithm. It can be used for environments with either discrete or continuous action spaces. The pseudocode is as follows:

Input: initial policy parameters \theta_0, initial value function parameters \phi_0
Hyperparameters: KL-divergence limit \delta, backtracking coefficient \alpha, maximum number of backtracking steps K
for k = 0, 1, 2, \ldots do
  Collect a set of trajectories D_k = \{\tau_i\} by running policy \pi_k = \pi(\theta_k) in the environment.
  Compute rewards-to-go \hat{R}_t.
  Compute advantage estimates \hat{A}_t (using any method of advantage estimation) based on the current value function V_{\phi_k}.
  Estimate the policy gradient \hat{g}_k and compute the update direction \hat{x}_k \approx \hat{H}_k^{-1} \hat{g}_k (for example with the conjugate gradient algorithm), where \hat{H}_k is the Hessian of the sample average KL-divergence.
  Update the policy by backtracking line search with
    \theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2\delta}{\hat{x}_k^T \hat{H}_k \hat{x}_k}} \, \hat{x}_k,
  where j \in \{0, 1, 2, \ldots, K\} is the smallest value that improves the sample loss and satisfies the sample KL-divergence constraint.
  Fit the value function by regression on mean-squared error:
    \phi_{k+1} = \arg\min_\phi \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^{T} \left( V_\phi(s_t) - \hat{R}_t \right)^2,
  typically via some gradient descent algorithm.
end for
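
To make the TRPO update concrete, the following is a minimal Python/NumPy sketch (not from the article) of the two numerical ingredients above: a conjugate-gradient solve for \hat{x}_k \approx \hat{H}_k^{-1} \hat{g}_k using only Hessian-vector products, and the backtracking line search that enforces the KL-divergence limit. The names conjugate_gradient, trpo_step, surrogate_loss, and mean_kl are hypothetical placeholders, and the sketch assumes the policy parameters form a flat NumPy vector with user-supplied callables for the surrogate loss, the sample-average KL, and its Hessian-vector product.

import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products hvp(v) = H v."""
    x = np.zeros_like(g)
    r = g.copy()                      # residual g - H x (x starts at zero)
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp + 1e-12)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def trpo_step(theta, grad, hvp, surrogate_loss, mean_kl,
              delta=0.01, backtrack_coeff=0.8, max_backtracks=10):
    """One TRPO policy update: natural-gradient direction plus backtracking line search.

    theta          -- current flat parameter vector theta_k
    grad           -- estimated policy gradient g_k at theta
    hvp(v)         -- Hessian-vector product of the sample-average KL at theta
    surrogate_loss -- callable theta -> surrogate objective (higher is better)
    mean_kl        -- callable theta -> sample-average KL between old and new policies
    """
    x = conjugate_gradient(hvp, grad)                    # x_k ~ H_k^{-1} g_k
    step_size = np.sqrt(2.0 * delta / (x @ hvp(x) + 1e-12))
    full_step = step_size * x
    old_loss = surrogate_loss(theta)
    for j in range(max_backtracks):                      # try alpha^j-scaled steps
        theta_new = theta + (backtrack_coeff ** j) * full_step
        if surrogate_loss(theta_new) > old_loss and mean_kl(theta_new) <= delta:
            return theta_new                             # smallest j satisfying both conditions
    return theta                                         # no acceptable step found; keep old policy

The Hessian of the KL divergence is never formed explicitly; only products \hat{H}_k v are needed, which is what keeps this trust-region step tractable for large policy networks, albeit still costlier than a plain gradient step.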

== PPO ==
The pseudocode is as follows:

Input: initial policy parameters \theta_0, initial value function parameters \phi_0
for k = 0, 1, 2, \ldots do
  Collect a set of trajectories D_k = \{\tau_i\} by running policy \pi_k = \pi(\theta_k) in the environment.
  Compute rewards-to-go \hat{R}_t.
  Compute advantage estimates \hat{A}_t (using any method of advantage estimation) based on the current value function V_{\phi_k}.
  Update the policy by maximizing the PPO-Clip objective:
    \theta_{k+1} = \arg\max_\theta \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^{T} \min\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)} A^{\pi_{\theta_k}}(s_t, a_t), \; g\left(\epsilon, A^{\pi_{\theta_k}}(s_t, a_t)\right) \right),
  typically via stochastic gradient ascent with Adam.
  Fit the value function by regression on mean-squared error:
    \phi_{k+1} = \arg\min_\phi \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^{T} \left( V_\phi(s_t) - \hat{R}_t \right)^2,
  typically via some gradient descent algorithm.
end for

Like all policy gradient methods, PPO trains an RL agent whose actions are determined by a differentiable policy function, updating the policy parameters by gradient ascent. Intuitively, a policy gradient method takes small policy update steps, so the agent can reach higher and higher rewards in expectation. Policy gradient methods can be unstable: a step size that is too large may push the policy in a suboptimal direction from which it has little chance of recovering, while a step size that is too small lowers overall training efficiency. To address this instability, PPO implements a clip function that constrains each policy update from being too large, so that larger step sizes can be used without destabilizing training.
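
As a concrete illustration of the clipped surrogate objective, here is a short PyTorch-style sketch (an illustrative example, not code from the article). It uses the commonly used equivalent form \min\left( r_t \hat{A}_t, \operatorname{clip}(r_t, 1-\epsilon, 1+\epsilon) \hat{A}_t \right), where r_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_k}(a_t \mid s_t) is the probability ratio; this is the same quantity written with g(\epsilon, A) in the pseudocode above. The function name ppo_clip_loss, the dummy tensors, and the choice \epsilon = 0.2 are assumptions for the example.

import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate objective, returned negated so a minimizer performs gradient ascent.

    log_probs_new -- log pi_theta(a_t | s_t) under the policy being optimized
    log_probs_old -- log pi_theta_k(a_t | s_t) under the data-collecting policy (no gradient)
    advantages    -- advantage estimates A_hat_t for the same state-action pairs
    """
    ratio = torch.exp(log_probs_new - log_probs_old)                   # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t), averaged over the batch
    return -torch.min(unclipped, clipped).mean()

# Illustrative usage with dummy numbers (real code would use a policy network and sampled trajectories):
log_probs_old = torch.log(torch.tensor([0.30, 0.10, 0.55]))
log_probs_new = torch.log(torch.tensor([0.35, 0.08, 0.60]))            # would carry gradients in practice
advantages = torch.tensor([1.2, -0.4, 0.7])
loss = ppo_clip_loss(log_probs_new, log_probs_old, advantages)

Returning the negated mean lets a standard optimizer such as Adam minimize the loss while effectively performing gradient ascent on the clipped surrogate objective; because the ratio is clipped, samples whose policy probability has already moved far from the data-collecting policy contribute no further gradient in that direction.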
