
Proximal policy optimization



Proximal policy optimization (PPO) is a reinforcement learning (RL) algorithm for training an intelligent agent. Specifically, it is a policy gradient method, often used for deep RL when the policy network is very large.

== History ==
The predecessor to PPO, Trust Region Policy Optimization (TRPO), was published in 2015. It addressed the instability of an earlier algorithm, the Deep Q-Network (DQN), by using the trust region method to limit the KL divergence between the old and new policies. However, TRPO enforces the trust region with the Hessian matrix (a matrix of second derivatives), which is computationally inefficient for large-scale problems. PPO, published in 2017, is essentially an approximation of TRPO that does not require computing the Hessian: the KL-divergence constraint is approximated by simply clipping the probability ratio between the new and old policies in the objective. Since 2018, PPO has been the default RL algorithm at OpenAI. PPO has been applied to many areas, such as controlling a robotic arm, beating professional players at Dota 2 (OpenAI Five), and playing Atari games.

== TRPO ==
TRPO, the predecessor of PPO, is an on-policy algorithm. It can be used for environments with either discrete or continuous action spaces. The pseudocode is as follows:

Input: initial policy parameters \theta_0, initial value function parameters \phi_0
Hyperparameters: KL-divergence limit \delta, backtracking coefficient \alpha, maximum number of backtracking steps K
for k = 0, 1, 2, \ldots do
  Collect a set of trajectories D_k = \{\tau_i\} by running policy \pi_k = \pi(\theta_k) in the environment.
  Compute rewards-to-go \hat{R}_t.
  Compute advantage estimates \hat{A}_t (using any method of advantage estimation) based on the current value function V_{\phi_k}.
  Estimate the policy gradient \hat{g}_k and compute the update direction \hat{x}_k \approx \hat{H}_k^{-1} \hat{g}_k (for example with the conjugate gradient algorithm), where \hat{H}_k is the Hessian of the sample average KL-divergence.
  Update the policy by backtracking line search with
    \theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2\delta}{\hat{x}_k^T \hat{H}_k \hat{x}_k}} \, \hat{x}_k,
  where j \in \{0, 1, 2, \ldots, K\} is the smallest value that improves the sample loss and satisfies the sample KL-divergence constraint.
  Fit the value function by regression on mean-squared error:
    \phi_{k+1} = \arg\min_\phi \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^{T} \left( V_\phi(s_t) - \hat{R}_t \right)^2,
  typically via some gradient descent algorithm.
end for
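
To make the TRPO update concrete, the following is a minimal Python/NumPy sketch (not from the article) of the two numerical ingredients above: a conjugate-gradient solve for \hat{x}_k \approx \hat{H}_k^{-1} \hat{g}_k using only Hessian-vector products, and the backtracking line search that enforces the KL-divergence limit. The names conjugate_gradient, trpo_step, surrogate_loss, and mean_kl are hypothetical placeholders, and the sketch assumes the policy parameters form a flat NumPy vector with user-supplied callables for the surrogate loss, the sample-average KL, and its Hessian-vector product.

import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products hvp(v) = H v."""
    x = np.zeros_like(g)
    r = g.copy()                      # residual g - H x (x starts at zero)
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp + 1e-12)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def trpo_step(theta, grad, hvp, surrogate_loss, mean_kl,
              delta=0.01, backtrack_coeff=0.8, max_backtracks=10):
    """One TRPO policy update: natural-gradient direction plus backtracking line search.

    theta          -- current flat parameter vector theta_k
    grad           -- estimated policy gradient g_k at theta
    hvp(v)         -- Hessian-vector product of the sample-average KL at theta
    surrogate_loss -- callable theta -> surrogate objective (higher is better)
    mean_kl        -- callable theta -> sample-average KL between old and new policies
    """
    x = conjugate_gradient(hvp, grad)                    # x_k ~ H_k^{-1} g_k
    step_size = np.sqrt(2.0 * delta / (x @ hvp(x) + 1e-12))
    full_step = step_size * x
    old_loss = surrogate_loss(theta)
    for j in range(max_backtracks):                      # try alpha^j-scaled steps
        theta_new = theta + (backtrack_coeff ** j) * full_step
        if surrogate_loss(theta_new) > old_loss and mean_kl(theta_new) <= delta:
            return theta_new                             # smallest j satisfying both conditions
    return theta                                         # no acceptable step found; keep old policy

The Hessian of the KL divergence is never formed explicitly; only products \hat{H}_k v are needed, which is what keeps this trust-region step tractable for large policy networks, albeit still costlier than a plain gradient step.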

== PPO ==
The pseudocode is as follows:

Input: initial policy parameters \theta_0, initial value function parameters \phi_0
for k = 0, 1, 2, \ldots do
  Collect a set of trajectories D_k = \{\tau_i\} by running policy \pi_k = \pi(\theta_k) in the environment.
  Compute rewards-to-go \hat{R}_t.
  Compute advantage estimates \hat{A}_t (using any method of advantage estimation) based on the current value function V_{\phi_k}.
  Update the policy by maximizing the PPO-Clip objective:
    \theta_{k+1} = \arg\max_\theta \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^{T} \min\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)} A^{\pi_{\theta_k}}(s_t, a_t), \; g\left(\epsilon, A^{\pi_{\theta_k}}(s_t, a_t)\right) \right),
  typically via stochastic gradient ascent with Adam.
  Fit the value function by regression on mean-squared error:
    \phi_{k+1} = \arg\min_\phi \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^{T} \left( V_\phi(s_t) - \hat{R}_t \right)^2,
  typically via some gradient descent algorithm.
end for

Like all policy gradient methods, PPO trains an RL agent whose actions are determined by a differentiable policy function, updating the policy parameters by gradient ascent. Intuitively, a policy gradient method takes small policy update steps, so the agent can reach higher and higher rewards in expectation. Policy gradient methods can be unstable: a step size that is too large may push the policy in a suboptimal direction from which it has little chance of recovering, while a step size that is too small lowers overall training efficiency. To address this instability, PPO implements a clip function that constrains each policy update from being too large, so that larger step sizes can be used without destabilizing training.
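
As a concrete illustration of the clipped surrogate objective, here is a short PyTorch-style sketch (an illustrative example, not code from the article). It uses the commonly used equivalent form \min\left( r_t \hat{A}_t, \operatorname{clip}(r_t, 1-\epsilon, 1+\epsilon) \hat{A}_t \right), where r_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_k}(a_t \mid s_t) is the probability ratio; this is the same quantity written with g(\epsilon, A) in the pseudocode above. The function name ppo_clip_loss, the dummy tensors, and the choice \epsilon = 0.2 are assumptions for the example.

import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate objective, returned negated so a minimizer performs gradient ascent.

    log_probs_new -- log pi_theta(a_t | s_t) under the policy being optimized
    log_probs_old -- log pi_theta_k(a_t | s_t) under the data-collecting policy (no gradient)
    advantages    -- advantage estimates A_hat_t for the same state-action pairs
    """
    ratio = torch.exp(log_probs_new - log_probs_old)                   # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t), averaged over the batch
    return -torch.min(unclipped, clipped).mean()

# Illustrative usage with dummy numbers (real code would use a policy network and sampled trajectories):
log_probs_old = torch.log(torch.tensor([0.30, 0.10, 0.55]))
log_probs_new = torch.log(torch.tensor([0.35, 0.08, 0.60]))            # would carry gradients in practice
advantages = torch.tensor([1.2, -0.4, 0.7])
loss = ppo_clip_loss(log_probs_new, log_probs_old, advantages)

Returning the negated mean lets a standard optimizer such as Adam minimize the loss while effectively performing gradient ascent on the clipped surrogate objective; because the ratio is clipped, samples whose policy probability has already moved far from the data-collecting policy contribute no further gradient in that direction.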
