Proximal policy optimization
Proximal policy optimization (PPO) is a reinforcement learning (RL) algorithm for training an intelligent agent. Specifically, it is a policy gradient method, often used for deep RL when the policy network is very large.
== History ==

The predecessor to PPO, Trust Region Policy Optimization (TRPO), was published in 2015. It addressed the instability of an earlier algorithm, the Deep Q-Network (DQN), by using a trust region method to limit the KL divergence between the old and new policies. However, TRPO enforces the trust region using the Hessian matrix (a matrix of second derivatives), which is computationally inefficient for large-scale problems. PPO, published in 2017, is essentially an approximation of TRPO that does not require computing the Hessian: the KL-divergence constraint is approximated by simply clipping the probability ratio in the surrogate objective. Since 2018, PPO has been the default RL algorithm at OpenAI. PPO has been applied to many areas, such as controlling a robotic arm, beating professional players at Dota 2 (OpenAI Five), and playing Atari games.
== TRPO ==

TRPO, the predecessor of PPO, is an on-policy algorithm. It can be used for environments with either discrete or continuous action spaces. The pseudocode is as follows:

Input: initial policy parameters θ_0, initial value function parameters ϕ_0
Hyperparameters: KL-divergence limit δ, backtracking coefficient α, maximum number of backtracking steps K
for k = 0, 1, 2, … do
    Collect a set of trajectories D_k = {τ_i} by running policy π_k = π(θ_k) in the environment.
    Compute rewards-to-go R̂_t.
    Compute advantage estimates Â_t (using any method of advantage estimation) based on the current value function V_{ϕ_k}.
    Estimate the policy gradient as
        ĝ_k = (1/|D_k|) Σ_{τ ∈ D_k} Σ_{t=0}^{T} Â_t ∇_θ log π_θ(a_t | s_t) |_{θ_k},
    and let Ĥ_k denote the Hessian of the sample average KL-divergence.
    Update the policy by backtracking line search with
        θ_{k+1} = θ_k + α^j √( 2δ / (ĝ_kᵀ Ĥ_k⁻¹ ĝ_k) ) Ĥ_k⁻¹ ĝ_k,
    where j ∈ {0, 1, 2, …, K} is the smallest value which improves the sample loss and satisfies the sample KL-divergence constraint.
    Fit the value function by regression on mean-squared error:
        ϕ_{k+1} = argmin_ϕ (1/(|D_k| T)) Σ_{τ ∈ D_k} Σ_{t=0}^{T} ( V_ϕ(s_t) − R̂_t )²,
    typically via some gradient descent algorithm.
end for
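The rewards-to-go R̂_t and advantage estimates Â_t used in the loop above can be computed from a collected trajectory in a few lines. A minimal sketch in plain Python, using generalized advantage estimation (GAE) as the advantage method (one common choice; the pseudocode allows any estimator, and the function names here are illustrative):

```python
def rewards_to_go(rewards, gamma=0.99):
    """Discounted sum of future rewards: R_hat_t = sum over t' >= t of gamma^(t'-t) * r_t'."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one finite trajectory.

    `values` holds V(s_t) for t = 0..T, i.e. one extra bootstrap entry
    beyond the last reward."""
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

With gamma = 1 and lam = 1 against a zero value function, the GAE advantages reduce to the rewards-to-go, which is a quick sanity check for both functions.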
== PPO ==

The pseudocode is as follows:

Input: initial policy parameters θ_0, initial value function parameters ϕ_0
for k = 0, 1, 2, … do
    Collect a set of trajectories D_k = {τ_i} by running policy π_k = π(θ_k) in the environment.
    Compute rewards-to-go R̂_t.
    Compute advantage estimates Â_t (using any method of advantage estimation) based on the current value function V_{ϕ_k}.
    Update the policy by maximizing the PPO-Clip objective:
        θ_{k+1} = argmax_θ (1/(|D_k| T)) Σ_{τ ∈ D_k} Σ_{t=0}^{T} min( (π_θ(a_t | s_t) / π_{θ_k}(a_t | s_t)) Â_t, clip( π_θ(a_t | s_t) / π_{θ_k}(a_t | s_t), 1 − ε, 1 + ε ) Â_t ),
    where ε is the clipping hyperparameter, typically via stochastic gradient ascent with Adam.
    Fit the value function by regression on mean-squared error:
        ϕ_{k+1} = argmin_ϕ (1/(|D_k| T)) Σ_{τ ∈ D_k} Σ_{t=0}^{T} ( V_ϕ(s_t) − R̂_t )²,
    typically via some gradient descent algorithm.
end for
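The PPO-Clip objective reduces to a simple per-timestep computation once the log-probabilities and advantages are in hand. A minimal sketch in plain Python (the inputs are assumed precomputed; a real implementation would differentiate this expression through the policy network rather than evaluate it on scalars):

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Average PPO-Clip surrogate objective (to be maximized).

    logp_new: log pi_theta(a_t|s_t) under the policy being optimized
    logp_old: log pi_theta_k(a_t|s_t) under the data-collecting policy
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_theta(a|s) / pi_theta_k(a|s)
        clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

When the new and old policies coincide, every ratio is 1 and the objective equals the mean advantage, matching the unclipped policy gradient surrogate at the start of each update.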
Like all policy gradient methods, PPO trains an RL agent whose actions are determined by a differentiable policy function, updating the policy parameters by gradient ascent.
Intuitively, a policy gradient method takes small policy update steps, so that the agent can reach higher and higher expected rewards.
Policy gradient methods can be unstable: a step size that is too large may push the policy in a suboptimal direction from which it has little chance of recovering, while a step size that is too small lowers overall sample efficiency.
To address this instability, PPO uses a clip function that prevents the policy update from becoming too large, so that larger step sizes can be used more safely.
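The effect of the clip can be seen numerically: once the probability ratio moves outside [1 − ε, 1 + ε] in the direction favored by the advantage, the objective stops growing, removing the incentive for an even larger step. A small illustrative computation, with ε = 0.2 assumed:

```python
def clipped_term(ratio, advantage, eps=0.2):
    """Single-timestep PPO-Clip term: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, pushing the ratio past 1 + eps gains nothing:
print(clipped_term(1.1, advantage=1.0))   # within the clip range: 1.1
print(clipped_term(1.5, advantage=1.0))   # capped at (1 + eps) * A
# With a negative advantage, the min keeps the more pessimistic unclipped term:
print(clipped_term(1.5, advantage=-1.0))  # -1.5, not -1.2
```

Note the asymmetry in the last line: the outer min means clipping only ever makes the objective a lower bound on the unclipped surrogate, never an optimistic estimate.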