Concise Proximal Policy Optimization
Posted: Thu Feb 16, 2023 6:17 pm
PPO is the reinforcement learning technique used by OpenAI and others.
As far as I can tell it is a method by which the updates to the model / NN are made using a ratio r(policy) between the current and previous policies. This I take to mean the outputs of the model / NN for a given state before and after being updated.
The policy for any given state is the output of the model at time (t); this output is a Q-value array with one entry per action. Therefore the ratio is between two arrays.
This is then restricted / bounded to within limits, making it remain closer / proximal to existing policies.
Then it is multiplied by the Advantage, a measure of how much better taking that action is compared to the value estimate of the current state.
ratio(policy) = policy net (state) -> Act -> Bellman(next state) / previous policy net (state) -> oldAct -> Bellman(next state)
* Advantage previous(state,oldAct)
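To make the ratio concrete, here is a minimal Python (numpy) sketch. The numbers and names are made up for illustration, and it assumes the usual PPO formulation where the ratio is taken on the probability of the action that was actually played, rather than elementwise over the whole output array:

import numpy as np

# Outputs of the policy net for one state, before and after the update.
# In the standard PPO write-up these are action probabilities; the values
# below are placeholders.
probs_new = np.array([0.2, 0.5, 0.3])   # current policy net (state)
probs_old = np.array([0.25, 0.45, 0.3]) # previous policy net (state)
action = 1                              # the action that was actually taken (oldAct)

ratio = probs_new[action] / probs_old[action]  # r = pi_new(a|s) / pi_old(a|s)
print(ratio)  # > 1 means the update made that action more likely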
Advantage (s,a) = Q(s,a)-V(s)
= Bellman(next state) - Bellman(current state)
V(s) is normally the output of the NN
Q(s,a) is the Q-value for the state-action pair
next state is the state visited after s,a
the current state is the prior state fed into the NN, which delivers a value
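A tiny sketch of that advantage estimate using a one-step Bellman backup; the reward, gamma and the two V values are placeholders, and in practice V(s) and V(next state) come from the NN:

gamma = 0.99    # discount factor
reward = 1.0    # reward received after taking action a in state s
v_current = 2.0 # V(s): value the NN predicts for the current state
v_next = 2.5    # V(s'): value the NN predicts for the next state

q_sa = reward + gamma * v_next   # Bellman estimate of Q(s,a)
advantage = q_sa - v_current     # Advantage(s,a) = Q(s,a) - V(s)
print(advantage)                 # positive => the action was better than expected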
PPO:
r(policy) = probability ratio
r(policy) = policy net (state) -> Act -> Bellman(next state) / previous policy net (state) -> oldAct -> Bellman(next state)
r(policy) = clip(r(policy),1-e,1+e)
Loss/Err = r(policy)*Advantage previous(state,oldAct)
PPO restricts / clips the policy ratio to within the 1-e, 1+e limits
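Here is a minimal numpy sketch of that clipped loss over a batch. One caveat: the published PPO objective does not only clip the ratio, it takes the minimum of the clipped and unclipped surrogate terms, so the sketch follows the paper; epsilon around 0.2 is typical and the sample numbers are made up:

import numpy as np

def ppo_clip_loss(ratios, advantages, eps=0.2):
    # Unclipped surrogate: r * A
    unclipped = ratios * advantages
    # Clipped surrogate: clip(r, 1-e, 1+e) * A
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
    # PPO maximises the minimum of the two, so the loss is its negative
    return -np.mean(np.minimum(unclipped, clipped))

ratios = np.array([1.1, 0.8, 1.4])       # pi_new(a|s) / pi_old(a|s) per sample
advantages = np.array([0.5, -0.2, 1.0])  # advantage estimates per sample
print(ppo_clip_loss(ratios, advantages))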
PPO is a descendant of TRPO (Trust Region Policy Optimization).
In TRPO a trust region with respect to the policy ratio is defined by the Kullback–Leibler divergence as follows:
TRPO:
r(policy) = policy net (state) -> Act -> Bellman(next state) / previous policy net (state) -> oldAct -> Bellman(next state)
if DKL(policy(state),previous policy(state))<=lambda
Loss/Err = r(policy)*Advantage previous(state,oldAct)
(the policy is the output of the NN given the current state)
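And a small sketch of the TRPO-style trust-region check, using a discrete-action KL between the two policy outputs. The probabilities, advantage and the lambda bound are placeholders, and real TRPO solves a constrained step with a line search rather than a plain accept / reject, so this only illustrates the constraint:

import numpy as np

def kl_divergence(p, q):
    # D_KL(p || q) for discrete action distributions
    return np.sum(p * np.log(p / q))

probs_new = np.array([0.2, 0.5, 0.3])   # policy net (state)
probs_old = np.array([0.25, 0.45, 0.3]) # previous policy net (state)
action = 1                              # oldAct, the action that was taken
advantage = 0.5                         # Advantage previous(state, oldAct)
kl_limit = 0.01                         # the lambda bound above

ratio = probs_new[action] / probs_old[action]
if kl_divergence(probs_new, probs_old) <= kl_limit:
    surrogate = ratio * advantage       # inside the trust region: use the surrogate
else:
    surrogate = None                    # outside the trust region: reject / shrink the step
print(surrogate)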