Concise Proximal Policy Optimization


Postby hbyte » Thu Feb 16, 2023 6:17 pm

PPO (Proximal Policy Optimization) is the reinforcement learning technique used by OpenAI and others.

As far as I can tell, it is a method in which updates to the model / NN are made using a ratio r(policy) between the current and previous policies. I take this to mean the outputs of the model / NN for a given state before and after the update.

The policy for any given state is the output of the model at time t. Here that output is an array of Q-values, one per action, so the ratio is taken between two arrays.

This ratio is then restricted / bounded to within limits, keeping the update close / proximal to the existing policy.

It is then multiplied by the Advantage: a measure of how much better a given action is than the value expected for the current state.


ratio(policy) = [ policy net(state) -> Act -> Bellman(next state) ] / [ previous policy net(state) -> oldAct -> Bellman(next state) ]

This ratio is then multiplied by Advantage_previous(state, oldAct), where:

Advantage(s,a) = Q(s,a) - V(s)
               = Bellman(next state) - Bellman(current state)

V(s) is normally the value output of the NN
Q(s,a) is the Q-value for the state-action pair
"next state" is the state visited after s,a
"current state" is the prior state fed into the NN, which returns a value
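
As a rough numpy sketch of that Advantage estimate (a toy setup I'm assuming here: the net outputs an array of Q-values and V(s) is read off that same array; the function names are mine, not from any library):

import numpy as np

def bellman_target(reward, next_q_values, discount=0.99):
    # One-step Bellman estimate for Q(s,a): r + discount * max_a' Q(s',a')
    return reward + discount * np.max(next_q_values)

def advantage(q_values, reward, next_q_values, discount=0.99):
    # A(s,a) = Q(s,a) - V(s), with Q(s,a) replaced by its Bellman estimate
    q_sa = bellman_target(reward, next_q_values, discount)
    v_s = np.max(q_values)   # V(s) read off the current network output
    return q_sa - v_s

print(advantage(np.array([0.2, 0.5]), reward=1.0, next_q_values=np.array([0.4, 0.6])))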


PPO:

r(policy) = probability ratio

r(policy) = [ policy net(state) -> Act -> Bellman(next state) ] / [ previous policy net(state) -> oldAct -> Bellman(next state) ]

r(policy) = clip(r(policy), 1-e, 1+e)

Loss/Err = r(policy) * Advantage_previous(state, oldAct)

PPO restricts/clips the policy ratio to within the 1-e, 1+e limits. (The full objective in the paper takes the minimum of the unclipped and clipped terms and negates it; see the loss in the last post below.)
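
A minimal numpy sketch of that clipped surrogate, assuming ratio and adv are per-sample arrays already computed; eps = 0.2 is the usual default, and the min() with the unclipped term is the form from the PPO paper:

import numpy as np

def ppo_surrogate_loss(ratio, adv, eps=0.2):
    # ratio is new_policy / old_policy per sample, adv is the advantage
    # computed under the old policy.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Minimum of the unclipped and clipped terms, negated so that
    # minimising the loss maximises the surrogate objective.
    return -np.mean(np.minimum(ratio * adv, clipped * adv))

print(ppo_surrogate_loss(np.array([0.9, 1.1, 1.4]), np.array([0.5, 1.0, 2.0])))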

PPO is a descendant of TRPO (Trust Region Policy Optimization).

In TRPO a trust region around the previous policy is defined using the Kullback–Leibler divergence, as follows:

TRPO:

r(policy) = [ policy net(state) -> Act -> Bellman(next state) ] / [ previous policy net(state) -> oldAct -> Bellman(next state) ]

if DKL( policy(state), previous policy(state) ) <= lambda

Loss/Err = r(policy) * Advantage_previous(state, oldAct)

(policy here is the output of the NN given the current state)
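
A sketch of that trust-region check for a single state, using a discrete KL divergence over the action probabilities; lam plays the role of the lambda threshold above (the function and variable names are just illustrative):

import numpy as np

def kl_divergence(p_old, p_new, eps=1e-8):
    # Discrete KL(old || new) over the action probabilities for one state
    return np.sum(p_old * (np.log(p_old + eps) - np.log(p_new + eps)))

def trpo_step_allowed(p_old, p_new, lam=0.01):
    # Only apply the surrogate update if the new policy stays inside the trust region
    return kl_divergence(p_old, p_new) <= lam

p_old = np.array([0.25, 0.25, 0.5])
p_new = np.array([0.22, 0.28, 0.5])
print(trpo_step_allowed(p_old, p_new))   # True for this small change (KL ~ 0.004)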

Two Brains are better than one

Postby hbyte » Fri Aug 09, 2024 1:20 pm

As I delved deeper into the workings of the PPO algorithm, I realised that it is based on the Actor-Critic model, which in effect has two brains:

1. The Value function, which provides reward estimates and is trained on the rewards of the problem.
The Value function's loss is just the MSE of Reward(S) - V(S) (a sketch of this follows the list).

2. The Policy function, which takes states as inputs and outputs actions in the form of probabilities (or logits, as they are called).
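
A minimal sketch of the critic loss from point 1, assuming the targets are the rewards (or discounted returns) collected from the environment; the names are illustrative:

import numpy as np

def value_loss(values, reward_targets):
    # MSE between the Value NN's estimates V(S) and the reward targets
    return np.mean((reward_targets - values) ** 2)

print(value_loss(np.array([1.0, 0.5]), np.array([1.2, 0.3])))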

The Policy function's weights are updated using a loss that is calculated as follows:


PPO Summary:

Calculate the Advantage using the discounted sum of rewards over the episode, starting at ThisState and running to the end time T < MaxT:

A(S,A) =
Sum over t steps up to T { Discount^t * Reward_t under Policy(S,A) } - V(S)

The rewards come from the environment.
The Value function is a value NN trained alongside the Policy NN.

Update the Policy NN and the Value NN using:

ratio = Q(S,A) / Q_preUpdt(S,A)
Loss = clip(ratio, 1-e, 1+e) * A(S,A)
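
Putting that summary into a small numpy sketch (my assumptions: rewards runs from ThisState to the episode end, value_s is the Value NN's V(S), and q_new / q_old are the policy outputs for the chosen action after and before the update; all names are illustrative):

import numpy as np

def discounted_return(rewards, discount=0.99):
    # Sum_t discount^t * reward_t, starting at the current state
    return sum((discount ** t) * r for t, r in enumerate(rewards))

def advantage(rewards, value_s, discount=0.99):
    # A(S,A) = discounted return from this state onwards minus V(S)
    return discounted_return(rewards, discount) - value_s

def clipped_update_term(q_new, q_old, adv, eps=0.2):
    # Ratio between current and pre-update policy outputs, clipped as above
    ratio = q_new / q_old
    return np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv

adv = advantage([1.0, 0.0, 1.0], value_s=1.5)
print(clipped_update_term(q_new=0.6, q_old=0.5, adv=adv))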

In the actual PPO paper the Advantage calculation is a bit more involved (generalized advantage estimation):

Delta(t) = Reward(t) + Discount*V(S_t+1) - V(S_t)

Advantage(t) = Sum over l { (Discount * Schedule)^l * Delta(t+l) }, where Schedule is the decay parameter (lambda in the paper)

Basically we are comparing the discounted sum of rewards gained over a fixed number of time steps with the value estimate provided by the Value function at State(t). This difference is the Advantage, and it is used together with the clipped ratio (see above) to update the weights of the Policy neural network.
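
A sketch of that calculation, reading Schedule(t) as the GAE decay parameter lambda, so the Advantage becomes a decayed sum of the deltas (names are illustrative):

import numpy as np

def gae_advantages(rewards, values, discount=0.99, lam=0.95):
    # values carries one extra entry for V(S_T) after the last reward
    T = len(rewards)
    deltas = [rewards[t] + discount * values[t + 1] - values[t] for t in range(T)]
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # A_t = delta_t + (discount * lam) * A_{t+1}
        running = deltas[t] + discount * lam * running
        advantages[t] = running
    return advantages

print(gae_advantages([1.0, 0.0, 1.0], [0.5, 0.6, 0.7, 0.0]))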

Re: Concise Proximal Policy Optimization

Postby hbyte » Mon Sep 02, 2024 9:22 pm

RL

bellQ[nextState] = NetOutput[thisState]

bellQ[nextState] = bellQ[nextState] + 0.5*( reward[thisState][nextState] + discount*Max(NetOutput[nextState]) - bellQ[nextState] )

This is:
Q(s,a) = Q(s,a) + 0.5*( reward(s,a) + discount*Max(Q(s',a')) - Q(s,a) )
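
A runnable version of this update for a small tabular case, with Q standing in for the network output and the 0.5 step size kept from above:

import numpy as np

def q_update(Q, s, a, reward, s_next, discount=0.99, lr=0.5):
    # One Q-learning step: move Q(s,a) towards r + discount * max_a' Q(s',a')
    target = reward + discount * np.max(Q[s_next])
    Q[s, a] += lr * (target - Q[s, a])
    return Q

Q = np.zeros((4, 2))
q_update(Q, s=0, a=1, reward=1.0, s_next=2)
print(Q[0, 1])   # 0.5 after one update from zero initialisation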

PPO-RL

Advantage = Q(s,a) - V(s)

Q(s,a) = NetOutput[thisState][nextState or thisAct] (this won't work without a discounted reward)
       = reward[thisState][nextState] + discount*reward[preState][thisState] (sample 3 states: Pre, This, Next)
       = Sum[i=0..T]{ discount^i * reward[i] }, with reward[0] taken at thisState (computed for the entire episode)

or use the Bellman estimate from above:

Q(s,a) = bellQ[nextState]


V(s) = Sum[i]{ NetOutput[thisState][i] } / Nactions (we don't have a Value NN, so use the mean of the Q-values)
     = mean(NetOutput[thisState])

Ratio = NetOutput[thisState][nextState] / Pretraining_NetOutput[thisState][nextState] (the probability ratio new/old; use epoch training)

Loss = -1 * Min( Ratio*Advantage[thisState][nextState], Clip(Ratio, 1-0.2, 1+0.2)*Advantage[thisState][nextState] )
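
Tying this post together as one runnable numpy sketch (my assumptions: net_out / old_net_out stand in for NetOutput and Pretraining_NetOutput for one state, the action probabilities come from a softmax over those scores, V(s) is the mean over actions as above, and Q(s,a) is the discounted return collected from the environment):

import numpy as np

def softmax(x):
    # Convert per-action scores into probabilities
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

def ppo_loss_for_state(net_out, old_net_out, action, q_sa, eps=0.2):
    v_s = np.mean(net_out)            # V(s): mean over actions, no separate Value NN
    adv = q_sa - v_s                  # Advantage = Q(s,a) - V(s)

    pi_new = softmax(net_out)[action]
    pi_old = softmax(old_net_out)[action]
    ratio = pi_new / pi_old           # probability ratio, not a ratio of logs

    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Negated min of the unclipped and clipped surrogate terms
    return -np.minimum(ratio * adv, clipped * adv)

net_out = np.array([0.2, 0.9, 0.1])       # current policy scores for this state
old_net_out = np.array([0.3, 0.7, 0.1])   # scores before the update epoch
print(ppo_loss_for_state(net_out, old_net_out, action=1, q_sa=1.4))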

