RLHF summary and description

Postby hbyte » Wed Aug 07, 2024 9:50 pm

A reward model is trained to produce a scalar reward for each pair of prompt and output text. For a given prompt, outputs are generated by more than one LLM, those outputs are rated with human feedback, and the ranks of the different outputs for the same prompt are normalized to give the training signal for the reward model.
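
A minimal sketch of that training step, assuming PyTorch and a hypothetical reward_model callable that maps a batch of (prompt + output) token ids to one scalar per sequence; the human rankings are reduced to (chosen, rejected) pairs and trained with a pairwise logistic (Bradley-Terry style) loss:

Code:
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_model, prompt_ids, chosen_ids, rejected_ids):
    # reward_model maps (batch, seqlen) token ids to a (batch,) scalar reward
    r_chosen = reward_model(torch.cat([prompt_ids, chosen_ids], dim=1))
    r_rejected = reward_model(torch.cat([prompt_ids, rejected_ids], dim=1))
    # push the human-preferred output's reward above the rejected output's
    return -F.logsigmoid(r_chosen - r_rejected).mean()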

Reward Model:

Input: dim(batch * (prompt_seqlen + output_seqlen))
Output: dim(batch * 1) scalar reward
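
One way those dimensions could look in code, as a sketch only: base_transformer is an assumed pretrained LLM body that returns hidden states of shape (batch, seqlen, hidden_size), and a single linear head turns the last hidden state into the scalar reward.

Code:
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_transformer, hidden_size):
        super().__init__()
        self.base = base_transformer                 # pretrained LLM body (assumed)
        self.value_head = nn.Linear(hidden_size, 1)  # scalar reward head

    def forward(self, input_ids):
        # input_ids: (batch, prompt_seqlen + output_seqlen)
        hidden = self.base(input_ids)                # (batch, seqlen, hidden_size)
        last = hidden[:, -1, :]                      # hidden state of the final token
        return self.value_head(last).squeeze(-1)     # (batch,) scalar reward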

The reward model is then used alongside a frozen pretrained LLM (the reference) to train another LLM (the policy) with reinforcement learning:

1. policy = the policy LLM (LLM1) generates text (the action) from the prompt (the observation)
2. reward = reward model score + a constraint (penalty), as sketched below
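
A skeleton of that loop, purely as a sketch: policy_llm, ref_llm, reward_model, compute_reward and ppo_update are all assumed callables/objects standing in for the real pieces, and generate is assumed to return both the sampled token ids and their log-probabilities.

Code:
def rlhf_training_loop(policy_llm, ref_llm, reward_model, prompt_loader,
                       compute_reward, ppo_update):
    # one pass over the prompt set
    for prompt_ids in prompt_loader:
        # 1. policy: the LLM generates text (action) from the prompt (observation)
        output_ids, logprobs = policy_llm.generate(prompt_ids)

        # 2. reward: reward-model score plus the KL constraint (penalty),
        #    assembled by compute_reward (see the KL sketch further below)
        rewards = compute_reward(reward_model, policy_llm, ref_llm,
                                 prompt_ids, output_ids)

        # policy-gradient (e.g. PPO) update of the policy LLM
        ppo_update(policy_llm, prompt_ids, output_ids, logprobs, rewards)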

First a penalty is computed by comparing the outputs of the two LLMs (the policy and the frozen reference).

The penalty is included in the reward for each generated output.

tokensLLM = per-token probabilities each LLM assigns to the generated text for the prompt

Rkl = KL(tokensLLM1, tokensLLM2)
R0 = scalar output of the reward model for the prompt and generated text

R = R0 - lambda * Rkl
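
Assembled in code, again only as a sketch: log_probs is an assumed helper that returns the per-token log-probabilities of output_ids given prompt_ids under each model, and the KL term is the usual sampled estimate taken from the difference of those log-probabilities.

Code:
import torch

def combined_reward(reward_model, policy_llm, ref_llm,
                    prompt_ids, output_ids, lam=0.1):
    # R0: scalar output of the reward model for prompt + generated text
    r0 = reward_model(torch.cat([prompt_ids, output_ids], dim=1))  # (batch,)

    # per-token log-probabilities of the generated text under each LLM
    logp_policy = policy_llm.log_probs(prompt_ids, output_ids)     # (batch, out_len)
    logp_ref = ref_llm.log_probs(prompt_ids, output_ids)           # (batch, out_len)

    # Rkl: sampled estimate of the KL divergence between the policy's and
    # the reference's token distributions over the generated text
    rkl = (logp_policy - logp_ref).sum(dim=1)                      # (batch,)

    # R = R0 - lambda * Rkl
    return r0 - lam * rkl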



