@damekdavis
Damek
1 month
In this note w/ @beenwrekt we look at RL problems with 0/1 rewards, showing that popular methods maximize the average (transformed) probability of correctly answering a prompt x: max_θ 𝔼ₓ h(Prob(correct ∣ x; θ)) for certain functions h. Weirdly, h is arcsin(√t) in GRPO.
10
39
362
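The simplest member of this family is plain REINFORCE with a 0/1 reward, where the expected score-function update is exactly ∇_θ Prob(correct | x; θ), i.e. h(t) = t. Here is a minimal numerical sanity check on a one-prompt Bernoulli toy model (the sigmoid parametrization and the toy setup are an illustration, not taken from the note):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "policy": a single prompt with success probability p(theta) = sigmoid(theta).
theta = 0.3
p = 1.0 / (1.0 + np.exp(-theta))

# Score of a Bernoulli outcome under the sigmoid parametrization:
# d/dtheta log P(R; theta) = R - p(theta).
score = lambda r: r - p

# Monte Carlo estimate of the plain REINFORCE update E[R * score(R)].
rewards = rng.binomial(1, p, size=2_000_000)
reinforce_update = np.mean(rewards * score(rewards))

# For h(t) = t, the target gradient is d/dtheta p(theta) = p(1 - p).
print("E[R * score]      ~", reinforce_update)
print("d/dtheta p(theta) =", p * (1 - p))
```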

Replies

@damekdavis
Damek
1 month
@beenwrekt We also include a short recipe for constructing RL algorithms that target an arbitrary function h based on Bernstein polynomials. We hope this note helps clarify how RL on LLMs can be significantly simpler than the classical setting. Link to paper: https://t.co/kr8UcCrJeg
1
1
20
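For context on the Bernstein angle: the n-th Bernstein polynomial of h is B_n h(p) = Σ_{k=0}^{n} h(k/n) C(n,k) p^k (1−p)^{n−k}, and if a group of n independent 0/1 rewards has K successes then E[h(K/n)] = B_n h(p), which is what makes a sampling-based recipe plausible. A quick sketch of the approximation itself (standard Bernstein math, not necessarily the note's exact construction):

```python
import numpy as np
from math import comb

def bernstein_approx(h, n, p):
    """n-th Bernstein polynomial of h at p:
    B_n h(p) = sum_k h(k/n) * C(n, k) * p**k * (1 - p)**(n - k)."""
    k = np.arange(n + 1)
    basis = np.array([comb(n, int(i)) for i in k]) * p**k * (1 - p)**(n - k)
    return float(np.sum(np.array([h(i / n) for i in k]) * basis))

# Example target transform: h(t) = arcsin(sqrt(t)), the GRPO case from the note.
h = lambda t: np.arcsin(np.sqrt(t))

for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"p={p:.1f}  h(p)={h(p):.4f}  "
          f"B_8 h(p)={bernstein_approx(h, 8, p):.4f}  "
          f"B_64 h(p)={bernstein_approx(h, 64, p):.4f}")
```

The coefficients h(k/n) hint at how a group of n samples could be reweighted to target an arbitrary h.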
@damekdavis
Damek
1 month
@beenwrekt oh btw here are three objectives you can get from popular RL methods (log loss comes from rejection sampling). we are really not sure why you'd prefer one over the other. But seeing this plot makes me wonder about the magical belief in GRPO.
2
0
16
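Since the plot itself isn't reproduced here, a rough text stand-in: as I read the thread, the three transforms are h(t) = t for the plain policy gradient (the identity case is an assumption on my part), h(t) = log t for rejection sampling, and h(t) = arcsin(√t) for GRPO. Their derivatives show how differently they weight hard versus easy prompts:

```python
import numpy as np

# Three candidate transforms h, represented by their derivatives h'.
transforms = {
    "identity   h(t) = t":          lambda t: np.ones_like(t),
    "log        h(t) = log t":      lambda t: 1.0 / t,
    "GRPO-like  h(t) = arcsin(√t)": lambda t: 1.0 / (2.0 * np.sqrt(t * (1.0 - t))),
}

# h'(p) is the gradient weight a prompt with success probability p receives,
# relative to the identity objective.
ps = np.array([0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99])
print("p:".ljust(30), "  ".join(f"{p:6.2f}" for p in ps))
for name, dh in transforms.items():
    print(name.ljust(30), "  ".join(f"{w:6.2f}" for w in dh(ps)))
```

The log transform pours gradient weight onto hard prompts (small p), while arcsin(√t) upweights both extremes relative to the middle; the identity weights every prompt equally.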
@damekdavis
Damek
1 month
@beenwrekt The advantage family that we considered in draft 1 is a little messy. I cleaned it up after realizing it can represent any function Z_i = f(R_i, S_i) of the reward R_i and the sum S_i of the remaining rewards (the group total with R_i left out). It will appear in version 2.
0
0
4
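One concrete instance of that family, assuming the usual z-scored GRPO advantage: with 0/1 rewards the group standard deviation is determined by the group mean (since R_j² = R_j), so the advantage is a function of R_i and S_i = Σ_{j≠i} R_j alone. A small check of that algebra (an illustration, not code from the paper):

```python
import numpy as np

def grpo_advantage(rewards, i, eps=1e-8):
    """Usual z-scored group advantage for sample i."""
    mu = rewards.mean()
    sigma = rewards.std()  # population std of the group
    return (rewards[i] - mu) / (sigma + eps)

def f_of_R_and_S(R_i, S_i, n, eps=1e-8):
    """Same advantage written only in terms of R_i and S_i = sum of the other
    rewards: for 0/1 rewards the group std equals sqrt(mu * (1 - mu))."""
    mu = (R_i + S_i) / n
    sigma = np.sqrt(mu * (1.0 - mu))
    return (R_i - mu) / (sigma + eps)

rng = np.random.default_rng(0)
n = 8
rewards = rng.binomial(1, 0.4, size=n).astype(float)
for i in range(n):
    S_i = rewards.sum() - rewards[i]
    assert np.isclose(grpo_advantage(rewards, i), f_of_R_and_S(rewards[i], S_i, n))
print("The z-scored advantage depends only on (R_i, S_i) when rewards are 0/1.")
```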
@gh_aminian
Gholamali Aminian
1 month
@damekdavis @beenwrekt What are the main differences between your results and those in https://t.co/1jOfc7SwBa or the blog version https://t.co/qO4FtCntss?
1
0
1
@damekdavis
Damek
1 month
@gh_aminian @beenwrekt Those seem to just be about the clipping loss where one samples from an old policy. We're instead looking at the policy gradient update induced by GRPO and realizing it is an unbiased estimate of the gradient of h(Prob(correct)).
1
0
0
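A toy way to see the claim: take a single prompt with success probability p(θ) = σ(θ), average the z-scored group update over many groups, and compare it with d/dθ arcsin(√p(θ)). The toy model, group size, and the zero-advantage convention for all-identical groups below are assumptions for the sketch, not the note's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_grpo_update(theta, n_group=64, n_trials=50_000):
    """Monte Carlo mean of the z-scored group policy-gradient update for a
    single-prompt Bernoulli model with p(theta) = sigmoid(theta)."""
    p = 1.0 / (1.0 + np.exp(-theta))
    R = rng.binomial(1, p, size=(n_trials, n_group)).astype(float)
    mu = R.mean(axis=1, keepdims=True)
    sigma = R.std(axis=1, keepdims=True)
    safe = np.where(sigma > 0, sigma, 1.0)
    Z = np.where(sigma > 0, (R - mu) / safe, 0.0)   # zero advantage if all rewards agree
    score = R - p                                   # d/dtheta log P(R; theta) under sigmoid
    return (Z * score).mean(axis=1).mean(), p

for theta in [-1.0, 0.0, 1.0]:
    g, p = mean_grpo_update(theta)
    target = np.sqrt(p * (1 - p)) / 2   # d/dtheta arcsin(sqrt(p(theta)))
    print(f"theta={theta:+.1f}  E[update]~{g:.4f}  "
          f"d/dtheta arcsin(sqrt p)={target:.4f}  ratio~{g / target:.2f}")
```

The ratio comes out close to 2 across θ, i.e. the averaged update points along ∇ arcsin(√p) up to a constant; the small wobble is a finite-group artifact of this toy check, not something to pin on the note.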
@GaotangLi
Gaotang Li
1 month
@damekdavis @beenwrekt Great notes! We have some similar insights in supervised learning settings https://t.co/NCXEnBY8A2. We find that different transformations suit different settings.
0
0
3
@asemic_horizon
bookwriting chicken
1 month
@damekdavis @beenwrekt arcsin is one of those data scientist hacks that never really gets published anywhere.
1
0
1
@XTXinverseXTY
Curcio
1 month
@damekdavis @beenwrekt I'd seen arcsin(√t) as a variance-stabilizing transform for proportions (as a hack to apply WLS methods to % data). If GRPO z-scores binary rewards, I guess I shouldn't be *too* surprised to see it here.
1
0
1
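For anyone who hasn't run into it: arcsin(√·) is the classical variance-stabilizing transform for binomial proportions, with Var(arcsin(√p̂)) ≈ 1/(4n) nearly independent of p. A quick check of that textbook fact (nothing here is specific to the note):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100            # observations per proportion estimate
trials = 200_000   # number of estimates

print(" p    Var(p_hat)   Var(arcsin(sqrt(p_hat)))   1/(4n)")
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    p_hat = rng.binomial(n, p, size=trials) / n
    print(f"{p:.1f}   {p_hat.var():.5f}      "
          f"{np.arcsin(np.sqrt(p_hat)).var():.5f}                  {1 / (4 * n):.5f}")
```

The raw variance p(1 − p)/n swings by roughly a factor of three across this range, while the transformed variance stays close to 1/(4n) = 0.0025.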
@GUT_AI_F
GUT-AI Foundation — AI/acc
1 month
@damekdavis @beenwrekt The title is extremely misleading. If you are going to focus on “LLM” junk science papers and RL algorithms that have not been replicated outside NLP, then it would be good for the title to reflect this focus.
0
0
1
@HomoAgnosia
Etienne
29 days
@damekdavis @beenwrekt You've perfectly defined the objective: max Prob(correct). The problem is that standard architectures (M-4) are chaotic and fail to maximize this. They get trapped in local minima. We just ran the "Bake-Off" on an architecture built for this. Setup: – M-4 (control): a standard
0
0
0
@dustin_zeb
Dustin
1 month
@damekdavis @beenwrekt This approach offers a fascinating perspective on reinforcement learning objectives, especially with the transformation function h. It would be interesting to see how this impacts the development of more robust models in complex environments.
0
0
0