In this note w/ @beenwrekt we look at RL problems with 0/1 rewards, showing that popular methods maximize the average (transformed) probability of correctly answering a prompt x: max_θ 𝔼ₓ h(Prob(correct ∣ x; θ)) for certain functions h. Weirdly, h is arcsin(√t) in GRPO.
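A toy check of the arcsin(√t) claim, not taken from the note itself: a single-prompt Bernoulli "policy" with a sigmoid parameter, GRPO-style within-group z-scored advantages, and a comparison against the analytic derivative of 2·arcsin(√p). The group size, eps, and parameterization below are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = -0.5                         # scalar "policy" parameter (toy)
p = 1.0 / (1.0 + np.exp(-theta))     # Prob(correct) under a sigmoid parameterization
G, n_groups, eps = 256, 20_000, 1e-8

grads = []
for _ in range(n_groups):
    a = rng.binomial(1, p, size=G)            # 0/1 rewards for one group of samples
    score = a - p                             # d/dtheta log pi(a_i) for this toy policy
    adv = (a - a.mean()) / (a.std() + eps)    # GRPO-style z-scored advantage within the group
    grads.append(np.mean(adv * score))        # the group's policy-gradient update

print(f"GRPO-style estimate:            {np.mean(grads):.4f}")
print(f"d/dtheta 2*arcsin(sqrt(p(th))): {np.sqrt(p * (1 - p)):.4f}")
```

For large groups the two numbers agree up to finite-group effects, which is the sense (up to the constant factor of 2) in which the z-scored GRPO update tracks the gradient of arcsin(√Prob(correct)).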
@beenwrekt We also include a short recipe for constructing RL algorithms that target an arbitrary function h based on Bernstein polynomials. We hope this note helps clarify how RL on LLMs can be significantly simpler than the classical setting. Link to paper: https://t.co/kr8UcCrJeg
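A quick illustration of the Bernstein connection (the note's actual recipe isn't reproduced here; this is just the standard identity it presumably builds on): averaging h over a group's mean reward is exactly a degree-G Bernstein polynomial of h, which tends to h(p) as the group grows. Parameters and names below are mine.

```python
import numpy as np
from math import comb

def bernstein(h, p, G):
    """Degree-G Bernstein polynomial of h evaluated at p."""
    k = np.arange(G + 1)
    weights = np.array([comb(G, int(j)) for j in k]) * p**k * (1 - p)**(G - k)
    return float(np.sum(h(k / G) * weights))

h = lambda t: np.arcsin(np.sqrt(t))   # the GRPO transform from the thread
p, G = 0.3, 32

# The same quantity, viewed as a group average: E[h(mean of G Bernoulli(p) rewards)].
rng = np.random.default_rng(1)
group_avg = np.mean([h(rng.binomial(1, p, size=G).mean()) for _ in range(50_000)])

print(f"h(p)            = {h(p):.4f}")
print(f"Bernstein deg {G} = {bernstein(h, p, G):.4f}")
print(f"group average   = {group_avg:.4f}")  # matches the Bernstein value; both -> h(p) as G grows
```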
@beenwrekt oh btw here are three objectives you can get from popular RL methods (log loss comes from rejection sampling). We are really not sure why you'd prefer one over the other. But seeing this plot makes me wonder about the magical belief in GRPO.
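The plot isn't attached in this transcript, so here is a rough stand-in that tabulates the three transforms over a range of success probabilities. arcsin(√t) (GRPO) and log t (rejection sampling) are named in the thread; reading plain policy gradient as the identity transform is my assumption.

```python
import numpy as np

# Candidate transforms h(t); arcsin(sqrt(t)) and log(t) are named in the thread,
# the identity row (plain policy gradient) is my assumption.
transforms = {
    "identity":        lambda t: t,
    "arcsin(sqrt(t))": lambda t: np.arcsin(np.sqrt(t)),
    "log(t)":          np.log,
}

ts = [0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99]
for name, h in transforms.items():
    row = "  ".join(f"{h(t):+6.2f}" for t in ts)
    print(f"{name:16s} {row}")
```

The derivatives are where they differ: log t places unbounded weight on prompts the model almost never solves, arcsin(√t) up-weights both very hard and nearly-solved prompts, and the identity weights every prompt equally.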
@beenwrekt The advantage family we considered in draft 1 is a little messy. I cleaned it up after realizing it can represent any function Z_i = f(R_i, S_i) of the reward R_i and the leave-one-out sum S_i, i.e., the group's total reward with R_i left out. It will appear in version 2.
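A hypothetical sketch of that family in numpy: compute each leave-one-out sum S_i and apply an arbitrary f. The particular f shown (a leave-one-out mean baseline) is only an illustrative member, not necessarily one the paper singles out.

```python
import numpy as np

def advantages(rewards, f):
    """Z_i = f(R_i, S_i), with S_i the sum of the group's rewards excluding R_i."""
    rewards = np.asarray(rewards, dtype=float)
    loo_sums = rewards.sum() - rewards          # every S_i at once
    return f(rewards, loo_sums)

# Illustrative member of the family: a leave-one-out mean baseline,
# f(r, s) = r - s / (G - 1). (My choice for the example, not necessarily the paper's.)
rewards = [1, 0, 0, 1, 1, 0, 0, 0]
G = len(rewards)
print(advantages(rewards, lambda r, s: r - s / (G - 1)))
```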
@damekdavis @beenwrekt What are the main differences between your results and those in https://t.co/1jOfc7SwBa or the blog version https://t.co/qO4FtCntss?
@gh_aminian @beenwrekt Those seem to just be about the clipping loss, where one samples from an old policy. We're instead looking at the policy gradient update induced by GRPO and realizing it is an unbiased estimate of the gradient of h(Prob(correct)).
@damekdavis @beenwrekt Great notes! We have some similar insights in supervised learning settings https://t.co/NCXEnBY8A2. We find that different transformations suit different settings.
@damekdavis @beenwrekt arcsin is one of those data scientist hacks that never really gets published anywhere.
@damekdavis @beenwrekt I'd seen arcsin(√t) as a variance-stabilizing transform for proportions (as a hack to apply WLS methods to % data). If GRPO z-scores binary rewards, I guess I shouldn't be *too* surprised to see it here.
@damekdavis @beenwrekt The title is extremely misleading. If you are going to focus on “LLM” junk science papers and RL algorithms that have not been replicated outside NLP, then it would be good for the title to reflect this focus.
@damekdavis @beenwrekt You've perfectly defined the objective: max Prob(correct). The problem is that standard architectures (M-4) are chaotic and fail to maximize this. They get trapped in local minima. We just ran the "Bake-Off" on an architecture built for this. Setup: – M-4 (control): a standard
@damekdavis @beenwrekt This approach offers a fascinating perspective on reinforcement learning objectives, especially with the transformation function h. It would be interesting to see how this impacts the development of more robust models in complex environments.