Rafael Rafailov @ NeurIPS

@rm_rafailov

Followers: 7K · Following: 3K · Media: 122 · Statuses: 1K

Ph.D. Student at @StanfordAILab. I work on Foundation Models and Decision Making. Previously @GoogleDeepMind @UCBerkeley

Stanford, CA
Joined May 2023
@rm_rafailov
Rafael Rafailov @ NeurIPS
6 months
We have a new position paper on "inference time compute" and what we have been working on for the last few months! We present some theory on why it is necessary, how it works, why we need it, and what it means for "super" intelligence.
@rm_rafailov
Rafael Rafailov @ NeurIPS
15 days
Prefill the replay buffer, guys.
@jaseweston
Jason Weston
17 days
🌉 Bridging Offline & Online RL for LLMs 🌉
📝 New paper shows, on verifiable & non-verifiable tasks:
- Online DPO & GRPO give similar performance.
- Semi-online (iterative) DPO with sync every s steps (more efficient!) works very well also.
- Offline DPO
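For context, what separates offline, semi-online, and online DPO in the quoted setup is only how often the preference data is regenerated from the current policy; the loss itself stays the same. Below is a minimal sketch of that schedule, assuming hypothetical helpers `sample_preference_pairs` and `policy_logprobs` (mine, not from the paper), with the standard DPO objective written out in PyTorch.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen_lp, pi_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    pi_logratio = pi_chosen_lp - pi_rejected_lp
    ref_logratio = ref_chosen_lp - ref_rejected_lp
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()

# Semi-online schedule (sketch): s = 1 recovers fully online DPO,
# s = num_steps recovers offline DPO, anything in between is "semi-online".
def train_semi_online(policy, ref, prompts, optimizer, num_steps, s):
    buffer = None
    for step in range(num_steps):
        if step % s == 0:
            # Hypothetical helper: sample responses from the CURRENT policy
            # and label chosen/rejected pairs (e.g. with a verifier or judge).
            buffer = sample_preference_pairs(policy, prompts)
        batch = buffer.sample()  # hypothetical preference buffer
        loss = dpo_loss(
            policy_logprobs(policy, batch.chosen),    # hypothetical helper
            policy_logprobs(policy, batch.rejected),
            policy_logprobs(ref, batch.chosen),
            policy_logprobs(ref, batch.rejected),
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```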
@rm_rafailov
Rafael Rafailov @ NeurIPS
17 days
It’s the future.
@RylanSchaeffer
Rylan Schaeffer
17 days
Third #ICML2025 paper! What effect will web-scale synthetic data have on future deep generative models? Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World 🔄 @JoshuaK92829 @ApratimDey2 @MGerstgrasser @rm_rafailov @sanmikoyejo 1/7
@rm_rafailov
Rafael Rafailov @ NeurIPS
22 days
RT @synth_labs: Our new method (ALP) monitors solve rates across RL rollouts and applies inverse difficulty penalties during RL training…
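The RT is truncated, so the exact ALP formulation is not shown here; the sketch below is only one plausible reading of "monitor solve rates across rollouts and apply inverse difficulty penalties": estimate each prompt's difficulty from the empirical solve rate of its rollout group, then scale a penalty (here, hypothetically, on response length) by that solve rate, so easy prompts pay a larger cost.

```python
import numpy as np

def inverse_difficulty_penalties(rewards, lengths, alpha=0.1):
    """One plausible reading, not the paper's exact formulation.
    rewards: (n_rollouts,) binary solved/unsolved for ONE prompt's rollout group
    lengths: (n_rollouts,) response lengths for the same rollouts
    Returns penalized rewards where the penalty grows with the solve rate
    (high solve rate = easy prompt = larger penalty)."""
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    solve_rate = rewards.mean()                    # difficulty proxy, in [0, 1]
    norm_len = lengths / max(lengths.max(), 1.0)   # normalize lengths to [0, 1]
    return rewards - alpha * solve_rate * norm_len

# Example: an easy prompt (4/5 solved) gets a noticeable penalty,
# a hard prompt (1/5 solved) is left almost untouched.
print(inverse_difficulty_penalties([1, 1, 1, 1, 0], [120, 300, 80, 500, 400]))
print(inverse_difficulty_penalties([0, 0, 1, 0, 0], [120, 300, 80, 500, 400]))
```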
@rm_rafailov
Rafael Rafailov @ NeurIPS
25 days
No way man, one sample is all you need to collapse!
@teortaxesTex
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
25 days
How is model collapse still debated seriously? Just stop. This is naiveté that belongs in 2023.
@rm_rafailov
Rafael Rafailov @ NeurIPS
1 month
RT @ZiyuX: Check out this work on benchmarking how well LLMs can implement ML research papers into code, led by @tianyu_hua!
@rm_rafailov
Rafael Rafailov @ NeurIPS
1 month
It’s been very surprising how few people understand this.
@robinphysics
Yunhao (Robin) Tang
1 month
Maybe to one's surprise, taking KL estimates as `kl_loss` to minimize does *not* enforce the KL. This implementation, however, is quite common in open source RL repos and recent research papers. In short: grad of an unbiased KL estimate is not an unbiased estimate of KL grad.
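A concrete way to see the point in the quoted tweet: on a toy categorical policy, backpropagating through the k1 estimate log pi(x) - log ref(x) gives a gradient whose expectation is zero (since E[grad log pi] = 0), not the KL gradient, whereas a score-function surrogate does recover it. A minimal PyTorch check, my own illustration rather than the tweet author's code:

```python
import torch

torch.manual_seed(0)
logits = torch.randn(5, requires_grad=True)   # policy pi_theta over 5 tokens
ref_logits = torch.randn(5)                   # frozen reference policy

def log_probs(l):
    return l - torch.logsumexp(l, dim=-1, keepdim=True)

# Exact KL(pi || ref) and its true gradient via autograd.
lp, lr = log_probs(logits), log_probs(ref_logits)
true_kl = (lp.exp() * (lp - lr)).sum()
(true_grad,) = torch.autograd.grad(true_kl, logits)

# Monte-Carlo samples from the current policy.
n = 200_000
x = torch.distributions.Categorical(logits=logits.detach()).sample((n,))
lp_x = log_probs(logits)[x]        # log pi(x), differentiable w.r.t. logits
lr_x = log_probs(ref_logits)[x]    # log ref(x), constant

# (a) Common pattern: treat the unbiased k1 estimate as a loss and backprop.
#     Its gradient has expectation E[grad log pi] = 0, NOT the KL gradient.
k1_loss = (lp_x - lr_x).mean()
(grad_k1,) = torch.autograd.grad(k1_loss, logits, retain_graph=True)

# (b) Score-function surrogate: detach the estimate and multiply by log pi(x).
#     Its gradient, E[(log pi - log ref) * grad log pi], IS the KL gradient.
surrogate = ((lp_x - lr_x).detach() * lp_x).mean()
(grad_sur,) = torch.autograd.grad(surrogate, logits)

print("true KL grad     :", true_grad)
print("grad of k1 'loss':", grad_k1)    # ~ 0 up to Monte-Carlo noise
print("surrogate grad   :", grad_sur)   # matches true_grad up to Monte-Carlo noise
```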
@rm_rafailov
Rafael Rafailov @ NeurIPS
1 month
I make the AI, very nice!
@JamesAlcorn94
James Alcorn
1 month
congrats @rm_rafailov on your hard-earned acceptance to the USofA as alien of officially extraordinary ability. The alien piece comes as no surprise to your mates of course, but at least the general public now has fair warning and a fighting chance. To celebrate with a fitting
@rm_rafailov
Rafael Rafailov @ NeurIPS
1 month
RT @JamesAlcorn94: congrats @rm_rafailov on your hard-earned acceptance to the USofA as alien of officially extraordinary ability. The alie…
@rm_rafailov
Rafael Rafailov @ NeurIPS
1 month
When we first published our work on this 9 months ago, it was rejected for being impractical in realistic cases. Six months later it was rejected for lack of novelty. It's the way academic publishing goes.
@natolambert
Nathan Lambert
1 month
Another generative / inference-time scaling reward modeling paper. It's the direction things are going.
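For readers unfamiliar with the term: a "generative" reward model scores a response by first writing out a judgment in natural language and only then emitting a score, and "inference-time scaling" here just means sampling that judge several times and aggregating. A minimal sketch, assuming a hypothetical `judge(prompt) -> str` completion function rather than any specific paper's interface:

```python
import re
import statistics
from typing import Callable

JUDGE_TEMPLATE = """You are grading an answer.
Question: {question}
Answer: {answer}
Think step by step about correctness and clarity, then end with 'Score: X' where X is 1-10."""

def generative_reward(question: str, answer: str,
                      judge: Callable[[str], str], k: int = 8) -> float:
    """Sample the judge k times (inference-time scaling) and average the parsed scores."""
    scores = []
    for _ in range(k):
        critique = judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
        match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", critique)
        if match:
            scores.append(float(match.group(1)))
    return statistics.mean(scores) if scores else 0.0
```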
@rm_rafailov
Rafael Rafailov @ NeurIPS
1 month
(Meta) CoTs are search inside world models (the prompt is the goal specification).
@jonathanrichens
Jon Richens
1 month
Are world models necessary to achieve human-level agents, or is there a model-free short-cut? Our new #ICML2025 paper tackles this question from first principles, and finds a surprising answer: agents _are_ world models… 🧵
@rm_rafailov
Rafael Rafailov @ NeurIPS
2 months
RT @jaseweston: 🚨 New paper 🚨 J1: Incentivizing Thinking in LLM-as-a-Judge via RL. - Converts judgement task into a verifiable one for both…
@rm_rafailov
Rafael Rafailov @ NeurIPS
2 months
RT @jyangballin: 40% with just 1 try per task: SWE-agent-LM-32B is the new #1 open source model on SWE-bench Verified. We built it by synt…
@rm_rafailov
Rafael Rafailov @ NeurIPS
3 months
GenRMs.
@hr0nix
hr0nix
3 months
LLMs trained to evaluate agentic trajectories give us a powerful way to boost agent performance via test-time search. But single-pass value models have their limitations. Can CoT reasoners be a better alternative? We explore this topic in our latest research blog post. 🧵⬇️
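Stripped of the blog post's specifics, the underlying pattern is simple: roll out several candidate trajectories, score each with an evaluator (a scalar value model in the single-pass case, or a CoT judge as proposed), and keep the best. A hypothetical sketch of that simplest form of test-time search, with `rollout` and `evaluate` standing in for whatever agent and evaluator you have:

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")  # a trajectory, in whatever form your agent framework produces

def best_of_n(task: str,
              rollout: Callable[[str], T],          # hypothetical: run the agent once on the task
              evaluate: Callable[[str, T], float],  # hypothetical: value model or CoT-judge score
              n: int = 16) -> Tuple[T, float]:
    """Test-time search in its simplest form: sample n trajectories, keep the best-scored one."""
    best_traj, best_score = None, float("-inf")
    for _ in range(n):
        traj = rollout(task)
        score = evaluate(task, traj)
        if score > best_score:
            best_traj, best_score = traj, score
    return best_traj, best_score
```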
@rm_rafailov
Rafael Rafailov @ NeurIPS
3 months
RT @aviral_kumar2: At #ICLR25 workshops, my students+collabs will give many oral talks on newer stuff (don't miss!): - robot VLA RL fine-…
@rm_rafailov
Rafael Rafailov @ NeurIPS
3 months
And again…
@DBahdanau
🇺🇦 Dzmitry Bahdanau
3 months
I am excited to open-source PipelineRL - a scalable async RL implementation with in-flight weight updates. Why wait until your bored GPUs finish all sequences? Just update the weights and continue inference! Code: Blog:
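The "in-flight weight update" idea can be sketched with nothing more than two threads and a shared, versioned weight slot: generators check for new weights between decoding chunks instead of waiting for every sequence in the batch to finish. The toy below is my own illustration of the scheduling pattern, not PipelineRL's actual implementation.

```python
import threading
import time
from dataclasses import dataclass, field

@dataclass
class WeightStore:
    """Latest policy weights plus a version counter, shared by trainer and generators."""
    version: int = 0
    weights: dict = field(default_factory=dict)
    lock: threading.Lock = field(default_factory=threading.Lock)

    def publish(self, new_weights):
        with self.lock:
            self.version += 1
            self.weights = new_weights

    def snapshot(self):
        with self.lock:
            return self.version, self.weights

def generator(store: WeightStore, n_tokens: int = 20):
    version, _ = store.snapshot()
    for t in range(n_tokens):
        time.sleep(0.01)                      # stand-in for decoding one token/chunk
        new_version, _ = store.snapshot()
        if new_version != version:            # in-flight update: swap weights mid-sequence
            print(f"  generator switched v{version} -> v{new_version} at token {t}")
            version = new_version
    print(f"  sequence finished on weights v{version}")

def trainer(store: WeightStore, n_updates: int = 3):
    for step in range(n_updates):
        time.sleep(0.05)                      # stand-in for an optimizer step
        store.publish({"step": step + 1})     # push weights without pausing generation

store = WeightStore()
threads = [threading.Thread(target=generator, args=(store,)) for _ in range(2)]
threads.append(threading.Thread(target=trainer, args=(store,)))
for t in threads:
    t.start()
for t in threads:
    t.join()
```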
@rm_rafailov
Rafael Rafailov @ NeurIPS
3 months
Meta-Search.
@jiayi_pirate
Jiayi Pan
3 months
We explore a new dimension in scaling reasoning models in Adaptive Parallel Reasoning. APR lets LMs learn to orchestrate both serial & parallel compute E2E via supervised training + RL — w/ better efficiency and scalability than long CoT on Countdown. 🧵 
@rm_rafailov
Rafael Rafailov @ NeurIPS
3 months
RT @SurajNair_1: Since the first year of my PhD, every talk I’ve given has opened with a slide about the distant north star: dropping a rob…
@rm_rafailov
Rafael Rafailov @ NeurIPS
3 months
RT @agarwl_: Post-training is going to become training.
@rm_rafailov
Rafael Rafailov @ NeurIPS
3 months
It strikes again.
@PrimeIntellect
Prime Intellect
3 months
Asynchronous RL completely eliminates communication bottlenecks. Our ablation studies confirm we maintain performance even with 4-step delays, making decentralized training viable with weak global interconnects.
@rm_rafailov
Rafael Rafailov @ NeurIPS
3 months
“We developed a fully asynchronous online RL training framework that enhanced flexibility. … This innovation resulted in a ~10x improvement in training efficiency over previous generations.” Async distributed RL strikes again!