Max Rumpf
@maxrumpf
Followers: 2K | Following: 2K | Media: 123 | Statuses: 1K
co-founder/ceo https://t.co/RAAaSu0RMJ | ex-researcher @ETH @ycombinator
San Francisco, CA
Joined December 2022
We will soon get to a point, as AI model progress continues, that almost any time something doesn’t work with an AI agent in a reasonably sized task, you will be able to point to a lack of the right information that the agent had access to. This is why context engineering is
77
59
543
another reasonable explanation (we also do multi-turn, tool-use heavy RL:
@srush_nlp Yeah, in multi-turn RL experiments, we actually see pass@N increase with the number of training steps. Maybe you can take a look at our discussion.
0
0
3
Label noise really matters in RL. SID-1's task requires reporting the documents most likely to contain the answer to a question. When the ground-truth data contains errors, the model will start over-reporting in hopes of catching spurious targets. For one public dataset where
0
1
23
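Here is a minimal, back-of-the-envelope sketch of that incentive. The corpus size, the per-document cost term, and the noise levels are my own assumptions, not numbers from the SID-1 report; the point is only that once the label-noise rate is high enough relative to the cost of an extra report, reporting everything strictly dominates reporting just the best candidate.

```python
# Toy model of the over-reporting incentive under label noise (illustrative only).
# Setup: the policy's top-1 document is the true answer. With probability `noise`,
# the recorded gold label points at a random other document instead. The policy
# reports its top-k documents and pays a small penalty per reported document
# (a stand-in for any precision-style term in the reward).

N = 1_000        # corpus size (assumed)
COST = 1e-4      # penalty per reported document (assumed)

def expected_reward(k: int, noise: float) -> float:
    # Hit the label either because it is correct (top-1 is always reported)
    # or by blanket-covering a spurious label (~ k/N chance).
    hit_prob = (1 - noise) + noise * (k / N)
    return hit_prob - COST * k

for noise in (0.0, 0.05, 0.2, 0.5):
    best_k = max(range(1, N + 1), key=lambda k: expected_reward(k, noise))
    print(f"label noise {noise:.2f}: reward-maximizing k = {best_k}")
```

With these numbers the optimum jumps from reporting one document to reporting the whole corpus as soon as the noise rate crosses the cost threshold, which is the over-reporting behavior described above.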
Most RL frameworks are fundamentally unstable. We wasted more H100 hours debugging this than any other issue for our multi-turn, multi-env RL run (below). When using OpenAI-style messages for env interactions, parsing and retokenizing leads to subtly different tokens. This
18
56
544
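To make the failure mode concrete, here is a hedged repro sketch of the decode-then-re-render path. The model name and the strings are placeholders, and this is not veRL's or any specific framework's actual code; it only illustrates how round-tripping sampled tokens through message objects and a chat template can hand the trainer different ids than the policy generated.

```python
# Sketch of the decode -> re-render round trip that can shift tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model

# Token ids the policy actually sampled during the rollout. These are the ids
# the trainer must score for the update to stay on-policy.
sampled_text = 'Searching.\n<tool_call>{"query": "label noise in RL"}</tool_call>\n'
rollout_ids = tok.encode(sampled_text, add_special_tokens=False)

# Framework path: decode the rollout into an OpenAI-style message, then
# re-render the conversation through the chat template before training.
messages = [{"role": "assistant", "content": tok.decode(rollout_ids)}]
rerendered_ids = tok.apply_chat_template(messages, tokenize=True)

# The re-rendered sequence inserts template tokens, and content around message
# boundaries (whitespace, tool-call markup) can merge into different tokens
# than the ones that were sampled, so the logprobs used for the update no
# longer correspond to the trajectory the policy actually generated.
print(rollout_ids)
print(rerendered_ids)
```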
Good RL environments are much richer than you think. We evaluate training for 100 epochs and see eval reward increase steadily. Partly, this is because our RL setting allows obfuscating the answer between epochs, largely mitigating memorization (as seen when inspecting train rollouts).
6
18
146
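The report's exact obfuscation scheme isn't spelled out here, so the following is only one plausible, purely illustrative reading of the idea: if the surface ids of documents are re-drawn every epoch, memorizing last epoch's answer id buys the policy nothing, and it has to retrieve from document content again. The class and id format below are mine, not SID's.

```python
import random

class ToyRetrievalEnv:
    """Illustrative retrieval environment with per-epoch answer obfuscation.

    A guess at the general idea, not the SID training environment: document
    texts stay fixed, but their surface ids are re-drawn each epoch, so a
    memorized mapping like "question 17 -> doc-000042" earns no reward later.
    """

    def __init__(self, docs: list[str], qa_pairs: list[tuple[str, int]]):
        self.docs = docs            # document texts (stable across epochs)
        self.qa_pairs = qa_pairs    # (question, index of the answer document)
        self.new_epoch()

    def new_epoch(self) -> None:
        # fresh random surface ids every epoch
        self.doc_ids = [f"doc-{random.randrange(10**6):06d}" for _ in self.docs]

    def reward(self, qa_index: int, reported_ids: list[str]) -> float:
        gold = self.doc_ids[self.qa_pairs[qa_index][1]]
        return float(gold in reported_ids)
```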
we just released our first model: SID-1. it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and
18
37
372
We believe retrieval is the ideal playground for self-play RL on LLMs. SID-1 was trained with "pseudo self-play": new questions were generated to cover gaps in model behavior as training progressed. We think we're not far away from closing that loop: generating hard, verifiable
6
11
71
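A hedged sketch of what such a "pseudo self-play" data loop could look like. This is not the SID-1 pipeline; `policy`, `question_generator`, and the `corpus` helpers are hypothetical stand-ins for the current retrieval policy, a generator model, and a document store, and the structure is only meant to show where the "verifiable" check sits.

```python
# Illustrative interface sketch, not runnable against any real pipeline.
def pseudo_self_play_round(policy, question_generator, corpus, eval_questions,
                           recall_threshold=0.6):
    """One round: find where the policy is weak, synthesize similar questions
    grounded in a specific document, and keep only the verifiable ones."""
    new_training_questions = []
    # 1. cases the current policy handles poorly
    weak = [q for q in eval_questions if policy.recall(q) < recall_threshold]
    for q in weak:
        # 2. generate a fresh question shaped like the weak case, tied to a document
        doc = question_doc = corpus.sample_document_like(q)      # hypothetical helper
        candidate = question_generator.write_question(doc)       # hypothetical helper
        # 3. verifiable: the answer must actually be recoverable from `doc`
        if candidate and corpus.answer_in_document(candidate, doc):
            new_training_questions.append((candidate, doc))
    return new_training_questions  # fed into the next block of RL training
```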
This is such a sad showing from Zoom. A note from someone who has *trained* a SOTA LLM. Let me explain: @Zoom strung together API calls to Gemini, GPT, Claude et al. and slightly improved on a benchmark that delivers no value for their customers. They then claim SOTA. The
Zoom achieved a new state-of-the-art (SOTA) result on Humanity’s Last Exam (HLE): 48.1% — outperforming other AI models with a 2.3% jump over the previous SOTA. ✨ HLE is one of the most rigorous tests in AI, built to measure real expert-level knowledge and deep reasoning across
29
26
595
it's fascinating how small hparam changes will lead to vastly different behaviors, strategies, and personalities in models. we have no idea why
1
0
8
OpenAI-style messages will make your RL run collapse! This explains behavior seen in Search-R1 and also during training of SID-1. Fix: Handle environment responses directly in tokens. This is NOT the case in veRL. From the report: >Parsing a token list to a message list is
0
0
7
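A hedged sketch of what "handle environment responses directly in tokens" can look like in practice. This is not the SID or veRL code; the model name is a placeholder, `env_step` is a hypothetical callable, and the only point is the shape of the loop: keep one running list of token ids, append the environment output as freshly encoded tokens, and never decode the rollout back into messages to re-render it.

```python
# Sketch, assuming a HF-style tokenizer and model.generate() over token ids.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")     # placeholder
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

def rollout(prompt_ids: list[int], env_step, max_turns: int = 4) -> list[int]:
    """Multi-turn rollout that stays in token space the whole time.

    `env_step` is a hypothetical callable: it takes the decoded model turn and
    returns the environment's textual response, or None when the episode ends.
    """
    ids = list(prompt_ids)
    for _ in range(max_turns):
        out = model.generate(torch.tensor([ids]), max_new_tokens=256)
        new_ids = out[0].tolist()[len(ids):]
        ids += new_ids                               # exactly what was sampled
        env_text = env_step(tok.decode(new_ids))     # the env may read text...
        if env_text is None:
            break
        # ...but its response is appended as tokens; earlier ids are never
        # decoded and re-encoded, so training-time logprobs match the rollout.
        ids += tok.encode(env_text, add_special_tokens=False)
    return ids
```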
SID-1 knows what it needs to know at all times.
0
1
5
first nvidia-smi in space
We have just used the @Nvidia H100 onboard Starcloud-1 to train the first LLM in space! We trained the nano-GPT model from Andrej @Karpathy on the complete works of Shakespeare and successfully ran inference on it. We have also run inference on a preloaded Gemma model, and we
0
0
4
ETH Zurich has 10,000 H100s
@FrancoisChauba1 @agupta Total BS. NYU has the largest GPU cluster of all US academic institutions (500 H200s, bigger than Princeton's) and doesn't even appear on this graph.
0
0
4
it was a privilege to work on this model! it's also great to finally be able to share some of the details in our tech report! we've been working on retrieval for a while. this is the way it should always have worked. the results speak for themselves.
1
0
5
length bias removal in common GRPO settings (Dr. GRPO, Mistral) is unstable and leads to collapse over time. we first observed this while training SID-1 and then proved mathematically that this is bound to happen. this might explain why DeepSeek did not remove the "length bias"!
2
3
18
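For readers who haven't seen the two objectives side by side, this is the term in question as I read the GRPO and Dr. GRPO formulations (KL penalty omitted, notation approximate). The original GRPO objective normalizes each response by its length:

$$
\mathcal{J}_{\text{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\Big(r_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),1-\epsilon,1+\epsilon\big)\,\hat{A}_i\Big)\right],\qquad \hat{A}_i=\frac{R_i-\operatorname{mean}(R)}{\operatorname{std}(R)}
$$

"Length bias removal" in the Dr. GRPO style drops the per-sequence $\tfrac{1}{|o_i|}$ (and the std in the advantage), so every token contributes with equal weight regardless of how long the response is:

$$
\mathcal{J}_{\text{Dr. GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\min\Big(r_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),1-\epsilon,1+\epsilon\big)\,\hat{A}_i\Big)\right],\qquad \hat{A}_i=R_i-\operatorname{mean}(R)
$$

where $r_{i,t}(\theta)=\pi_\theta(o_{i,t}\mid q,o_{i,<t})/\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,o_{i,<t})$ is the usual importance ratio.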
"AI written" is the ultimate insult for a piece of writing. if the author admits an llm wrote it, they seem lazy. if the author insists they wrote it, they seem dull. lazy seems more palatable.
0
0
4