Max Rumpf

@maxrumpf

Followers: 2K
Following: 2K
Media: 123
Statuses: 1K

co-founder/ceo https://t.co/RAAaSu0RMJ | ex-researcher @ETH @ycombinator

San Francisco, CA
Joined December 2022
@levie
Aaron Levie
11 days
We will soon get to a point, as AI model progress continues, that almost any time something doesn’t work with an AI agent in a reasonably sized task, you will be able to point to a lack of the right information that the agent had access to. This is why context engineering is
77
59
543
@maxrumpf
Max Rumpf
9 days
the best the future could muster: a power outage in SF. sad.
0
0
2
@maxrumpf
Max Rumpf
9 days
another reasonable explanation (we also do multi-turn, tool-use-heavy RL):
@ShengjieWa34067
Shengjie Wang
10 days
@srush_nlp Yeah, in multi-turn RL experiments, we actually see pass@N increase with the number of training steps. Maybe you can take a look at our discussion.
0
0
3
@maxrumpf
Max Rumpf
10 days
We improve both pass@1 AND pass@n during training. The issue is that many claimants: 1) train on domains with heavy mid-/post-training in the base models (math), and 2) don't train for very long. In many of these small-scale experiments, gains come from re-learning the format
@srush_nlp
Sasha Rush
11 days
There is significant discussion in the academic literature about RL making models better at pass@1 and *worse* at pass@N (or related claims). We run a lot of RL runs at Cursor and don't see this issue systematically. Not doubting it occurs, but something else might be going on.
7
9
99
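The pass@1 vs. pass@N discussion above can be made concrete with the standard unbiased pass@k estimator (the combinatorial formula popularized by the HumanEval/Codex evaluation), sketched here for reference:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled solutions of which c are
    correct, return the probability that at least one of k randomly drawn
    samples is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Tracking this quantity for several k across training steps is how one would check whether an RL run trades pass@N for pass@1.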
@maxrumpf
Max Rumpf
10 days
Label noise really matters in RL. SID-1's task requires reporting the documents most likely to contain the answer to a question. When the ground-truth data contains errors, the model will start over-reporting in hopes of catching spurious targets. For one public dataset where
@SID_AI
SID
24 days
we just released our first model: SID-1 it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and
0
1
23
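The over-reporting incentive described above can be reproduced in a toy Monte-Carlo simulation (a hypothetical setup, not SID's actual reward): if some labels point at the wrong document, padding the report with extra documents catches those spurious targets and raises expected reward.

```python
import random

def expected_reward(report_size, n_docs=100, noise=0.2, trials=20000, seed=0):
    """Toy simulation: reward is 1 if the *labeled* target is in the report.
    With probability `noise` the label points at a random (wrong) document.
    The model always includes the true answer, then pads with random docs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        true_doc = 0
        if rng.random() < noise:
            label = rng.randrange(1, n_docs)  # spurious target
        else:
            label = true_doc
        report = {true_doc} | {rng.randrange(n_docs) for _ in range(report_size - 1)}
        hits += label in report
    return hits / trials
```

Under these assumptions, a larger report strictly increases expected reward even though precision drops, which is exactly the degenerate policy label noise incentivizes.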
@maxrumpf
Max Rumpf
13 days
Most RL frameworks are fundamentally unstable. We wasted more H100 hours debugging this than any other issue for our multi-turn, multi-env RL run (below). When using OpenAI-style messages for env interactions, parsing and retokenizing leads to subtly different tokens. This
@SID_AI
SID
24 days
we just released our first model: SID-1 it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and
18
56
544
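The retokenization mismatch is easy to demonstrate with a toy greedy tokenizer (a stand-in for real BPE, not any specific framework's code): decoding sampled ids to a string and re-encoding that string can yield different ids for the identical text.

```python
# Toy longest-match tokenizer: a stand-in for BPE merge behavior.
VOCAB = {"ab": 2, "a": 0, "b": 1}
INV = {i: s for s, i in VOCAB.items()}

def encode(text):
    """Greedy longest-match tokenization over the toy vocab."""
    ids, i = [], 0
    while i < len(text):
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(VOCAB[piece])
                i += len(piece)
                break
    return ids

sampled = [0, 1]  # the model happened to sample "a" then "b"
roundtrip = encode("".join(INV[t] for t in sampled))
# roundtrip is [2]: same string, different token ids, so logprobs
# computed on the re-tokenized sequence are silently off-policy.
```

Real BPE vocabularies have thousands of such ambiguous merge points, which is why message-list round trips bite at scale.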
@maxrumpf
Max Rumpf
16 days
Good RL environments are much richer than you think. We evaluate training for 100 epochs and see eval reward increase steadily. Partly, this is because our RL setting allows obfuscating the answer between epochs, largely mitigating memorization (when inspecting train rollouts).
@SID_AI
SID
24 days
we just released our first model: SID-1 it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and
6
18
146
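One hypothetical way to implement the between-epoch answer obfuscation mentioned above (the tweet does not specify SID's actual scheme) is to permute document identifiers each epoch, so an id memorized in a previous epoch no longer points at the answer:

```python
import random

def obfuscate_epoch(corpus, answer_ids, epoch_seed):
    """Sketch: permute document ids each epoch (hypothetical scheme).
    `corpus` maps doc id -> text; `answer_ids` lists the gold doc ids.
    Returns the relabeled corpus and the answers' new ids."""
    rng = random.Random(epoch_seed)
    doc_ids = list(corpus)
    perm = doc_ids[:]
    rng.shuffle(perm)
    mapping = dict(zip(doc_ids, perm))  # old id -> new id
    new_corpus = {mapping[d]: text for d, text in corpus.items()}
    new_answers = [mapping[d] for d in answer_ids]
    return new_corpus, new_answers
```

The content is unchanged, so reward can only come from actually reading documents rather than recalling which id scored last epoch.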
@SID_AI
SID
24 days
we just released our first model: SID-1 it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and
18
37
372
@maxrumpf
Max Rumpf
17 days
We believe retrieval is the ideal playground for self-play RL on LLMs. SID-1 was trained with "pseudo self-play:" New questions were generated to cover gaps in model behavior as training progressed. We think we're not far away from closing that loop: Generating hard, verifiable
@SID_AI
SID
24 days
we just released our first model: SID-1 it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and
6
11
71
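The "pseudo self-play" loop described above can be sketched as a simple iteration (all three callables here are hypothetical interfaces, not SID's actual pipeline): evaluate, generate questions that target observed gaps, train, repeat.

```python
def pseudo_self_play(train_step, eval_failures, generate_questions, rounds=3):
    """Sketch of a pseudo-self-play loop: after each training round,
    new questions are generated to cover gaps the evaluator found."""
    questions = []
    for _ in range(rounds):
        failures = eval_failures()             # where does the model fail?
        questions += generate_questions(failures)  # cover those gaps
        train_step(questions)                  # RL on the growing set
    return questions
```

Closing the loop fully would mean the question generator is itself the trained model, playing against its own retrieval policy.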
@maxrumpf
Max Rumpf
17 days
This is such a sad showing from Zoom. A note from someone who has *trained* a SOTA LLM. Let me explain: @Zoom strung together API calls to Gemini, GPT, Claude et al. and slightly improved on a benchmark that delivers no value for their customers. They then claim SOTA. The
@Zoom
Zoom
18 days
Zoom achieved a new state-of-the-art (SOTA) result on Humanity’s Last Exam (HLE): 48.1% — outperforming other AI models with a 2.3% jump over the previous SOTA. ✨ HLE is one of the most rigorous tests in AI, built to measure real expert-level knowledge and deep reasoning across
29
26
595
@maxrumpf
Max Rumpf
18 days
it's fascinating how small hparam changes will lead to vastly different behaviors, strategies, and personalities in models. we have no idea why
1
0
8
@maxrumpf
Max Rumpf
19 days
OpenAI-style messages will make your RL run collapse! This explains behavior seen in Search-R1 and also during training of SID-1. Fix: Handle environment responses directly in tokens. This is NOT the case in veRL. From the report: >Parsing a token list to a message list is
@SID_AI
SID
24 days
we just released our first model: SID-1 it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and
0
0
7
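The fix named above ("handle environment responses directly in tokens") amounts to a token-in/token-out rollout loop: the environment's text is tokenized once and appended to the running id sequence, and the ids the policy sampled are never parsed into messages and re-tokenized. A minimal sketch, assuming hypothetical `policy_step`, `env_step`, and `tokenizer` interfaces:

```python
def rollout(policy_step, env_step, tokenizer, prompt_ids, max_turns=4):
    """Token-in/token-out multi-turn rollout sketch. The exact ids the
    policy sampled are the ids used for the loss; only the *new*
    observation text is ever encoded, the history is never re-parsed."""
    ids = list(prompt_ids)
    for _ in range(max_turns):
        action_ids = policy_step(ids)          # sampled token ids, kept as-is
        ids += action_ids
        obs_text = env_step(tokenizer.decode(action_ids))
        if obs_text is None:                   # episode finished
            break
        ids += tokenizer.encode(obs_text)      # append; never round-trip ids
    return ids
```

Contrast with the message-list style, where the whole transcript is re-templated and re-tokenized every turn, reintroducing the off-policy token drift.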
@maxrumpf
Max Rumpf
19 days
SID-1 knows what it needs to know at all times.
@SID_AI
SID
24 days
we just released our first model: SID-1 it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and
0
1
5
@maxrumpf
Max Rumpf
19 days
first nvidia-smi in space
@AdiOltean
Adi Oltean
19 days
We have just used the @Nvidia H100 onboard Starcloud-1 to train the first LLM in space! We trained the nano-GPT model from Andrej @Karpathy on the complete works of Shakespeare and successfully ran inference on it. We have also run inference on a preloaded Gemma model, and we
0
0
4
@maxrumpf
Max Rumpf
20 days
1e-6 is the best lr for grpo
@kalomaze
kalomaze
21 days
@Sauers_ 1e-4 is aggressive, start with 1e-6 and increase as necessary, rollouts of 8 is fine, don't overdo WD especially if fft, that's basically it
1
0
2
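The advice in this exchange can be collected into a starting-point config (rules of thumb from the thread, not guarantees; the dict name is illustrative):

```python
# Hedged GRPO starting points, echoing the thread's advice.
GRPO_DEFAULTS = {
    "learning_rate": 1e-6,  # start low, increase only if learning stalls
    "group_size": 8,        # rollouts per prompt ("rollouts of 8 is fine")
    "weight_decay": 0.0,    # don't overdo WD, especially for full finetuning
}
```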
@maxrumpf
Max Rumpf
21 days
ETH Zurich has 10,000 H100s
@ylecun
Yann LeCun
22 days
@FrancoisChauba1 @agupta Total BS. NYU has the largest GPU cluster of all US academic institutions (500 H200. Bigger than Princeton's) and doesn't even appear on this graph.
0
0
4
@maxrumpf
Max Rumpf
24 days
it was a privilege to work on this model! it's also great to finally be able to share some of the details in our tech report! we've been working on retrieval for a while. this is the way it should always have worked. the results speak for themselves.
@SID_AI
SID
24 days
we just released our first model: SID-1 it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and
1
0
5
@maxrumpf
Max Rumpf
24 days
length-bias removal in common GRPO settings (Dr. GRPO, Mistral) is unstable and leads to collapse over time. we first observed this during training of SID-1 and then proved mathematically that this is bound to happen. this might explain why DeepSeek did not remove the "length bias"!
2
3
18
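For readers unfamiliar with the two aggregations at issue: original GRPO divides each rollout's summed token loss by its own length, while "length-bias-removed" variants (Dr. GRPO-style) divide by one shared constant, so longer rollouts carry proportionally more gradient. A sketch of the two (this illustrates the objectives only, not the tweet's instability proof):

```python
def grpo_seq_loss(token_losses, remove_length_bias):
    """Aggregate per-token losses over one group of rollouts.
    token_losses: list of per-rollout lists of token losses.
    remove_length_bias=False -> original GRPO (per-sequence 1/|o_i|);
    remove_length_bias=True  -> Dr. GRPO-style shared constant normalizer."""
    n = len(token_losses)
    if remove_length_bias:
        const = max(len(t) for t in token_losses)  # one shared normalizer
        return sum(sum(t) / const for t in token_losses) / n
    return sum(sum(t) / len(t) for t in token_losses) / n
```

With uniform token losses, the per-sequence normalizer makes a 2-token and an 8-token rollout contribute equally, while the shared constant lets the long rollout dominate; that length-proportional weighting is the "bias" the variants remove.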
@maxrumpf
Max Rumpf
1 month
git blame for google docs would light the world on fire
0
0
3
@maxrumpf
Max Rumpf
1 month
"AI written" is the ultimate insult for a piece of writing. if the author admits an llm wrote it, they seem lazy. if the author insists they wrote it, they seem dull. lazy seems more palatable.
0
0
4