Max Rumpf
@maxrumpf
Followers: 2K | Following: 2K | Media: 123 | Statuses: 1K
co-founder/ceo https://t.co/RAAaSu0RMJ | ex-researcher @ETH @ycombinator
San Francisco, CA
Joined December 2022
We will soon get to a point, as AI model progress continues, that almost any time something doesn’t work with an AI agent in a reasonably sized task, you will be able to point to a lack of the right information that the agent had access to. This is why context engineering is
77
59
543
another reasonable explanation (we also do multi-turn, tool-use heavy RL:
@srush_nlp Yeah, in multi-turn RL experiments, we actually see pass@N increase with the number of training steps. Maybe you can take a look at our discussion.
0
0
3
Label noise really matters in RL. SID-1's task requires reporting the documents most likely to contain the answer to a question. When the ground-truth data contains errors, the model will start over-reporting in hopes of catching spurious targets. For one public dataset where
0
1
23
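Here is a minimal, back-of-the-envelope sketch of that incentive. The corpus size, the per-document cost term, and the noise levels are my own assumptions, not numbers from the SID-1 report; the point is only that once the label-noise rate is high enough relative to the cost of an extra report, reporting everything strictly dominates reporting just the best candidate.

```python
# Toy model of the over-reporting incentive under label noise (illustrative only).
# Setup: the policy's top-1 document is the true answer. With probability `noise`,
# the recorded gold label points at a random other document instead. The policy
# reports its top-k documents and pays a small penalty per reported document
# (a stand-in for any precision-style term in the reward).

N = 1_000        # corpus size (assumed)
COST = 1e-4      # penalty per reported document (assumed)

def expected_reward(k: int, noise: float) -> float:
    # Hit the label either because it is correct (top-1 is always reported)
    # or by blanket-covering a spurious label (~ k/N chance).
    hit_prob = (1 - noise) + noise * (k / N)
    return hit_prob - COST * k

for noise in (0.0, 0.05, 0.2, 0.5):
    best_k = max(range(1, N + 1), key=lambda k: expected_reward(k, noise))
    print(f"label noise {noise:.2f}: reward-maximizing k = {best_k}")
```

With these numbers the optimum jumps from reporting one document to reporting the whole corpus as soon as the noise rate crosses the cost threshold, which is the over-reporting behavior described above.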
Most RL frameworks are fundamentally unstable. We wasted more H100 hours debugging this than any other issue for our multi-turn, multi-env RL run (below). When using OpenAI-style messages for env interactions, parsing and retokenizing leads to subtly different tokens. This
18
56
544
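To make the failure mode concrete, here is a hedged repro sketch of the decode-then-re-render path. The model name and the strings are placeholders, and this is not veRL's or any specific framework's actual code; it only illustrates how round-tripping sampled tokens through message objects and a chat template can hand the trainer different ids than the policy generated.

```python
# Sketch of the decode -> re-render round trip that can shift tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model

# Token ids the policy actually sampled during the rollout. These are the ids
# the trainer must score for the update to stay on-policy.
sampled_text = 'Searching.\n<tool_call>{"query": "label noise in RL"}</tool_call>\n'
rollout_ids = tok.encode(sampled_text, add_special_tokens=False)

# Framework path: decode the rollout into an OpenAI-style message, then
# re-render the conversation through the chat template before training.
messages = [{"role": "assistant", "content": tok.decode(rollout_ids)}]
rerendered_ids = tok.apply_chat_template(messages, tokenize=True)

# The re-rendered sequence inserts template tokens, and content around message
# boundaries (whitespace, tool-call markup) can merge into different tokens
# than the ones that were sampled, so the logprobs used for the update no
# longer correspond to the trajectory the policy actually generated.
print(rollout_ids)
print(rerendered_ids)
```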
Good RL environments are much richer than you think. We evaluate training for 100 epochs and see eval reward increase steadily. Partly, this is because our RL setting allows obfuscating the answer between epochs, largely mitigating memorization (as seen when inspecting train rollouts).
6
18
146
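The report's exact obfuscation scheme isn't spelled out here, so the following is only one plausible, purely illustrative reading of the idea: if the surface ids of documents are re-drawn every epoch, memorizing last epoch's answer id buys the policy nothing, and it has to retrieve from document content again. The class and id format below are mine, not SID's.

```python
import random

class ToyRetrievalEnv:
    """Illustrative retrieval environment with per-epoch answer obfuscation.

    A guess at the general idea, not the SID training environment: document
    texts stay fixed, but their surface ids are re-drawn each epoch, so a
    memorized mapping like "question 17 -> doc-000042" earns no reward later.
    """

    def __init__(self, docs: list[str], qa_pairs: list[tuple[str, int]]):
        self.docs = docs            # document texts (stable across epochs)
        self.qa_pairs = qa_pairs    # (question, index of the answer document)
        self.new_epoch()

    def new_epoch(self) -> None:
        # fresh random surface ids every epoch
        self.doc_ids = [f"doc-{random.randrange(10**6):06d}" for _ in self.docs]

    def reward(self, qa_index: int, reported_ids: list[str]) -> float:
        gold = self.doc_ids[self.qa_pairs[qa_index][1]]
        return float(gold in reported_ids)
```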
we just released our first model: SID-1. it's designed to be extremely good at only one task: retrieval. it has 1.8x better recall than embedding search alone (even with reranking) and beats "agentic" retrieval implemented using all frontier LLMs, including the really large and
18
37
372
We believe retrieval is the ideal playground for self-play RL on LLMs. SID-1 was trained with "pseudo self-play": new questions were generated to cover gaps in model behavior as training progressed. We think we're not far away from closing that loop: generating hard, verifiable
6
11
71
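A hedged sketch of what such a "pseudo self-play" data loop could look like. This is not the SID-1 pipeline; `policy`, `question_generator`, and the `corpus` helpers are hypothetical stand-ins for the current retrieval policy, a generator model, and a document store, and the structure is only meant to show where the "verifiable" check sits.

```python
# Illustrative interface sketch, not runnable against any real pipeline.
def pseudo_self_play_round(policy, question_generator, corpus, eval_questions,
                           recall_threshold=0.6):
    """One round: find where the policy is weak, synthesize similar questions
    grounded in a specific document, and keep only the verifiable ones."""
    new_training_questions = []
    # 1. cases the current policy handles poorly
    weak = [q for q in eval_questions if policy.recall(q) < recall_threshold]
    for q in weak:
        # 2. generate a fresh question shaped like the weak case, tied to a document
        doc = question_doc = corpus.sample_document_like(q)      # hypothetical helper
        candidate = question_generator.write_question(doc)       # hypothetical helper
        # 3. verifiable: the answer must actually be recoverable from `doc`
        if candidate and corpus.answer_in_document(candidate, doc):
            new_training_questions.append((candidate, doc))
    return new_training_questions  # fed into the next block of RL training
```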
This is such a sad showing from Zoom. A note from someone who has *trained* a SOTA LLM. Let me explain: @Zoom strung together API calls to Gemini, GPT, Claude et al. and slightly improved on a benchmark that delivers no value for their customers. They then claim SOTA. The
Zoom achieved a new state-of-the-art (SOTA) result on Humanity’s Last Exam (HLE): 48.1% — outperforming other AI models with a 2.3% jump over the previous SOTA. ✨ HLE is one of the most rigorous tests in AI, built to measure real expert-level knowledge and deep reasoning across
29
26
595
it's fascinating how small hparam changes will lead to vastly different behaviors, strategies, and personalities in models. we have no idea why
1
0
8
OpenAI-style messages will make your RL run collapse! This explains behavior seen in Search-R1 and also during training of SID-1. Fix: Handle environment responses directly in tokens. This is NOT the case in veRL. From the report: >Parsing a token list to a message list is
0
0
7
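A hedged sketch of what "handle environment responses directly in tokens" can look like in practice. This is not the SID or veRL code; the model name is a placeholder, `env_step` is a hypothetical callable, and the only point is the shape of the loop: keep one running list of token ids, append the environment output as freshly encoded tokens, and never decode the rollout back into messages to re-render it.

```python
# Sketch, assuming a HF-style tokenizer and model.generate() over token ids.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")     # placeholder
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

def rollout(prompt_ids: list[int], env_step, max_turns: int = 4) -> list[int]:
    """Multi-turn rollout that stays in token space the whole time.

    `env_step` is a hypothetical callable: it takes the decoded model turn and
    returns the environment's textual response, or None when the episode ends.
    """
    ids = list(prompt_ids)
    for _ in range(max_turns):
        out = model.generate(torch.tensor([ids]), max_new_tokens=256)
        new_ids = out[0].tolist()[len(ids):]
        ids += new_ids                               # exactly what was sampled
        env_text = env_step(tok.decode(new_ids))     # the env may read text...
        if env_text is None:
            break
        # ...but its response is appended as tokens; earlier ids are never
        # decoded and re-encoded, so training-time logprobs match the rollout.
        ids += tok.encode(env_text, add_special_tokens=False)
    return ids
```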
SID-1 knows what it needs to know at all times.
0
1
5
first nvidia-smi in space
We have just used the @Nvidia H100 onboard Starcloud-1 to train the first LLM in space! We trained the nano-GPT model from Andrej @Karpathy on the complete works of Shakespeare and successfully ran inference on it. We have also run inference on a preloaded Gemma model, and we
0
0
4
ETH Zurich has 10,000 H100s
@FrancoisChauba1 @agupta Total BS. NYU has the largest GPU cluster of all US academic institutions (500 H200s, bigger than Princeton's) and doesn't even appear on this graph.
0
0
4
it was a privilege to work on this model! it's also great to finally be able to share some of the details in our tech report! we've been working on retrieval for a while. this is the way it should always have worked. the results speak for themselves.
1
0
5
length bias removal in common GRPO settings (Dr. GRPO, Mistral) is unstable and leads to collapse over time. we first observed this while training SID-1 and then proved mathematically that this is bound to happen. this might explain why DeepSeek did not remove the "length bias"!
2
3
18
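For readers who haven't seen the two objectives side by side, this is the term in question as I read the GRPO and Dr. GRPO formulations (KL penalty omitted, notation approximate). The original GRPO objective normalizes each response by its length:

$$
\mathcal{J}_{\text{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\Big(r_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),1-\epsilon,1+\epsilon\big)\,\hat{A}_i\Big)\right],\qquad \hat{A}_i=\frac{R_i-\operatorname{mean}(R)}{\operatorname{std}(R)}
$$

"Length bias removal" in the Dr. GRPO style drops the per-sequence $\tfrac{1}{|o_i|}$ (and the std in the advantage), so every token contributes with equal weight regardless of how long the response is:

$$
\mathcal{J}_{\text{Dr. GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\min\Big(r_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),1-\epsilon,1+\epsilon\big)\,\hat{A}_i\Big)\right],\qquad \hat{A}_i=R_i-\operatorname{mean}(R)
$$

where $r_{i,t}(\theta)=\pi_\theta(o_{i,t}\mid q,o_{i,<t})/\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,o_{i,<t})$ is the usual importance ratio.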
"AI written" is the ultimate insult for a piece of writing. if the author admits an llm wrote it, they seem lazy. if the author insists they wrote it, they seem dull. lazy seems more palatable.
0
0
4