
Sumeet Motwani
@sumeetrm
Followers
2K
Following
4K
Media
51
Statuses
363
ML PhD at Oxford, Previously CS at UC Berkeley
Bay Area, CA
Joined February 2024
🚨How do we improve long-horizon reasoning capabilities by scaling RL with only existing data? Introducing our new paper: "h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning"🫡
> RL on existing datasets saturates very quickly
> Reasoning over
10
47
280
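A minimal sketch of the chaining idea in the announcement above, assuming the method composes existing short problems so that each step's answer feeds the next, yielding one long-horizon task with a single checkable final answer. `Problem` and `chain` are hypothetical names for illustration, not the paper's API:

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Problem:
    template: str                 # e.g. "Add 7 to {x}."
    solve: Callable[[int], int]   # ground-truth function of the injected value

def chain(problems: list[Problem], horizon: int, seed_value: int) -> tuple[str, int]:
    """Thread each step's answer into the next problem, producing one
    long-horizon prompt with a single outcome-checkable final answer."""
    steps, value = [], seed_value
    for p in random.sample(problems, horizon):
        steps.append(p.template.format(x=value))
        value = p.solve(value)    # answer becomes the next step's input
    prompt = "Solve the steps in order; each answer feeds the next.\n" + "\n".join(steps)
    return prompt, value

# e.g. two toy GSM-style steps chained into a horizon-2 task
probs = [Problem("Add 7 to {x}.", lambda v: v + 7),
         Problem("Double {x}.", lambda v: 2 * v)]
prompt, answer = chain(probs, horizon=2, seed_value=3)
```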
@kevinweil Hi, as the owner/maintainer of https://t.co/69gOJM7Ci7, this is a dramatic misrepresentation. GPT-5 found references, which I was personally unaware of, that solved these problems. The 'open' status only means that I personally am unaware of a paper which solves it.
22
142
3K
Excited to share our new work on the expressivity of Transformer-based multi-agent systems and understanding the trade-offs in communication, no. of agents, and achievable speedups ✨ Work led by @frisbeemortel; check out his thread for details!
Is there such a thing as too many agents in multi-agent systems? It depends! 🧵 Our work reveals 3 distinct regimes where communication patterns differ dramatically. More on our findings below 👇 (1/7)
0
4
12
I strongly believe that good reasoning work should test on Instruct models, and not just base models. Otherwise, any gains you see are probably from better instruction tuning + performance already present in the model... Be careful out there, readers.
1
1
21
the one thing this definition misses is reliability; models may be great at solving very hard problems at pass@K, but that is not very reliable for a usable agent. same problem w/ METR - they calculate time horizon at 50% accuracy, which is too low to be useful. pass@K is good
The term “AGI” is currently a vague, moving goalpost. To ground the discussion, we propose a comprehensive, testable definition of AGI. Using it, we can quantify progress: GPT-4 (2023) was 27% of the way to AGI. GPT-5 (2025) is 58%. Here’s how we define and measure it: 🧵
0
1
19
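For the reliability point above, the standard unbiased pass@k estimator (Chen et al., 2021) makes explicit how high pass@k can coexist with low per-attempt accuracy; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: probability that at least one of k
    samples is correct, given c correct out of n total samples."""
    if n - c < k:
        return 1.0                # fewer failures than draws: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 5 correct out of 100 samples: pass@1 = 0.05 but pass@64 ≈ 0.995,
# so a model can look strong at high k while being unreliable per attempt.
print(pass_at_k(100, 5, 1), pass_at_k(100, 5, 64))
```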
RL with a curriculum can teach new skills, and these lead to nice improvements on much harder math benchmarks and entirely OOD reasoning tasks (ReasoningGym/Long-context benchmarks)! Should be exciting to scale this up further
Importantly, our results surpass both standard RL training on the same underlying dataset and the instruct model, even at a very high pass@k, teaching new long-horizon reasoning capabilities! @YangYue_THU and @_AndrewZhao’s work on studying LLM reasoning capabilities beyond the base
1
0
12
One very interesting result from our work is that RL using a curriculum can teach novel capabilities that you can't elicit otherwise (even at very high pass@k). Curriculum learning is back! @jxmnop
🚨How do we improve long-horizon reasoning capabilities by scaling RL with only existing data? Introducing our new paper: "h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning"🫡
> RL on existing datasets saturates very quickly
> Reasoning over
3
14
141
I disagree with this idea that RL for LLMs is only capable of mode sharpening.
1) This mode-sharpening intuition is probably true for shallow RL training (single-task, fewer steps).
2) However, for long RL training with a good curriculum of reasoning tasks, the model can start
@rosinality I’ve always had the following intuition pump for it: if a policy pi can sample at least 1 successful trajectory out of N draws, this means that the distribution of the task (however fuzzy/well-defined it is) is contained in the policy. RL via outcome-based reward hence does
11
5
112
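A toy illustration of that intuition pump, assuming plain outcome-reward REINFORCE on a two-action bandit standing in for a policy that only rarely samples the successful trajectory; purely a sketch under those assumptions, not the thread's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([3.0, 0.0])   # the "successful trajectory" (action 1) starts rare
correct, lr = 1, 0.5

for _ in range(500):
    p = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(2, p=p)
    reward = 1.0 if a == correct else 0.0
    grad = -p
    grad[a] += 1.0              # gradient of log p(a) w.r.t. the logits
    logits += lr * reward * grad  # outcome-only: zero reward means no update

p = np.exp(logits) / np.exp(logits).sum()
print(p)                        # mass has shifted onto the once-rare success
```

Because the update fires only on sampled successes, any behavior the policy can reach at all (pass@N > 0) gets upweighted, which is the sense in which the task distribution is already contained in the policy.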
We believe that to compete at the frontier, you have to own the full stack: from dirt to intelligence. Today we’re announcing two major unlocks for our mission to AGI: 1. We're partnering with @CoreWeave and have 40,000+ NVIDIA GB300s secured. First capacity comes online
33
48
412
Crazy good transfer :0 Their new long-horizon training method, trained on compositions of GSM8K, gives large gains across Reasoning Gym tasks.
Just tested on some Reasoning Gym domains! h1 on long-horizon GSM transfers to:
Propositional logic
Instruct model: 22.9%
h1 training: 47.1%
Graphs (largest island)
Instruct model: 15%
h1 training: 22.5%
Algorithmic (sentence reordering)
Instruct: 9.6%
h1 training: 18.8%
1
1
16
Reasoning Gym domains - h1 on long-horizon GSM transfers to:
Propositional logic
Instruct model: 22.9%
h1 training: 47.1%
Graphs (largest island)
Instruct model: 15%
h1 training: 22.5%
Algorithmic (sentence reordering)
Instruct: 9.6%
h1 training: 18.8%
Algorithmic (manipulate
0
1
3
Just tested on some Reasoning Gym domains! h1 on long-horizon GSM transfers to:
Propositional logic
Instruct model: 22.9%
h1 training: 47.1%
Graphs (largest island)
Instruct model: 15%
h1 training: 22.5%
Algorithmic (sentence reordering)
Instruct: 9.6%
h1 training: 18.8%
🚨How do we improve long-horizon reasoning capabilities by scaling RL with only existing data? Introducing our new paper: "h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning"🫡
> RL on existing datasets saturates very quickly
> Reasoning over
0
1
24
Fresh from presenting MALT: Improving reasoning with multi-agent LLM training at @COLM2025, excited to share the next work on reasoning: this time, showing that long-horizon reasoning can be significantly improved by curriculum training on chained tasks. Fantastic efforts led by
🚨How do we improve long-horizon reasoning capabilities by scaling RL with only existing data? Introducing our new paper: "h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning"🫡
> RL on existing datasets saturates very quickly
> Reasoning over
0
2
8
One very interesting result from our work is that RL using a curriculum can teach novel capabilities that you can't elicit otherwise (even at very high pass@k). Curriculum learning is back! @jxmnop
🚨How do we improve long-horizon reasoning capabilities by scaling RL with only existing data? Introducing our new paper: "h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning"🫡
> RL on existing datasets saturates very quickly
> Reasoning over
3
14
141
Must-read paper! 👇 Was super interested in the degree of improvement in long-horizon performance even on pass@k
🚨How do we improve long-horizon reasoning capabilities by scaling RL with only existing data? Introducing our new paper: "h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning"🫡
> RL on existing datasets saturates very quickly
> Reasoning over
0
1
2
Thanks @ShashwatGoel7!
The new wave of RL papers is finally getting interesting. This one: training models on compositions of short reasoning tasks generalizes to them becoming better at other longer reasoning tasks. Here, you can train on easy gsm data -> gain on hard math. @sumeetrm et al. cooked.
0
0
1
Free lunch to turn your models from short -> long horizon reasoners. Congratz on the release! @sumeetrm
🚨How do we improve long-horizon reasoning capabilities by scaling RL with only existing data? Introducing our new paper: "h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning"🫡
> RL on existing datasets saturates very quickly
> Reasoning over
1
1
16
New @Microsoft + Princeton + Oxford paper shows how to train LLMs for long multi-step reasoning without any new labeled data, by chaining short problems and using outcome-only reinforcement learning with a growing-length curriculum. The big deal is that long-horizon skills
5
40
222
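A hedged sketch of the growing-length curriculum described in the tweet above: train at one chain length, and lengthen the chains once the recent success rate clears a threshold. `rl_update` and `success_rate` are caller-supplied stand-ins (hypothetical, not the paper's code), so the snippet only expresses the schedule:

```python
from typing import Callable

def curriculum_rl(rl_update: Callable[[int], None],
                  success_rate: Callable[[int], float],
                  max_horizon: int = 16,
                  promote_at: float = 0.7,
                  stage_steps: int = 1000) -> None:
    """Outcome-only RL on chained tasks, lengthening the chains as the
    policy masters each horizon. rl_update(h) does one training step on
    chains of length h; success_rate(h) estimates recent accuracy there."""
    h = 1
    while h <= max_horizon:
        for _ in range(stage_steps):
            rl_update(h)
        if success_rate(h) >= promote_at:
            h += 1                # promote only once the current length is learned
```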