
Naman Jain
@StringChaos
Followers: 2K · Following: 5K · Media: 50 · Statuses: 519
PhD @UCBerkeley; Research @cursor_ai | Projects: LiveCodeBench, DeepSWE, R2E-Gym, GSO, Syzygy, LMArena Coding | Past: @MetaAI @AWS @MSFTResearch @iitbombay
Berkeley
Joined March 2018
Excited to release R2E-Gym:
🔥 8.1K executable environments using synthetic data
🧠 Hybrid verifiers for enhanced inference-time scaling
📈 51% success rate on SWE-Bench Verified
🤗 Open-source data + models + trajectories
1/
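A rough sketch of what an "executable environment" loop can look like: the agent issues shell commands against a checked-out repo, and success is verified by actually running the test suite. The `RepoEnv` class and its methods below are illustrative stand-ins under those assumptions, not R2E-Gym's actual API.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class RepoEnv:
    """Toy stand-in for an executable repository environment.

    Hypothetical interface: real frameworks such as R2E-Gym define their
    own richer APIs; this only illustrates the act/observe/verify loop.
    """
    workdir: str
    test_cmd: str = "pytest -q"

    def step(self, shell_command: str) -> str:
        """Run an agent-issued shell command inside the repo and return its output."""
        result = subprocess.run(
            shell_command, shell=True, cwd=self.workdir,
            capture_output=True, text=True, timeout=120,
        )
        return result.stdout + result.stderr

    def verify(self) -> bool:
        """Execution-based check: the episode succeeds if the test suite passes."""
        result = subprocess.run(
            self.test_cmd, shell=True, cwd=self.workdir,
            capture_output=True, text=True, timeout=600,
        )
        return result.returncode == 0
```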
Quick post-summer GSO update. Several new models are now live on the leaderboard!! 🧵👇
Interested in building and benchmarking deep research systems? Excited to introduce DeepScholar-Bench, a live benchmark for generative research synthesis, from our team at Stanford and Berkeley!
🏆 Live Leaderboard: https://t.co/gWuylXVlkJ
📚 Paper: https://t.co/BbtsoZHlSh
🛠️ …
MoE layers can be really slow. When training our coding models at @cursor_ai, they ate up 27–53% of training time. So we completely rebuilt the MoE layer at the kernel level and transitioned to MXFP8. The result: a 3.5x faster MoE layer and a 1.5x end-to-end training speedup. We believe our…
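For context on the MXFP8 move: microscaling formats store low-precision elements in small blocks that share a scale. The sketch below simulates block-wise scaling with int8 elements and float scales purely to show the block structure; real MXFP8 uses FP8 element types with shared power-of-two block scales, and none of this reflects Cursor's actual kernels.

```python
import numpy as np


def blockwise_quantize(x: np.ndarray, block: int = 32):
    """Quantize a 1-D tensor in blocks that each share one scale.

    Simplification: int8 elements and float32 scales stand in for
    MXFP8's FP8 elements and power-of-two per-block scales.
    """
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)          # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(x / scales), -127, 127).astype(np.int8)
    return q, scales


def blockwise_dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)


w = np.random.randn(4096).astype(np.float32)
q, s = blockwise_quantize(w)
err = np.abs(blockwise_dequantize(q, s) - w).max()
print(f"max abs quantization error: {err:.4f}")
```

Sharing one scale per small block keeps the quantization error local to each block, which is the intuition behind running matmuls in 8-bit without large accuracy loss.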
After three intense months of hard work with the team, we made it! We hope this release can help drive the progress of Coding Agents. Looking forward to seeing Qwen3-Coder continue creating new possibilities across the digital world!
Qwen3-Coder is here! ✅ We're releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves…
We just released the evaluation of LLMs on the 2025 IMO on MathArena! Gemini scores best, but with its 31% score (13/42) it is still unlikely to earn a bronze medal. 🧵(1/4)
Can data owners & LM developers collaborate to build a strong shared model while each retains control of their data? Introducing FlexOlmo 💪, a mixture-of-experts LM enabling:
• Flexible training on your local data without sharing it
• Flexible inference to opt your data in or out
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
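A minimal sketch of the opt-in/opt-out idea, assuming each expert corresponds to one data owner's data: masking an expert's router logit at inference removes that owner's contribution entirely. The routing code is illustrative, not FlexOlmo's implementation.

```python
import numpy as np


def moe_forward(x, experts, router_logits, active_mask, top_k=2):
    """Route a token to the top-k *active* experts and mix their outputs.

    Illustrative only: experts are plain weight matrices, and `active_mask`
    stands in for a data owner opting their expert in or out at inference.
    """
    logits = np.where(active_mask, router_logits, -np.inf)   # opted-out experts get no traffic
    top = np.argsort(logits)[-top_k:]                         # indices of the top-k active experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))


rng = np.random.default_rng(0)
d = 8
experts = [rng.standard_normal((d, d)) for _ in range(4)]
x = rng.standard_normal(d)
router_logits = rng.standard_normal(4)

mask_all = np.ones(4, dtype=bool)
out_full = moe_forward(x, experts, router_logits, mask_all)

mask_opt_out = mask_all.copy()
mask_opt_out[np.argmax(router_logits)] = False    # the owner of the most-used expert opts out
out_opted = moe_forward(x, experts, router_logits, mask_opt_out)
print(np.allclose(out_full, out_opted))            # False: that expert no longer contributes
```

Setting an opted-out expert's logit to -inf before the softmax means it can neither be routed to nor leak into the mixture weights.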
DeepSWE is a new state-of-the-art open-source software engineering model trained entirely using reinforcement learning, based on Qwen3-32B. https://t.co/W54nEqNTgF Fantastic work from @togethercompute @Agentica_‼
Announcing DeepSWE 🤖: our fully open-sourced, SOTA software engineering agent trained purely with RL on top of Qwen3-32B. DeepSWE achieves 59% on SWEBench-Verified with test-time scaling (and 42.2% Pass@1), topping the SWEBench leaderboard for open-weight models. Built in…
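For reference, Pass@1 and best-of-n style numbers like those above are commonly reported with the standard unbiased pass@k estimator (Chen et al., 2021). The rollout counts in this example are made up, and this is not a claim about how DeepSWE's specific numbers were computed.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generations of which c are correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# e.g. 16 rollouts per task, 7 of them resolve the issue (hypothetical numbers)
print(pass_at_k(n=16, c=7, k=1))   # 0.4375 -- Pass@1
print(pass_at_k(n=16, c=7, k=8))   # chance a best-of-8 attempt contains a correct fix
```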
Do check out our previously proposed R2E-Gym training environments and hybrid test-time scaling that contribute to this work!
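A toy sketch of hybrid test-time scaling: generate many candidate patches, then rank them by mixing an execution-based signal (do tests pass?) with an execution-free score from a learned verifier. The scoring rule, weights, and field names are illustrative assumptions, not the exact combination used in R2E-Gym.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    patch: str
    tests_passed: bool    # execution-based signal (e.g. running regression tests)
    model_score: float    # execution-free signal from a learned verifier, in [0, 1]


def hybrid_select(candidates: list[Candidate], alpha: float = 0.5) -> Candidate:
    """Pick one patch out of many rollouts by mixing both verifier signals."""
    def score(c: Candidate) -> float:
        # alpha trades off the execution-based check against the learned score
        return alpha * float(c.tests_passed) + (1 - alpha) * c.model_score
    return max(candidates, key=score)


best = hybrid_select([
    Candidate("patch-a", tests_passed=True, model_score=0.40),
    Candidate("patch-b", tests_passed=False, model_score=0.95),
    Candidate("patch-c", tests_passed=True, model_score=0.70),
])
print(best.patch)   # patch-c under the default alpha
```

The point of mixing the two signals is that tests alone can be incomplete while a learned verifier alone can be fooled; combining them tends to rank candidates better as the number of rollouts grows.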
🚀 Introducing DeepSWE: Open-Source SWE Agent
We're excited to release DeepSWE, our fully open-source software engineering agent trained with pure reinforcement learning on Qwen3-32B.
📊 The results: 59% on SWE-Bench Verified with test-time scaling (42.2% Pass@1), a new SOTA…
🚀 Introducing DeepSWE 🤖: our fully open-sourced, SOTA software engineering agent trained purely with RL on top of Qwen3-32B. DeepSWE achieves 59% on SWEBench-Verified with test-time scaling (and 42.2% Pass@1), topping the SWEBench leaderboard for open-weight models. 💪 DeepSWE…
RLVR is not just about RL, it's more about VR! Particularly for LLM coding, good verifiers (tests) are hard to get! In our latest work, we ask 3 questions: How good are current tests? How do we get better tests? How much does test quality matter? https://t.co/dCXr6mmg3Z
(leililab.github.io) We propose HardTestGen, a pipeline for synthesizing high-quality test cases, and study how much it improves code evaluation and LLM post-training.
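A small sketch of why test quality matters for RLVR-style rewards: with weak tests, a wrong program still earns full reward, while harder tests catch it. The `solve`-based harness and the example tests are a toy setup of my own, not HardTestGen.

```python
def run_candidate(code: str, tests: list[tuple[str, str]]) -> float:
    """Binary RLVR-style reward: 1.0 only if the candidate passes every test.

    Toy setup: each test is an (input, expected_output) pair fed to a
    `solve(s)` function the candidate is expected to define. Weak or sparse
    tests make this reward easy to satisfy with wrong programs, which is
    exactly the failure mode better test synthesis targets.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)            # sandboxed/trusted execution assumed
        solve = namespace["solve"]
        return float(all(str(solve(inp)) == expected for inp, expected in tests))
    except Exception:
        return 0.0


weak_tests = [("2", "4")]
hard_tests = [("2", "4"), ("3", "9"), ("10", "100")]
wrong_program = "def solve(s):\n    return int(s) + 2"    # squares nothing, passes the weak test

print(run_candidate(wrong_program, weak_tests))    # 1.0 -- false positive reward
print(run_candidate(wrong_program, hard_tests))    # 0.0 -- caught by harder tests
```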
Questions to ask:
1. Can we see a "commit history"? (only 2 commits in the repo)
2. What level of supervision was provided?
3. The paper is 4 dense pages. Was it outlined first in a Lean-friendly way before formalization?
4. Which models were used in the formalization?
We are excited to announce Trinity, an autoformalization system for verified superintelligence that we have developed at @morph_labs. We have used it to automatically formalize in Lean a classical result of de Bruijn that the abc conjecture is true almost always.
Day 3 of drilling down into popular benchmarks for models/agents.
Benchmark #3: LiveCodeBench
Developed by researchers at UC Berkeley, MIT, and Cornell, this benchmark evaluates LLM code-generation skills and continually expands with new problems drawn from programming contests.
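A minimal sketch of the contamination-avoidance idea behind a live, continually expanding benchmark: only score a model on problems published after its training cutoff. The field names and dates are illustrative, not LiveCodeBench's actual schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Problem:
    title: str
    contest_date: date


def contamination_free(problems: list[Problem], model_cutoff: date) -> list[Problem]:
    """Keep only problems released after the model's training cutoff, so the
    model cannot have seen them (or their editorials) during pretraining."""
    return [p for p in problems if p.contest_date > model_cutoff]


pool = [
    Problem("old-contest-problem", date(2023, 5, 1)),
    Problem("new-contest-problem", date(2024, 9, 15)),
]
print([p.title for p in contamination_free(pool, model_cutoff=date(2024, 1, 1))])
```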
Introducing Code Researcher - a deep research agent for large systems code and commit history. https://t.co/L9LMjiiIM7 Achieves a 58% crash resolution rate on a benchmark of crashes in the Linux kernel, a complex codebase with 28M LOC & 75K files.
Ensuring construct validity is becoming increasingly complex as we move towards more real-world evaluation setups. We should routinely inspect benchmark solutions to ensure the intended goal is being met!!
How do reasoning models solve hard math problems? We asked 14 mathematicians to review o3-mini-high’s raw, unsummarized reasoning traces on 29 FrontierMath problems. Here’s what they found:
@slimshetty_ And huge thanks to @open_phil and @ajeya_cotra for supporting this work. Their blog post is an inspiring take on how to build good evaluations for AI agents (linked below) and guided a lot of this work. Stay tuned for further updates! https://t.co/O6qSP74tmT
(openphilanthropy.org) Update, 2/5/25: We recently launched a new RFP focused on improving capability evaluations. The new RFP covers three areas: building GCR-relevant benchmarks, advancing the science of evaluations, and...
Building (robust) optimization benchmarks was more challenging than we expected. Work led by @slimshetty_, who spent weeks understanding problems to ensure hacking scenarios are minimized! Check out the paper for more analysis and experiments: