Naman Jain Profile
Naman Jain

@StringChaos

Followers: 2K
Following: 5K
Media: 50
Statuses: 519

PhD @UCBerkeley ; Research @cursor_ai | Projects - LiveCodeBench, DeepSWE, R2E-Gym, GSO, Syzygy, LMArena Coding | Past: @MetaAI @AWS @MSFTResearch @iitbombay

Berkeley
Joined March 2018
@StringChaos
Naman Jain
5 months
Excited to release R2E-Gym
- 🔥 8.1K executable environments using synthetic data
- 🧠 Hybrid verifiers for enhanced inference-time scaling
- 📈 51% success rate on SWE-Bench Verified
- 🤗 Open Source Data + Models + Trajectories
1/
15
63
257
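As a rough illustration of the "hybrid verifiers" idea mentioned in the tweet above, the sketch below blends an execution-based signal (fraction of checks passed) with a model-based verifier score to rank sampled candidates. The names, weights, and scoring scheme are hypothetical assumptions for illustration, not the released R2E-Gym implementation.

```python
# Hypothetical sketch of hybrid verification for test-time scaling, loosely
# inspired by the R2E-Gym description above. Function names, the alpha weight,
# and the scoring scheme are illustrative assumptions, not the actual system.
from dataclasses import dataclass

@dataclass
class Candidate:
    patch: str            # candidate code edit produced by the agent
    exec_score: float     # fraction of execution-based checks (tests) passed
    model_score: float    # verifier-LLM estimate of correctness, in [0, 1]

def hybrid_score(c: Candidate, alpha: float = 0.5) -> float:
    """Blend execution-based and model-based signals into one ranking score."""
    return alpha * c.exec_score + (1 - alpha) * c.model_score

def select_best(candidates: list[Candidate]) -> Candidate:
    """Pick the highest-scoring candidate among N sampled rollouts (Best@N)."""
    return max(candidates, key=hybrid_score)

if __name__ == "__main__":
    pool = [
        Candidate("patch_a", exec_score=1.0, model_score=0.4),   # score 0.70
        Candidate("patch_b", exec_score=0.5, model_score=0.8),   # score 0.65
    ]
    print(select_best(pool).patch)  # patch_a wins with the default alpha = 0.5
```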
@slimshetty_
Manish Shetty
6 days
Quick post-summer GSO update. Several new models are now live on the leaderboard!! 🧵👇
12
8
68
@lianapatel_
Liana
10 days
Interested in building and benchmarking deep research systems? Excited to introduce DeepScholar-Bench, a live benchmark for generative research synthesis, from our team at Stanford and Berkeley! 🏆 Live Leaderboard https://t.co/gWuylXVlkJ 📚 Paper: https://t.co/BbtsoZHlSh 🛠️
1
36
166
@stuart_sul
Stuart Sul
20 days
MoE layers can be really slow. When training our coding models @cursor_ai, they ate up 27–53% of training time. So we completely rebuilt them at the kernel level and transitioned to MXFP8. The result: a 3.5x faster MoE layer and a 1.5x end-to-end training speedup. We believe our
30
101
858
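A quick sanity check of the numbers in the tweet above, assuming a simple Amdahl's-law model in which only the MoE share of step time gets faster and everything else is unchanged: a 3.5x faster MoE layer applied to 27–53% of training time lands in roughly the 1.2x–1.6x end-to-end range, consistent with the reported ~1.5x.

```python
# Sanity check (assumption: plain Amdahl's-law model, ignoring overlap and
# other secondary effects) of the speedups quoted in the tweet above.
def end_to_end_speedup(moe_fraction: float, moe_speedup: float = 3.5) -> float:
    """Overall speedup when only the MoE share of step time gets faster."""
    return 1.0 / ((1.0 - moe_fraction) + moe_fraction / moe_speedup)

for frac in (0.27, 0.53):
    print(f"MoE share {frac:.0%}: ~{end_to_end_speedup(frac):.2f}x end-to-end")
# MoE share 27%: ~1.24x end-to-end
# MoE share 53%: ~1.61x end-to-end  -> bracketing the reported ~1.5x
```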
@arithmoquine
henry
28 days
new post. there's a lot in it. i suggest you check it out
71
187
3K
@huybery
Binyuan Hui
2 months
After three intense months of hard work with the team, we made it! We hope this release can help drive the progress of Coding Agents. Looking forward to seeing Qwen3-Coder continue creating new possibilities across the digital world!
@Alibaba_Qwen
Qwen
2 months
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves
60
86
982
@j_dekoninck
Jasper Dekoninck
2 months
We just released the evaluation of LLMs on the 2025 IMO on MathArena! Gemini scores best, but is still unlikely to achieve the bronze medal with its 31% score (13/42). 🧵(1/4)
13
40
222
@WeijiaShi2
Weijia Shi
2 months
Can data owners & LM developers collaborate to build a strong shared model while each retains control of their data? Introducing FlexOlmo 💪, a mixture-of-experts LM enabling: • Flexible training on your local data without sharing it • Flexible inference to opt your data in or out
@allen_ai
Ai2
2 months
Introducing FlexOlmo, a new paradigm for language model training that enables the co-development of AI through data collaboration. 🧵
9
87
280
@Agentica_
Agentica Project
2 months
It's easy to confuse Best@K vs Pass@K—and we've seen some misconceptions about our results. Our 59% on SWEBench-Verified is Pass@1 with Best@16, not Pass@8/16. Our Pass@8/16 is 67%/71%. So how did we achieve this? DeepSWE generates N candidate solutions. Then, another LLM
@casper_hansen_
Casper Hansen
2 months
Is it malpractice to report SOTA with pass@8 without evaluating other models at pass@8, or just standard practice at this point? It's clearly not SOTA if it's behind Devstral at pass@1
1
16
57
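A minimal sketch of the distinction drawn in the thread above: Pass@k asks whether any of k samples passes, while Best@k (as Agentica describes it) samples k candidates, lets a selector commit to one, and grades only that one. The pass_at_k estimator is the standard unbiased form from Chen et al. (2021); the selector and the toy data below are placeholder assumptions, not Agentica's pipeline.

```python
# Illustrative sketch of Pass@k vs Best@k, prompted by the thread above.
# The selector is a stand-in for the verifier/LLM Agentica describes;
# nothing here reproduces their actual pipeline.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples drawn from n candidates (c correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def best_at_k(candidates: list[dict], k: int, score) -> bool:
    """Best@k: take k candidates, let a selector pick ONE, grade only that one."""
    chosen = max(candidates[:k], key=score)
    return chosen["passes"]

if __name__ == "__main__":
    # Toy pool: 16 candidates, 2 of which actually pass the hidden tests.
    pool = [{"passes": i < 2, "verifier_score": 0.9 if i < 2 else 0.4}
            for i in range(16)]
    print(f"Pass@8  = {pass_at_k(n=16, c=2, k=8):.2f}")  # chance any of 8 passes (0.77)
    print(f"Best@16 = {best_at_k(pool, 16, lambda c: c['verifier_score'])}")  # one final answer, graded once
```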
@hardmaru
hardmaru
2 months
DeepSWE is a new state-of-the-art open-source software engineering model trained entirely using reinforcement learning, based on Qwen3-32B. https://t.co/W54nEqNTgF Fantastic work from @togethercompute @Agentica_
@togethercompute
Together AI
2 months
Announcing DeepSWE 🤖: our fully open-sourced, SOTA software engineering agent trained purely with RL on top of Qwen3-32B. DeepSWE achieves 59% on SWEBench-Verified with test-time scaling (and 42.2% Pass@1), topping the SWEBench leaderboard for open-weight models. Built in
13
45
265
@StringChaos
Naman Jain
2 months
Do check out our previously proposed R2E-Gym training environments and hybrid test-time scaling that contribute to this work!
@StringChaos
Naman Jain
5 months
Excited to release R2E-Gym
- 🔥 8.1K executable environments using synthetic data
- 🧠 Hybrid verifiers for enhanced inference-time scaling
- 📈 51% success rate on SWE-Bench Verified
- 🤗 Open Source Data + Models + Trajectories
1/
0
0
3
@StringChaos
Naman Jain
2 months
🚀 Introducing DeepSWE: Open-Source SWE Agent We're excited to release DeepSWE, our fully open-source software engineering agent trained with pure reinforcement learning on Qwen3-32B. 📊 The results: 59% on SWE-Bench-Verified with test-time scaling (42.2% Pass@1) - new SOTA
@Agentica_
Agentica Project
2 months
🚀 Introducing DeepSWE 🤖: our fully open-sourced, SOTA software engineering agent trained purely with RL on top of Qwen3-32B. DeepSWE achieves 59% on SWEBench-Verified with test-time scaling (and 42.2% Pass@1), topping the SWEBench leaderboard for open-weight models. 💪DeepSWE
3
12
77
@alexalbert__
Alex Albert
2 months
Claude is hyped to hear that its small business is getting the public recognition it deserves
@AnthropicAI
Anthropic
2 months
New Anthropic Research: Project Vend. We had Claude run a small shop in our office lunchroom. Here’s how it went.
38
18
779
@kexun_zhang
Kexun Zhang
3 months
RLVR is not just about RL, it's more about VR! Particularly for LLM coding, good verifiers (tests) are hard to get! In our latest work, we ask 3 questions: How good are current tests? How do we get better tests? How much does test quality matter? https://t.co/dCXr6mmg3Z
leililab.github.io
We propose HardTestGen, a pipeline for synthesizing high-quality test cases and study how much it improves code evaluation and LLM post-training.
4
16
93
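A toy illustration of the thread's point that test quality matters for verification, built on a deliberately weak test suite of my own construction; none of this comes from the HardTestGen paper. A buggy max-subarray solution slips past the weak suite (a false positive) but is caught once harder cases are added.

```python
# Toy illustration (not from HardTestGen) of why test quality matters for
# verifying LLM-generated code: a weak suite passes a buggy solution.
def buggy_max_subarray(nums: list[int]) -> int:
    """Incorrect candidate: never returns a negative answer."""
    return max(sum(nums), max(nums), 0)

def reference_max_subarray(nums: list[int]) -> int:
    """Correct Kadane's algorithm used as the oracle."""
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

weak_tests = [[1, 2, 3], [5], [2, -1, 2]]              # easy, mostly-positive cases
hard_tests = weak_tests + [[-3, -1, -2], [4, -10, 6]]  # all-negative and reset cases

for name, suite in [("weak", weak_tests), ("hard", hard_tests)]:
    ok = all(buggy_max_subarray(t) == reference_max_subarray(t) for t in suite)
    print(f"{name} suite verdict on buggy solution: {'pass' if ok else 'fail'}")
# weak suite verdict on buggy solution: pass   <- false positive
# hard suite verdict on buggy solution: fail
```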
@damekdavis
Damek
3 months
Questions to ask:
1. Can we see a "commit history"? (only 2 commits in repo)
2. What level of supervision was provided?
3. The paper is 4 dense pages. Was it outlined first in a Lean-friendly way and then formalized?
4. Which models were used in the formalization?
@morph_labs
Morph
3 months
We are excited to announce Trinity, an autoformalization system for verified superintelligence that we have developed at @morph_labs. We have used it to automatically formalize in Lean a classical result of de Bruijn that the abc conjecture is true almost always.
5
5
143
@bespokelabsai
Bespoke Labs
3 months
Day 3 of drilling down into popular benchmarks for models/agents. Benchmark #3: LiveCodeBench Developed by researchers at UC Berkeley, MIT, and Cornell, this benchmark evaluates LLM code-generation skills and continually expands with new problems drawn from programming contests
1
3
9
@StringChaos
Naman Jain
3 months
We ran this eval yesterday, before the price drop 😆🫠 @OpenAI
@slimshetty_
Manish Shetty
3 months
📣 Exciting first GSO leaderboard update! @OpenAI o3 now ranks #1, setting a new SOTA at 8.8%!!
0
0
7
@adityakanade0
Aditya Kanade
3 months
Introducing Code Researcher - a deep research agent for large systems code and commit history. https://t.co/L9LMjiiIM7 Achieves a 58% crash resolution rate on a benchmark of crashes in the Linux kernel, a complex codebase with 28M LOC & 75K files.
5
91
467
@StringChaos
Naman Jain
3 months
Ensuring construct validity is becoming increasingly complex as we move towards more real-world evaluation setups. We should routinely inspect benchmark solutions to ensure the intended goal is being met!!
@EpochAIResearch
Epoch AI
3 months
How do reasoning models solve hard math problems? We asked 14 mathematicians to review o3-mini-high’s raw, unsummarized reasoning traces on 29 FrontierMath problems. Here’s what they found:
0
0
7
@StringChaos
Naman Jain
3 months
@slimshetty_ And huge thanks to @open_phil and @ajeya_cotra for supporting this work. Their blog is a very inspiring take on how to build good evaluations for AI agents (linked below), which guided a lot of this work. Stay tuned for further updates! https://t.co/O6qSP74tmT
openphilanthropy.org
Update, 2/5/25: We recently launched a new RFP focused on improving capability evaluations. The new RFP covers three areas: building GCR-relevant benchmarks, advancing the science of evaluations, and...
0
1
12
@StringChaos
Naman Jain
3 months
Building (robust) optimization benchmarks was more challenging than we expected. Work led by @slimshetty_, who spent weeks understanding the problems to ensure hacking scenarios are minimal! Check out the paper for more analysis and experiments:
1
0
7