carlos

@_carlosejimenez

1K Followers · 7K Following · 11 Media · 333 Statuses

i like ai, philosophy, and politics

San Francisco, CA
Joined May 2019
@KLieret
Kilian Lieret
21 days
Deepseek v3.1 chat scores 53.8% on SWE-bench verified with mini-SWE-agent. It tends to take more steps to solve problems than other models (performance flattens out after around 125 steps). As a result, its effective cost lands somewhere near GPT-5 mini. Details in 🧵
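As a back-of-the-envelope illustration of why a model that takes more steps can cost as much as a nominally pricier one, here is a toy cost model. All numbers and the function itself are hypothetical, not taken from the SWE-bench evaluation:

```python
def effective_cost(steps, tokens_per_step, price_per_mtok):
    """Toy per-instance cost: total tokens consumed times the price per million tokens."""
    return steps * tokens_per_step * price_per_mtok / 1e6

# A cheap model that needs many steps can match a pricier model that needs few.
# 125 steps at $1.00/Mtok vs. 50 steps at $2.50/Mtok (hypothetical numbers):
slow_cheap = effective_cost(125, 4_000, 1.00)   # $0.50
fast_pricey = effective_cost(50, 4_000, 2.50)   # $0.50
```

The point is that per-token price alone is misleading; step count multiplies into the effective cost.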
@KLieret
Kilian Lieret
22 days
What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately. Read more in the SWE-bench blog 🧵
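The per-turn switching idea can be sketched in a few lines. This is a hedged sketch for illustration only, not mini-SWE-agent's actual code; the model identifiers and the uniform-random choice are assumptions:

```python
import random

# Hypothetical model identifiers, for illustration only.
MODELS = ["gpt-5", "claude-sonnet-4"]

def pick_model(rng: random.Random) -> str:
    """Choose which LM handles the next agent turn, uniformly at random."""
    return rng.choice(MODELS)

def plan_turns(n_turns: int, seed: int = 0) -> list[str]:
    """Assign a (possibly different) model to each of n_turns agent turns."""
    rng = random.Random(seed)
    return [pick_model(rng) for _ in range(n_turns)]
```

Each turn's request goes to whichever model was drawn, while the conversation history is shared across models.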
@SemiAnalysis_
SemiAnalysis
1 month
At the end of the day, the SWE-bench leaderboard on swebench.com is probably the clearest picture of current model performance on this benchmark. No "verified" subset, limited tool use (bash only), most scaffolding is open to see. In this benchmark, the Claude 4 Opus
@_carlosejimenez
carlos
1 month
Recent open model scores on SWE-bench Bash Only:
🥇 Qwen3-Coder 480B/A35B Instruct - 55.40%
🥈 Kimi-K2-Instruct - 43.80%
🥉 gpt-oss-120b - 26.00%
See the full leaderboard below! 👇
@KLieret
Kilian Lieret
1 month
We evaluated the new GPT models with a minimal agent on SWE-bench verified. GPT-5 scores 65%, mini 60%, nano 35%. Still behind Opus 5 (68%), on par with Sonnet 4 (65%). But a lot cheaper, especially mini! Complete cost breakdown + details in 🧵
@AbramovichTalor
Talor Abramovich
1 month
Incredible to see the progress in Offensive Cybersecurity benchmarks!
@terryyuezhuo
Terry Yue Zhuo @ SF 🏖️
1 month
Training agents without a runtime? Yes, and it works well on offensive cybersecurity! Introducing Cyber-Zero, the first approach for training top-tier open-source cybersecurity agents, achieving accuracy on 300+ CTFs comparable to DeepSeek-V3 and Claude-3.5-Sonnet. What makes
@KLieret
Kilian Lieret
1 month
Play with gpt-5 in our minimal agent (guide in the 🧵)! gpt-5 really wants to solve everything in one shot, so some prompting adjustments are needed to make it behave like a proper agent. It still likes to cram a lot into a single step. Full evals tomorrow!
@OfirPress
Ofir Press
1 month
.@_carlosejimenez updated the SWE-bench [Bash only] leaderboard with Qwen3 numbers. Congrats to the team on the great results! Note that these numbers are about 10% lower than the max numbers achievable by each model since we don't allow tools in this leaderboard.
@Alibaba_Qwen
Qwen
2 months
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves
@_carlosejimenez
carlos
1 month
Leaderboard: https://t.co/GVvoj66Urr Collab with @jyangballin @KLieret @OfirPress We thank our sponsors (AWS, a16z, modal, among others). Find SWE-bench useful? Please consider sponsoring our research!
@_carlosejimenez
carlos
1 month
We evaluate all LMs using exactly the same environment: mini-SWE-agent. This environment provides a simple, standard bash shell, with no special tools, retrieval, or helper methods. https://t.co/IvdAOmHhC8
github.com · SWE-agent/mini-swe-agent: The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo, but scores 68% on SWE-bench verified!
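The bash-only setup can be approximated with an observe-act loop like the one below. This is a minimal sketch of the idea, not mini-SWE-agent's actual implementation; `query_lm` is a stand-in for whatever LM call produces the next shell command:

```python
import subprocess

def run_bash(command: str, timeout: int = 60) -> str:
    """Execute one shell command and return its combined stdout/stderr."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def agent_loop(query_lm, max_steps: int = 125) -> list[tuple[str, str]]:
    """Minimal agent: the LM emits one bash command per turn until it says 'exit'."""
    history: list[tuple[str, str]] = []
    for _ in range(max_steps):
        command = query_lm(history)
        if command.strip() == "exit":
            break
        history.append((command, run_bash(command)))
    return history
```

Because the only action is a bash command, every model faces an identical interface, which is what makes the comparison apples-to-apples.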
@_carlosejimenez
carlos
1 month
While our other leaderboards allow all types of systems (RAG, multi-agent, etc.), SWE-bench (bash only) provides a space to isolate and compare the agentic coding capabilities of LMs in an apples-to-apples comparison.
@_carlosejimenez
carlos
1 month
What happens if you compare LMs on SWE-bench without the fancy scaffolds? Our new leaderboard “SWE-bench (bash only)” shows you which LMs are the best at getting the job done with just bash. More on why this is important 👇
@KLieret
Kilian Lieret
2 months
Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench verified! Made for benchmarking, fine-tuning, RL, or just for use from your terminal. It’s open source, simple to hack, and compatible with any LM! Link in 🧵
@OfirPress
Ofir Press
2 months
AGI
@OfirPress
Ofir Press
2 months
LMs had a really tough time playing real video games from the 90s, so we made a suite of 3 simple games to test specific abilities, including dragging and dropping, and navigating a maze with the arrow keys. Even on these *extremely* simple games, most frontier LMs fail. Results -->
@a1zhang
Alex Zhang
2 months
On the VideoGameBench website, we've added the "practice games" used in a case study of our paper to understand where VLMs struggle. All of these games are relevant to finer-grained actions in video games (clicking, navigating, dragging a mouse) Go try them out now! mini-🧵
@SWEbench
SWE-bench
2 months
SWE-agent is now Multimodal! 😎 We're releasing SWE-agent Multimodal, with image-viewing abilities and a full web browser for debugging front-ends. Evaluate your LMs on SWE-bench Multimodal or use it yourself for front-end dev. 🔗➡️
@AbramovichTalor
Talor Abramovich
2 months
Join me next week at #ICML25, where I will be presenting my first first-author paper, EnIGMA. EnIGMA, an LM agent for cybersecurity, uses interactive tools for server connection and debugging, achieving state-of-the-art results on 3 CTF benchmarks.
@rajko_rad
Rajko Radovanović
3 months
We @a16z just launched the third batch of Open Source AI Grants (cc @mbornstein) 🎉 This round includes projects focused on LLM evaluation, novel reasoning tests, infrastructure, and experimental research at the edge of capability and cognition: • SGLang: High-performance LLM