SWE-bench @SWEbench X Profile

SWE-bench

@SWEbench

Followers

174

Following

35

Media

5

Statuses

23

Official SWE-bench Account. Follow for updates to the SWE-universe

https://t.co/uCK2KAGEvA

Joined May 2025

SWE-bench

@SWEbench

1 day

https://t.co/QffIiejSP6

CLS

@ChengleiSi

1 day

@jyangballin @KLieret @_carlosejimenez @OfirPress how do I join SWE-bench slack John

0

1

2

SWE-bench

@SWEbench

1 day

SWE-bench blog site launched! Check out our content + expect more SWE-bench/agent/smith content soon!

0

2

John Yang

@jyangballin

2 months

New eval! Code duels for LMs ⚔️ Current evals test LMs on *tasks*: "fix this bug," "write a test" But we code to achieve *goals*: maximize revenue, cut costs, win users Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals

31

95

390

Chunyang Chen

@chun_yang_chen

2 months

🏆Glad to know that our #ASE25 paper about automated bug repair using MMLM just got the ACM SIGSOFT Distinguished Paper Award🎉 And it is still ranked top #1 in @SWEbench Multimodal Track! Thank Kai, Xiaofei @xfxie312, and Jian for the great work!

Chunyang Chen

@chun_yang_chen

3 months

Excited to announce Kai's latest ASE'25 work, let LLMs not only see bugs, but also fix them: 📄 “Seeing is Fixing: Cross-Modal Reasoning with Multimodal LLMs for Visual Software Issue Repair” 🔗 https://t.co/H44pUSHw92 Ranked #1 on @SWEbench Multimodal!

0

1

6

Ofir Press

@OfirPress

3 months

Congrats to @Zai_org GLM-4.5 on getting the 7th spot on our SWE-bench Verified [Bash Only] leaderboard! w/ @KLieret @_carlosejimenez @jyangballin

2

1

12

Ofir Press

@OfirPress

3 months

Super excited to have @anyscalecompute use mini-swe-agent for their large scale runs! w/ @KLieret @_carlosejimenez @jyangballin

1

4

16

Ofir Press

@OfirPress

4 months

3 out of the top 6 most downloaded datasets on @huggingface are SWE-bench related. Thanks!!! ♥️

1

7

65

carlos

@_carlosejimenez

4 months

Recent open model scores on SWE-bench Bash Only: 🥇Qwen3-Coder 480B/A35B Instruct - 55.40% 🥈Kimi-K2-Instruct - 43.80% 🥉gpt-oss-120b - 26.00% See the full leaderboard below! 👇

6

27

213

Kilian Lieret

@KLieret

4 months

What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately. Read more in the SWE-bench blog 🧵

18

21

271

Kilian Lieret

@KLieret

4 months

DeepSeek v3.1 chat scores 53.8% on SWE-bench Verified with mini-SWE-agent. Tends to take more steps to solve problems than others (flattens out after some 125 steps). As a result effective cost is somewhere near GPT-5 mini. Details in 🧵

8

22

155

Ofir Press

@OfirPress

5 months

GPT-5 gets 74.9 on SWE-bench. Wonder what the budget per task is.

3

1

17

carlos

@_carlosejimenez

5 months

What happens if you compare LMs on SWE-bench without the fancy scaffolds? Our new leaderboard “SWE-bench (bash only)” shows you which LMs are the best at getting the job done with just bash. More on why this is important 👇

14

27

205

Ofir Press

@OfirPress

5 months

Super exciting to have 3 new open-weight models that all obtain more than 60 on SWE-bench Verified! Looking forward to the results on SWE-bench Multimodal when these models obtain vision capabilities :)

5

2

22

Kilian Lieret

@KLieret

5 months

Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench verified! Made for benchmarking, fine-tuning, RL, or just for use from your terminal. It’s open source, simple to hack, and compatible with any LM! Link in 🧵

12

76

791

SWE-bench

@SWEbench

5 months

🎉 Congrats @Alibaba_Qwen @huybery @JustinLin610 and the Qwen team! Incredible progress in the last year, love to see Qwen continue championing open models for SWE-bench!

Qwen

@Alibaba_Qwen

5 months

>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves

0

2

SWE-bench

@SWEbench

5 months

Doc: https://t.co/j9mJPAYLpt Code:

github.com

SWE-agent takes a GitHub issue and tries to automatically fix it, using your LM of choice. It can also be employed for offensive cybersecurity or competitive coding challenges. [NeurIPS 2024] - Gi...

0

4

SWE-bench

@SWEbench

5 months

SWE-agent is now Multimodal! 😎 We're releasing SWE-agent Multimodal, with image-viewing abilities and a full web browser for debugging front-ends. Evaluate your LMs on SWE-bench Multimodal or use it yourself for front-end dev. 🔗➡️

1

7

15

SWE-bench

@SWEbench

6 months

@refact_ai @allhands_ai @TU_Muenchen This update was brought to you by: @jyangballin @_carlosejimenez @KLieret @OfirPress

0

3

SWE-bench

@SWEbench

6 months

@refact_ai @allhands_ai @TU_Muenchen If you'd like to use SWE-agent Multimodal, our agent system for the Multimodal benchmark, we've released a preview of it here: https://t.co/ufY8nEcDuF

1

0

5

SWE-bench

@SWEbench

6 months

@refact_ai @allhands_ai @TU_Muenchen For more info on SWE-bench Multimodal, see: https://t.co/WsbZYkt1VK For the leaderboard, go to: https://t.co/g2sLxnsjXp

1

0

3