Kilian Lieret
@KLieret
1K Followers · 52 Following · 32 Media · 132 Statuses
AI agents & benchmarks for software engineering @Princeton
Princeton
Joined May 2021
Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench verified! Made for benchmarking, fine-tuning, RL, or just for use from your terminal. It’s open source, simple to hack, and compatible with any LM! Link in 🧵
12 replies · 76 reposts · 791 likes
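A minimal agent of this kind boils down to a short loop: ask the LM for one bash command, run it, and feed the output back as the next observation. A hedged sketch of that loop (hypothetical helper names and message format — not the actual mini-swe-agent code):

```python
import subprocess

def query_model(messages):
    # Placeholder for the LM call (e.g. via any chat-completions client).
    # Expected to return a reply containing one bash command in a ```bash fence.
    raise NotImplementedError

def extract_command(reply):
    # Pull the bash command out of the reply's ```bash fence (assumed format).
    start = reply.index("```bash") + len("```bash")
    end = reply.index("```", start)
    return reply[start:end].strip()

def run_agent(task, max_steps=50):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if "DONE" in reply:  # model signals completion (assumed convention)
            break
        cmd = extract_command(reply)
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        # Feed the command's output back as the next observation
        messages.append({"role": "user", "content": result.stdout + result.stderr})
    return messages
```

Because the only "tool" is the shell, the same loop works with any LM that can emit bash commands, which is what makes the 100-line design model-agnostic.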
Happy to see an evaluation of model abilities on higher-level objectives (extra points for the arena format)
New eval! Code duels for LMs ⚔️ Current evals test LMs on *tasks*: "fix this bug," "write a test" But we code to achieve *goals*: maximize revenue, cut costs, win users Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals
1 reply · 2 reposts · 8 likes
Let's stop babysitting LMs in our SWE benchmarks! In CodeClash, we're evaluating them like senior devs: By goals achieved, not by tickets closed! Our new eval is much more free-form than SWE-bench and models are still terrible at it!
0 replies · 1 repost · 12 likes
SWE-bench turns 2 years old today! Led by @_carlosejimenez @jyangballin @KLieret and me, we've continued maintaining and expanding the SWE-bench universe: SWE-agent, SWE-bench Multimodal, SWE-bench Multilingual, and SWE-smith. Looking forward to the next 2 years!
0 replies · 1 repost · 22 likes
Tomorrow is the 2nd birthday of SWE-bench! I will be giving a talk at MIT about SWE-bench & SWE-agent: an overview, how we developed those ideas and what’s next. Link ⬇️
6 replies · 6 reposts · 207 likes
For reference, this is the prompt we used for mini-swe-agent: https://t.co/eGFCR4NauS (it's the same for all models)
github.com
The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores >70% on SWE-bench verified! - SWE-agent/min...
0 replies · 0 reposts · 1 like
We ran without this part of the prompt and still saw that Sonnet 4.5 takes more steps toward solutions. That really speaks to the ability of newer models to stay on track during long iterative processes. The plot below shows that more than 20% of successes come after step 60.
1 reply · 0 reposts · 1 like
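The headline statistic can be computed directly from per-instance runs; a sketch with toy data (the field names `resolved` and `n_steps` are assumptions, not the actual trajectory schema):

```python
def late_success_fraction(trajectories, step_cutoff=60):
    """Fraction of successful runs that needed more than `step_cutoff` steps."""
    successes = [t for t in trajectories if t["resolved"]]
    if not successes:
        return 0.0
    late = [t for t in successes if t["n_steps"] > step_cutoff]
    return len(late) / len(successes)

# Toy data: 5 solved instances, one of which needed more than 60 steps
runs = [
    {"resolved": True, "n_steps": 12},
    {"resolved": True, "n_steps": 35},
    {"resolved": False, "n_steps": 100},
    {"resolved": True, "n_steps": 48},
    {"resolved": True, "n_steps": 72},
    {"resolved": True, "n_steps": 20},
]
print(late_success_fraction(runs))  # → 0.2
```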
"You should use tools as much as possible, ideally more than 100 times." interesting part of the prompt that Anthropic used with Sonnet 4.5 on SWE-bench, showing an intentional effort to get the model to iterate longer on hard problems. 🧵
1 reply · 1 repost · 2 likes
This analysis was conducted with mini-swe-agent. It's open source and the documentation tells you exactly how to reproduce our numbers.
github.com
The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores >70% on SWE-bench verified! - SWE-agent/min...
0 replies · 2 reposts · 7 likes
This run was performed with the new SWE-bench docker images provided by @_carlosejimenez, which fix the recently discovered bug where models cheat using the git history. All other numbers quoted here also include fixes for this issue.
1 reply · 0 reposts · 6 likes
By varying the agent step limit, you can get some control over cost, giving you a curve of average cost vs. SWE-bench score. But clearly it's quite expensive even with conservative limits.
1 reply · 2 reposts · 13 likes
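One way to trace such a curve offline is to replay finished runs under different step caps. A sketch with assumed fields (`cost`, `n_steps`, `resolved`) and the simplifying assumption that cost scales linearly with steps — in reality later steps cost more because the context keeps growing, so this underestimates late-step cost:

```python
def cost_score_curve(trajs, limits):
    """For each step limit: average cost and solve rate when runs are
    truncated at that limit. Cost is prorated linearly by steps taken."""
    curve = []
    for limit in limits:
        cost = sum(t["cost"] * min(t["n_steps"], limit) / t["n_steps"] for t in trajs)
        solved = sum(1 for t in trajs if t["resolved"] and t["n_steps"] <= limit)
        curve.append((limit, cost / len(trajs), solved / len(trajs)))
    return curve

# Toy trajectories (assumed fields, not the actual output schema)
trajs = [
    {"cost": 1.0, "n_steps": 10, "resolved": True},
    {"cost": 4.0, "n_steps": 40, "resolved": True},
    {"cost": 2.0, "n_steps": 80, "resolved": False},
]
for limit, avg_cost, score in cost_score_curve(trajs, [20, 100]):
    print(limit, round(avg_cost, 2), round(score, 2))
```

Sweeping `limits` over a range of step caps yields exactly the cost-vs-score trade-off curve described above.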
Sonnet 4.5 takes significantly more steps to solve instances than Sonnet 4, making it more expensive to run in practice.
2 replies · 2 reposts · 30 likes
We evaluated Anthropic's Sonnet 4.5 with our minimal agent. New record on SWE-bench verified: 70.6%! Same price/token as Sonnet 4, but takes more steps, ending up being more expensive. Cost analysis details & link to full trajectories in 🧵
4 replies · 14 reposts · 85 likes
To set up with ranger, create ~/.config/ranger/rifle.conf and put in something like:

ext json = jless "$1"
ext jsonl = jless "$1"
ext yaml = jless "$1"
ext yml = jless "$1"
else = "$EDITOR" -- "$@"

etc.
0 replies · 0 reposts · 0 likes
https://t.co/P8yZW0rh6Y Somewhat hard to find is the most useful keybinding: ps prints the current value with line breaks, which makes jless a lot more useful than most other tools, because configs/trajectories often contain these very long prompts.
jless.io
jless is a command-line JSON viewer designed for reading, exploring, and searching through JSON data.
1 reply · 0 reposts · 0 likes
jless is my new favorite command line tool. Super efficient for browsing yaml/json files with vim keybindings. Great for looking at agent configs and trajectories because long lines are collapsed by default. Also works amazingly with ranger for command line navigation! 🧵
1 reply · 0 reposts · 1 like
Really cool to see some of the methods from SWE-smith for synthetic training data generation being used here. And congrats to the @AIatMeta team on those amazing SWE-bench numbers with such a small model!
0 replies · 1 repost · 5 likes
Super excited to have @anyscalecompute use mini-swe-agent for their large scale runs! w/ @KLieret @_carlosejimenez @jyangballin
1 reply · 4 reposts · 16 likes
Do you find it challenging to run RL / agent simulations at a large scale (e.g. dealing with docker and remote execution)? Check out our blog post https://t.co/iNPivIzbc2 where we show how to do it with Ray and mini-swe-agent (kudos to @KLieret)
anyscale.com
Powered by Ray, Anyscale empowers AI builders to run and scale all ML and AI workloads on any cloud and on-prem.
0 replies · 7 reposts · 17 likes
You can find lots of other models evaluated under the same settings at https://t.co/sONyar3MGL (bash-only leaderboard). You can find our agent implementation at
github.com
The 100 line AI agent that solves GitHub issues or helps you in your command line. Radically simple, no huge configs, no giant monorepo—but scores >70% on SWE-bench verified! - SWE-agent/min...
0 replies · 0 reposts · 9 likes